yesthattom | Zip Code Optimization Algorithm

I have a list of every zip code in the US, and what state it appears in. I've been trying to shrink it down. For example, 07* and 08* are all in New Jersey. However, it's not just a 2-digit thing. 03 has to be broken down to 3 digits to figure out the state. 063 needs 4 digits, while the rest of 06 only needs 3 digits. It's pretty complicated, and the database changes regularly.

All last week I'd been toying with the idea of an optimizer. You'd feed it the complete list of zip codes and it would output the shortest list of regular expressions.

After battling too many false starts (including some algorithms that were embarrassingly wrong), I finally figured out that if I represent all the zip codes as a 10-way tree that is 5 levels deep, I could write a recursive algorithm that eliminates all children of a node if all of those children lead to the same state.
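The pruning idea can be sketched roughly as follows. This is not the original Perl, which the post does not show. It is a minimal Python sketch that assumes the "tree" is represented implicitly as a dictionary keyed by zip code prefix. The function name `collapse`, the toy `sample` data, and the choice to ignore unused zip codes when deciding whether a subtree is uniform are all assumptions made for illustration.

```python
from collections import defaultdict

def collapse(zip_to_state):
    """Return {prefix: state} using the shortest prefixes that are unambiguous.

    Assumption: zip codes that are not in use are "don't care", so a prefix
    collapses as soon as every in-use zip code below it maps to one state.
    """
    # For every prefix (length 0 to 5), collect the states found beneath it.
    # This dictionary stands in for the 10-way, 5-level tree.
    states_under = defaultdict(set)
    for zip_code, state in zip_to_state.items():
        for n in range(len(zip_code) + 1):
            states_under[zip_code[:n]].add(state)

    rules = {}

    def visit(prefix):
        states = states_under.get(prefix)
        if not states:
            return                      # no zip codes in use under this node
        if len(states) == 1:
            rules[prefix] = next(iter(states))  # whole subtree is one state
            return
        for digit in "0123456789":      # mixed subtree: descend one level
            visit(prefix + digit)

    visit("")
    return rules

# Toy data only, not the real database.
sample = {
    "07001": "NJ", "07002": "NJ", "08540": "NJ",
    "06101": "CT", "06320": "CT", "06390": "NY",
}
rules = collapse(sample)
```

Each resulting prefix can then be written as an anchored pattern such as `^07`, which is one way to get from this table to a list of regular expressions.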

The problem was that I didn't want to write trees in perl.

Then I realized that I didn't quite have to, and wrote the code in a short amount of time.

It turns out that out of 100,000 potential zip codes, there are 70,608 zip codes in use, and if you only want to know the state a zip code exists in, you only need 306 regular expressions.

So, you can have a huge database and do it in 1 hash lookup, or a small database and do it in about 2.5 lookups, or do a linear "case statement" and do it in (on average) 153 comparisons.
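The middle option, a small table with a few probes per lookup, might look something like the sketch below. It assumes the table maps zip code prefixes to states and that prefixes are tried from shortest to longest. The post does not say how the lookups are actually ordered, so `lookup_state`, the toy `RULES` table, and the probe counter are hypothetical. The figure of 153 for the linear version is presumably about half of the 306 entries, the expected number of comparisons when a match is equally likely anywhere in the list.

```python
# Toy prefix table (illustrative only, not the real 306 entries).
RULES = {"07": "NJ", "08": "NJ", "061": "CT", "0632": "CT", "0639": "NY"}

def lookup_state(zip_code, rules=RULES):
    """Probe the table with longer and longer prefixes of zip_code.

    Returns (state, probes), where probes is the number of hash lookups
    made. state is None if no prefix matches.
    """
    for n in range(1, len(zip_code) + 1):
        state = rules.get(zip_code[:n])
        if state is not None:
            return state, n
    return None, len(zip_code)

nj = lookup_state("07001")       # hits on the 2-digit prefix
ny = lookup_state("06390")       # needs the 4-digit prefix
missing = lookup_state("99999")  # no rule covers this code
```

Because no prefix in such a table is a prefix of another, the probe order affects only the number of lookups, not the answer.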

All this to avoid querying a SQL database that already exists ;-). However, for this application I really didn't want to have to recall how to speak to a SQL server and do a query. The application was very small and that would have doubled the size.

Creating a serious algorithm was amazingly satisfying. I haven't written code like that in years. It's so much different than, say, writing a web form or database application.



sweh.livejournal.com

I'm not sure what you are trying to optimise on. A plain text file of all zip codes and states (e.g. "12345 YY"), one per line, would be 900Kb. On a P3-800 running Linux a simple grep takes ~0.006 seconds (looping the grep 1000 times gives a "time" of approx 6 seconds, although admittedly it helped that the file was in the filesystem cache by that time :-)).

Executing a SQL query (even if you were already connected) is likely to be slower (and definitely slower if you had to connect to the database).

So what are you trying to optimise on?

yesthattom.livejournal.com

I didn't like the PHP script taking so long to compile what is, essentially, a 900kb hash initialization. Initializing a 300-line hash is much better-er.

You obviously know your application better than me, but given the time taken to call grep externally on a file and return the result, I'd probably have just used grep. Hmm, we can obviously optimise since you've said the first 4 digits are relevant in the worst case :-) Now the fork/exec time becomes visible and 1000 greps took 2.9 seconds.

*shrug*

At my level of experience with PHP, I'm not sure I want to open a file or fork/system off a grep (and considering this is for a hosted solution, I really didn't want to take many risks). Now if this were perl or Mason, I'd be fine.

The job is done now, and I had a wonderful time writing a little optimizer. Don't be a buzzkill :-)

--Tom

lovingboth

By just having an array of bytes:

Array [0..MaxZipCode] of StateNumbers ;

you could do it in less than 100k of RAM, with lookup times being instant.
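A rough Python equivalent of that byte array, as a sketch only: it assumes each state is given a small integer code and that 0 means "not in use". The `STATES` list here is a toy stand-in, and `set_state` and `get_state` are hypothetical helper names.

```python
# Index 0 means "no state assigned". A real list would hold every state.
STATES = ["??", "NJ", "CT", "NY"]

# One byte per possible 5-digit zip code: 100,000 bytes in total.
table = bytearray(100_000)

def set_state(zip_code, state):
    """Record the state for a zip code as a one-byte code."""
    table[int(zip_code)] = STATES.index(state)

def get_state(zip_code):
    """Look up a zip code with a single array index."""
    return STATES[table[int(zip_code)]]

set_state("07001", "NJ")
set_state("06390", "NY")
```

Lookups are a single index operation, but every in-use zip code still has to be loaded into the table before the first lookup.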

It would take longer to initialize it than to use :-)

If you are doing more than one lookup per invocation of the program then it might make sense to load the information into memory, but if there's a single run then you still have to get the zipcode data into the array in the first place and that's not instant (unless you are using a precompiled language with the array being static and initialised on definition of the array, but I was assuming some form of interpreted - or semi-interpreted - language like perl or ksh).

If you're only doing this once per program run, the time taken to invoke the program probably swamps everything else!

awfief.livejournal.com

However, for this application I really didn't want to have to recall how to speak to a SQL server and do a query.

Wow, there *are* things in the universe you don't know . . . that I do! wowie! :)

(My recent MySQL certification means more to me, somehow. . . though you still rock my socks off)

stormsweeper.livejournal.com

You could have gone really over the top and used XML-RPC to verify the address with the USPS's webtools. ;) Why do something the boring easy way when you can do it across the internet in XML?

madbodger.livejournal.com

That SO sounds like something I would do.

Let's get nekkid!

Yeah, algorithms get me hot :-)


