yesthattom: (2003blurry)
[personal profile] yesthattom
I have a list of every zip code in the US, and what state it appears in. I've been trying to shrink it down. For example, 07* and 08* are all in New Jersey. However, it's not just a 2-digit thing. 03 has to be broken down to 3 digits to figure out the state. 063 needs 4 digits, while the rest of 06 only needs 3 digits. It's pretty complicated, and the database changes regularly.

All last week I'd been toying with the idea of an optimizer. You'd feed it the complete list of zip codes and it would output the shortest list of regular expressions.

After battling too many false-starts (including some algorithms that were embarassingly wrong), I finally figured out that if I represent all the zipcodes as an 10-way tree that is 5-levels deep, I could do a recursive algorithm that would eliminate all children of a node, if all children lead to the same state.

The problem was that I didn't want to write trees in perl.

Then I realized that I didn't quite have to, and wrote the code in a short amount of time.

It turns out that out of 100,000 potential zip codes, there are 70,608 zip codes in use, and if you only want to know the state a zip code exists in, you only need 306 regular expressions.

So, you can have a huge database and do it in 1 hash lookup, or a small database and do it in about 2.5 lookups, or do a linear "case statement" and do it in (on average) 153 comparisons.

All this to avoid querying a SQL database that already exists ;-). However, for this application I really didn't want to have to recall how to speak to a SQL server and do a query. The application was very small and that would have doubled the size.

Creating a serious algorithm was amazing satisfying. I haven't written code like that in years. It so much different than, say, writing a web form or database application.

Date: 2004-02-09 09:37 am (UTC)
From: [identity profile] sweh.livejournal.com
I'm not sure what you are trying to optimise on. A plain text file of all zip codes and state (eg "12345 YY"), one per line would be 900Kb. On a P3-800 running Linux a simple grep takes ~0.006 seconds (looping the grep 1000 times give a "time" of approx 6 seconds, although admittedly it helped that the file was in the filesystem cache by that time :-)).

Executing a SQL query (even if you were already connected) is likely to be slower (and definitely slower if you had to connect to the database).

So what are you trying to optimise on?

Re:

Date: 2004-02-09 09:54 am (UTC)
From: [identity profile] yesthattom.livejournal.com
I didn't like the PHP script taking so long to compile what is, essentially, a 900kb hash initialization. Initializing a 300-line hash is much better-er.

Re:

Date: 2004-02-09 10:05 am (UTC)
From: [identity profile] sweh.livejournal.com
You obviously know your application better than me, but given the time taken to call grep externally on a file and return the result, I'd probably have just used grep. Hmm, we can obviously optimise since you've said the first 4 digits are relevant in the worst case :-) Now the fork/exec time becames visible and 1000 greps took 2.9 seconds.

*shrug*

Re:

Date: 2004-02-09 10:14 am (UTC)
From: [identity profile] yesthattom.livejournal.com
At my level of experience with PHP, I'm not sure I want to open a file or fork/system off a grep (and considering this is for a hosted solution, I really didn't want to take many risks). Now if this were perl or Mason, I'd be fine.

The job is done now, and I had a wonderful time writing a little optimizer. Don't be a buzzkill :-)

--Tom

Re:

Date: 2004-02-09 10:33 am (UTC)
lovingboth: (Default)
From: [personal profile] lovingboth
By just having an array of bytes:

Array [0..MaxZipCode] of StateNumbers ;

you could do it in less than 100k of RAM, with lookup times being instant.

Re:

Date: 2004-02-09 10:36 am (UTC)
From: [identity profile] yesthattom.livejournal.com
It would take longer to initialize it than to use :-)

Re:

Date: 2004-02-09 10:50 am (UTC)
From: [identity profile] sweh.livejournal.com
If you are doing more than one lookup per invocation of the program then it might make sense to load the information into memory, but if there's a single run then you still have to get the zipcode data into the array in the first place and that's not instant (unless you are using a precompiled language with the array being static and initialised on definition of the array, but I was assuming some form of interpreted - or semi-interpreted - language like perl or ksh).

Re:

Date: 2004-02-09 02:13 pm (UTC)
lovingboth: (Default)
From: [personal profile] lovingboth
If you're only doing this once per program run, the time taken to invoke the program probably swamps everything else!

Date: 2004-02-09 10:21 am (UTC)
From: [identity profile] awfief.livejournal.com
However, for this application I really didn't want to have to recall how to speak to a SQL server and do a query.

Wow, there *are* things in the universe you don't know . . . that I do! wowie! :)

(My recent MySQL certification means more to me, somehow. . . though you still rock my socks off)

Date: 2004-02-09 01:37 pm (UTC)
From: [identity profile] stormsweeper.livejournal.com
You could have gone relaly over the top and used XML-RPC to verify the address with the USPS's webtools. ;) Why do something the boring easy way when you can do it across the internet in XML?

You rock, dude!

Date: 2004-02-11 04:20 pm (UTC)
From: [identity profile] madbodger.livejournal.com
That SO sounds like something I would do.

Let's get nekkid!

Re: You rock, dude!

Date: 2004-02-12 06:39 am (UTC)
From: [identity profile] yesthattom.livejournal.com
Yeah, algorithms get me hot :-)

December 2015

S M T W T F S
  12345
6789 101112
13141516171819
202122 23242526
2728293031  

Most Popular Tags

Style Credit

Expand Cut Tags

No cut tags
Page generated Jan. 7th, 2026 07:45 am
Powered by Dreamwidth Studios