yesthattom | Zip Code Optimization Algorithm (Reply)

I have a list of every zip code in the US, and what state it appears in. I've been trying to shrink it down. For example, 07* and 08* are all in New Jersey. However, it's not just a 2-digit thing. 03 has to be broken down to 3 digits to figure out the state. 063 needs 4 digits, while the rest of 06 only needs 3 digits. It's pretty complicated, and the database changes regularly.

All last week I'd been toying with the idea of an optimizer. You'd feed it the complete list of zip codes and it would output the shortest list of regular expressions.

After battling too many false-starts (including some algorithms that were embarassingly wrong), I finally figured out that if I represent all the zipcodes as an 10-way tree that is 5-levels deep, I could do a recursive algorithm that would eliminate all children of a node, if all children lead to the same state.

The problem was that I didn't want to write trees in perl.

Then I realized that I didn't quite have to, and wrote the code in a short amount of time.

It turns out that out of 100,000 potential zip codes, there are 70,608 zip codes in use, and if you only want to know the state a zip code exists in, you only need 306 regular expressions.

So, you can have a huge database and do it in 1 hash lookup, or a small database and do it in about 2.5 lookups, or do a linear "case statement" and do it in (on average) 153 comparisons.

All this to avoid querying a SQL database that already exists ;-). However, for this application I really didn't want to have to recall how to speak to a SQL server and do a query. The application was very small and that would have doubled the size.

Creating a serious algorithm was amazing satisfying. I haven't written code like that in years. It so much different than, say, writing a web form or database application.