You ever have to deal with a database full of production data, and you have to write code that works with all of it? Or maybe you need to test to make sure that someone else’s code works with all of it? If you have millions, or hundreds of millions of records, it can be very daunting to figure out “Have I identified all of the use cases that apply?”
I first ran into this problem when working for TransUnion. They had a data set of 375 million records, give or take, that needed to be transformed and loaded into a database with a particular schema. The objective was to be able to use our search mechanism to find the records after they had been loaded. Sounds simple enough.
I started with the last name. What are all of the different last names I needed to worry about? I wrote a script to parse out the data set and build a histogram of the most common names. As to be expected, “Smith” came out on top!
But there were, if memory serves, about 64 million distinct values. Wow! Way too many to wrap my brain around. I had to simplify. To shorten the list. But how?
The main thing I was concerned about was punctuation. How does the punctuation differ between the records? Some names are simple, single word tokens, like Smith, Jones, Johnson, etc. Others are hyphenated, like Zeta-Jones or Smith-Barney. Others have multiple tokens, like De La Joya. But what else is lurking in that list of 64 million names that I had to worry about?
My solution was to write code to “fold” names that had a lot in common into the same value. I started by processing the histogram of 64 million names, and transforming the name by converting every upper-case letter to an ‘X’, and every lower-case letter to an ‘x’, and every digit to a ’9′. Then, I built a histogram of those. This was very instructive, but I saw that there were still lots of names that were clearly functionally equivalent. I decided that a series of the same character, 4 or more in a row, should be considered functionally equivalent, regardless of how many there were past 4. So I added just a little bit more code to my script, and the number of buckets dropped dramatically. Looking at the histogram, I could begin to see more and more relevant patterns that I had to take into consideration.
The actual work continued on from there, making the analysis and “folding” code more and more complex. But the essence of it is still the same: Transform the value into another value such that two functionally equivalent input values have the same output value. Or in other words, partition the data into Equivalence Classes. Use the class as the key in a histogram, and see what drops out. Repeat until comfortable that the proper categories have been identified. Then, the function that was used to make the transformation is a very concise record of some of the kinds of things to pay attention to in the data. And the histogram (with original values preserved so I knew which input yielded which bucket key) provides good test/sample data.
This is a principle that I have found opportunity to use time and time again. I’ve used it to identify patterns in strings, to strip out uniqueness from highly variable log output so it becomes more standardized and therefore more readily subject to statistical analysis. I’ve used it to identify unique combinations and/or permutations of record field values (more on that another time, maybe). Basically, since that day, there has hardly been a job that I’ve taken on where I didn’t have an opportunity to use this basic principle to help me do a better job of coding for or testing with real-world data.
Interested in learning more? I gave a talk to the Software Association of Oregon monthly meeting on the subject, for which I prepared some slides. Or, post a comment, and I’ll answer you directly.