[ale] [Sorta-OT] Some interesting stats, and a lesson...
Michael B. Trausch
mike at trausch.us
Thu Oct 14 14:00:35 EDT 2010
So today I needed to generate a list of invalid SSNs for the purpose of
creating a testing database with some data in it that includes SSN data.
After 2011, the government is widening the range of allowed SSNs, so
that they only need to satisfy the following invariants:
* Numbers starting with 000, 666, and 900-999 are invalid.
* Numbers with only zeroes in any segment are invalid.
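
The two rules above can be sketched as a single C99 predicate. This is only an illustration, not the program from this post: the function name and the area/group/serial parameter names are assumptions, and the field widths are borrowed from the record layout described further down.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not from the original post): returns true when the
 * area-group-serial triple breaks either of the two rules listed above.
 * Assumes area <= 999, group <= 99, and serial <= 9999. */
bool is_invalid_ssn(uint16_t area, uint8_t group, uint16_t serial)
{
    assert(area <= 999 && group <= 99 && serial <= 9999);

    /* Rule 1: numbers starting with 000, 666, or 900-999 are invalid. */
    if (area == 0 || area == 666 || area >= 900)
        return true;

    /* Rule 2: a segment consisting only of zeroes is invalid.
     * (An all-zero area was already caught by rule 1.) */
    if (group == 0 || serial == 0)
        return true;

    return false;
}
```

Three nested loops over area, group, and serial that print every triple for which this returns true would enumerate the list.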
So, I thought I would use the shell to generate this list. After all,
it's just three loops and a printf command, should be pretty fast,
right?
Nope. After waiting 45 minutes or so, I decided to write a C program to
do it. Twice, because I wanted to see the difference between the size
of the data file in an unpacked binary format vs. a formatted ASCII
format. The shell command still wasn't finished when I got the results
from the C programs, so I killed it.
It took 14.963 seconds wall-clock time for a C program to generate a
binary file with every invalid SSN in it, and that file is 555,349,000
bytes long (each "record" is fixed-length, a uint16_t, a uint8_t, and a
uint16_t). Of course, if I were going to use this in a real
application, I would probably pack the values better and eliminate a lot
of the redundant data, ensuring that the file were ordered, indexed, and
so forth, so it could potentially be a lot smaller. Note that this does
not include _every_ potential invalid
It took 50.945 seconds (again, wall-clock time) for a C program to
generate a formatted ASCII file with every such number in it, and that
file is 1,332,837,600 bytes long. Again, wildly inefficient: it's just
a list of "%03d-%02d-%04d\n" entries.
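
For reference, the two record formats described above could be written roughly as follows. This is a sketch under assumptions, not the original programs: the function names are made up, the binary fields are written in host byte order (the post does not say which order was used), and pack_ssn is just one guess at what "pack the values better" might look like. The sizes are consistent with each other: 555,349,000 / 5 and 1,332,837,600 / 12 both come to 111,069,800 records.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical writer: emits the fields one at a time (2 + 1 + 2 bytes) so
 * that no struct padding ends up in the file. Host byte order is assumed.
 * Returns 1 on success, 0 on a short write. */
int write_binary_record(FILE *out, uint16_t area, uint8_t group, uint16_t serial)
{
    return fwrite(&area, sizeof area, 1, out) == 1
        && fwrite(&group, sizeof group, 1, out) == 1
        && fwrite(&serial, sizeof serial, 1, out) == 1;
}

/* Formats one "AAA-GG-SSSS\n" line into buf, which needs room for at least
 * 13 bytes (12 characters plus the terminating NUL). Returns snprintf's
 * result, i.e. 12 for in-range fields. */
int format_ascii_record(char *buf, size_t buflen,
                        uint16_t area, uint8_t group, uint16_t serial)
{
    return snprintf(buf, buflen, "%03d-%02d-%04d\n", area, group, serial);
}

/* One possible tighter packing (an assumption, not from the post): the nine
 * decimal digits as a single integer. The largest value, 999,999,999, fits
 * in 32 bits, so each record shrinks from 5 bytes to 4. */
uint32_t pack_ssn(uint16_t area, uint8_t group, uint16_t serial)
{
    assert(area <= 999 && group <= 99 && serial <= 9999);
    return (uint32_t)area * 1000000u + (uint32_t)group * 10000u + serial;
}
```

At 4 bytes per record, the same 111,069,800 records would take 444,279,200 bytes before any ordering or indexing tricks.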
The lesson, of course, is to not use the shell to do a job that would be
much better suited to a simple C99 program...
Honestly, I would not have expected this result. Even though I know
that it takes the shell 7 children to do what I asked it to do, I
figured that I'd be able to generate the sample dataset in a reasonable
amount of time. I guess I was wrong. And the shell command wasn't even
generating as comprehensive a list. It was omitting the invalid numbers
that could be generated due to the second rule listed above.
--- Mike