[ale] coding practices
Michael B. Trausch
mike at trausch.us
Wed Mar 3 23:45:49 EST 2010
On 03/03/2010 05:31 PM, Jim Kinney wrote:
> Third party also wants the total row count in the data file appended to
> the data file.
>
> This is where I disagree. I would far rather put the additional data in
> the done file and not alter the output in any way.
>
> Granted, adding the row count is trivial (wc -l filename >> filename)
> and that last line will be nothing like the actual data lines. It does
> make reprocessing the data files more complicated as they have to be
> checked for the presence of the row count on the last line before
> rerunning the import process again.
>
> Other views?
I don't like that idea at all. If I am writing something to process
data, I treat the data in question as immutable. I would expect that
they should have no problem using a metadata file.
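
A minimal sketch of the metadata-file approach, in Python. The ".done" suffix echoes the "done file" mentioned in the quoted message; the JSON layout and the function names are assumptions made for illustration, not anything the third party has specified. The data file is only ever opened for reading.

```python
import json
from pathlib import Path


def write_sidecar(data_path):
    """Record the row count in '<name>.done' next to the data file."""
    data_path = Path(data_path)
    # Stream the file and count lines; a final line without a trailing
    # newline is still counted as a row.
    with data_path.open("rb") as f:
        rows = sum(1 for _ in f)
    # Hypothetical naming convention: export.csv -> export.csv.done
    meta_path = data_path.with_name(data_path.name + ".done")
    meta = {"file": data_path.name, "rows": rows}
    meta_path.write_text(json.dumps(meta) + "\n", encoding="utf-8")
    return meta_path


def read_row_count(data_path):
    """Read the row count back from the sidecar file."""
    data_path = Path(data_path)
    meta_path = data_path.with_name(data_path.name + ".done")
    return json.loads(meta_path.read_text(encoding="utf-8"))["rows"]
```

Because the data file is never rewritten, a re-import can run against it without first checking for a trailing count line.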
If it absolutely _must_ be in the same file, I would consider using a
structured storage file format (essentially, a dynamically growing,
file-oriented pseudo-filesystem) of some sort, or a ZIP file. I would
lean towards the former, as opposed to the latter, though, because of
the lower amount of overhead incurred to read and update the contents of
a structured storage file. Of course, neither of those options leaves
you with a single file that is plain-text.
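
For the ZIP variant, a rough sketch using Python's standard zipfile module. The member name "metadata.json" and the function names are assumptions for illustration. Compression is turned off (ZIP_STORED) to keep read and write overhead down, and the data bytes go into the archive unmodified.

```python
import json
import zipfile
from pathlib import Path


def bundle(data_path, zip_path):
    """Store the untouched data file plus a metadata member in one ZIP."""
    data_path = Path(data_path)
    with data_path.open("rb") as f:
        rows = sum(1 for _ in f)
    meta = {"data": data_path.name, "rows": rows}
    # ZIP_STORED means no compression, so members are stored byte for byte.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.write(data_path, arcname=data_path.name)
        # "metadata.json" is a hypothetical member name.
        zf.writestr("metadata.json", json.dumps(meta))


def unbundle(zip_path):
    """Return (data_bytes, metadata_dict) from a bundle."""
    with zipfile.ZipFile(zip_path) as zf:
        meta = json.loads(zf.read("metadata.json"))
        data = zf.read(meta["data"])
    return data, meta
```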
I can think of a few viable options. You could use DBM files, assuming
that they permit arbitrarily-sized values for a given key. You could
also use XML, if you're not allergic to it. I'd probably not, myself,
since most data has to be transformed in order to be embedded properly
in XML.
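
The DBM idea might look roughly like this with Python's standard dbm module. The key names "data" and "rows" are made up for illustration, and whether large values are accepted depends on which DBM backend the platform provides, so treat this as a sketch under that assumption.

```python
import dbm
from pathlib import Path


def store(db_path, data_path):
    """Put the raw data and its row count under separate keys."""
    payload = Path(data_path).read_bytes()
    # Count newline characters, which is what `wc -l` reports.
    rows = payload.count(b"\n")
    with dbm.open(str(db_path), "c") as db:  # "c": create if missing
        db[b"data"] = payload  # hypothetical key names
        db[b"rows"] = str(rows).encode("ascii")


def load(db_path):
    """Return (data_bytes, row_count) from the database."""
    with dbm.open(str(db_path), "r") as db:
        return db[b"data"], int(db[b"rows"])
```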
You could use something like a compound document format, though I'd say
that the best choice in this case would likely be a lightweight,
home-brew container that avoids compression and anything else that adds
overhead. The added bonus there is that you could treat it as a
primitive stream- or block-oriented key/value store and add additional
metadata to the file if needed.
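
One way such a home-brew container could be laid out, purely as an illustration: each record is a fixed-size header (key length, value length) followed by the key and the value, with no compression. The layout, the field widths, and the function names are all assumptions for the sketch, not an existing format.

```python
import struct

# Record header: 2-byte key length, 8-byte value length, big-endian.
HEADER = struct.Struct(">HQ")


def append_record(path, key, value):
    """Append one key/value record; existing records are never rewritten."""
    key_bytes = key.encode("utf-8")
    with open(path, "ab") as f:
        f.write(HEADER.pack(len(key_bytes), len(value)))
        f.write(key_bytes)
        f.write(value)


def read_records(path):
    """Read all records into a dict; a later record for a key wins."""
    records = {}
    with open(path, "rb") as f:
        while True:
            header = f.read(HEADER.size)
            if not header:
                break  # clean end of file
            if len(header) < HEADER.size:
                raise ValueError("truncated record header")
            key_len, value_len = HEADER.unpack(header)
            key = f.read(key_len)
            value = f.read(value_len)
            if len(key) < key_len or len(value) < value_len:
                raise ValueError("truncated record body")
            records[key.decode("utf-8")] = value
    return records
```

Under this layout the original data goes in once as one record, and metadata such as a row count can be appended later as additional records without touching the data bytes.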
Of course, whether or not you can do any of those things is up to the
people you're working for. I would at the _very_ least force the issue
that the original data should be immutable and that there should be some
other means of storing the metadata. What matters most is that there is
an agreed convention for doing it; the technical details of how it is
done matter much less.
--- Mike
--
Michael B. Trausch ☎ (404) 492-6475