[ale] Riddle me this awk man

Greg Freemyer greg.freemyer at gmail.com
Mon Feb 21 19:54:58 EST 2011


All, I sent Richard a 10,000 line sample.

Greg

On Mon, Feb 21, 2011 at 7:12 PM, Richard Bronosky <Richard at bronosky.com> wrote:
> If it's not sensitive information, I'd love to get my hands on a
> gzipped tar with the original file, the script, the outputs, etc. I
> really drill into this stuff. I use awk every day of my life. I would
> like to know if there are issues.
>
> On Mon, Feb 21, 2011 at 7:07 PM, Greg Freemyer <greg.freemyer at gmail.com> wrote:
>> On Thu, Feb 17, 2011 at 8:41 PM, Greg Freemyer <greg.freemyer at gmail.com> wrote:
>>> On Thu, Feb 17, 2011 at 8:15 PM, Geoffrey Myers
>>> <lists at serioustechnology.com> wrote:
>>>> Greg Freemyer wrote:
>>>>> It works in cygwin!!!!!!
>>>>>
>>>>> That may be a first for me.  A linux bug that does not exist in the
>>>>> cygwin version.
>>>>>
>>>>> It might be an awk vs. gawk thing.  I'll worry with it tomorrow.
>>>>
>>>> Then I'd say it's a bug in cygwin.
>>>
>>> I'll have to try it in openSUSE tomorrow.  If it matches cygwin, then
>>> Ubuntu is the loser.
>>>
>>> If it matches Ubuntu, then I guess you could say that cygwin fails to
>>> duplicate the linux bugs present in awk!
>>>
>>> I'm not sure I want that good of an emulation!
>>>
>>> Greg
>>
>> I just ran this on openSUSE 11.3.
>>
>> It worked fine, i.e., one output line for each input line.
>>
>> So it is an Ubuntu issue of some sort that is eating almost 400K lines
>> of data out of my expected 500+K lines of output.
>>
>> In openSUSE awk is a link to /bin/gawk.  In Ubuntu it is a link to mawk.
>>
>> But I also tried mawk from openSUSE and it also gave me one line of
>> output per line of input.
>>
>> Looking at a diff between the two outputs it appears there are some
>> control chars in the input data set.  I can understand Ubuntu
>> mishandling those lines, but it apparently just goes bonkers.  At
>> first it just drops 10 or so output lines for each input line with
>> control chars.
>>
>> But at line 174130 it just dies.
>>
>> Here's an intriguing part of the diff between the output file on
>> openSUSE and the same on Ubuntu that shows the line that finally did
>> Ubuntu's mawk in.
>>
>> Remember awk on openSUSE and cygwin are handling this data in at least
>> a more or less sane way.
>>
>> Also, looking visually at the first few hundred of the missing lines,
>> only a handful of them have control chars in them.
>>
>> 174207,541986c174130
>> < 15-Aug-2007 14:41:14,0,macb,0,0,0,73800,[PDF Metadata]
>> (creationdate) User: þÿ^@k^@i^@m File created. Title :
>> (þÿ^@I^@n^@t^@u^@i^@t^@_^@Q^@B^@O^@B^@_^@I^@n^@t^@e^@r^@n^@a^@l^@.^@p^@d^@f)
>> Author: [þÿ^@k^@i^@m] Creator: [PFU ScanSnap Manager 4.0.11]
>> produced by: [Adobe PDF Scan Library 2.1] (file:
>> /mnt/windows7_mount//Documents and Settings/Administrator/Local
>> Settings/Temp/_tmpAT/attFEFB.tmp)
>> < 15-Aug-2007 15:11:02,0,macb,0,0,0,67976,[PDF Metadata]
>> (creationdate) User: Olde English Manor File created. Title :
>> (OEM-Rent Schedule-7-26-07.xls) Author: [Olde English Manor] Creator:
>> [Acrobat PDFMaker 7.0.7 for Excel] produced by: [Acrobat Distiller
>> 7.0.5 /(Windows/] (file: /mnt/windows7_mount//Documents and
>> Settings/Administrator/Local Settings/Temp/_tmpAT/att4111.tmp)
>> < 15-Aug-2007 15:11:04,0,macb,0,0,0,67976,[PDF Metadata] (moddate)
>> User: Olde English Manor File modified. Title : (OEM-Rent
>> Schedule-7-26-07.xls) Author: [Olde English Manor] Creator: [Acrobat
>> PDFMaker 7.0.7 for Excel] produced by: [Acrobat Distiller 7.0.5
>> /(Windows/] (file: /mnt/windows7_mount//Documents and
>> Settings/Administrator/Local Settings/Temp/_tmpAT/att4111.tmp)
>> < 15-Aug-2007 15:29:09,0,macb,0,0,0,73800,[PDF Metadata] (moddate)
>> User: þÿ^@k^@i^@m File modified. Title :
>> (þÿ^@I^@n^@t^@u^@i^@t^@_^@Q^@B^@O^@B^@_^@I^@n^@t^@e^@r^@n^@a^@l^@.^@p^@d^@f)
>> Author: [þÿ^@k^@i^@m] Creator: [PFU ScanSnap Manager 4.0.11]
>> produced by: [Adobe PDF Scan Library 2.1] (file:
>> /mnt/windows7_mount//Documents and Settings/Administrator/Local
>> Settings/Temp/_tmpAT/attFEFB.tmp)
>> < 15-Aug-2007 16:16:22,0,.acb,0,0,0,9630,[Internet Explorer] (Content
>> viewed/Content saved to drive)
>> URL:http://z-ecx.images-amazon.com/images/G/01/digital/sitb/js/prototype.1187147005._V28380147_.js
>> cache stored in: R7MT4AO6/prototype.1187147005._V28380147_[2].js -
>> HTTP/1.1 200 OK - Content-Length: 39057 - Content-Type:
>> application/x-javascript (file: /mnt/windows7_mount//Documents and
>> Settings/Administrator/Local Settings/Temporary Internet
>> Files/Content.IE5/index.dat)
>> < 15-Aug-2007 18:25:13,0,macb,0,0,0,67973,[PDF Metadata]
>> (creationdate) User: admin File created. Title : (Microsoft Word -
>> OEM-Owners Report-7-31-07.doc) Author: [admin] Creator: [PScript5.dll
>> Version 5.2.2] produced by: [Acrobat Distiller 7.0.5 /(Windows/]
>> (file: /mnt/windows7_mount//Documents and Settings/Administrator/Local
>> Settings/Temp/_tmpAT/att
>>
>> If someone thinks this is worth pursuing, I can send them the first
>> 10,000 lines of data from the original input file.
>>
>> Ubuntu's mawk drops only 36 lines of that in the output, I think.  So
>> it's a more manageable problem.
>>
>> Even though I'm pretty sure there is nothing proprietary in that
>> dataset, and definitely there is no client data, I still don't want to
>> see it posted somewhere public like a bugzilla would be.
>>
>> Greg
>
>
>
> --
> .!# RichardBronosky #!.
>
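
For anyone who wants to narrow this down on their own data: the quoted
message notes that the input contains control chars (the ^@ in the diff
is a NUL byte) and that the output should have one line per input line.
Below is a minimal Python sketch, not part of the original script, that
lists which input lines contain control bytes and counts newline-terminated
lines so the input and output totals can be compared. The function names
and the sample bytes are made up for illustration, and treating tab and
CR as harmless is an assumption.

```python
import re

# Bytes below 0x20 other than tab, LF and CR, plus DEL. NUL (0x00) is included.
# Which bytes count as "control chars" here is an assumption for illustration.
CONTROL_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def lines_with_control_bytes(data: bytes):
    """Return the 1-based numbers of lines that contain control bytes."""
    hits = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        if CONTROL_BYTES.search(line):
            hits.append(number)
    return hits


def count_lines(data: bytes) -> int:
    """Count newline-terminated lines, the same thing `wc -l` counts."""
    return data.count(b"\n")


if __name__ == "__main__":
    # Hypothetical sample: the middle line mimics the UTF-16 text with
    # embedded NULs seen in the diff.
    sample = b"plain line\nUser: \xfe\xff\x00k\x00i\x00m\nanother plain line\n"
    print(lines_with_control_bytes(sample))
    print(count_lines(sample))
```

Reading the real files with open(path, "rb").read() and comparing
count_lines() for the input against the awk output would show how many
lines went missing; the line numbers from lines_with_control_bytes() can
then be checked against where the drops start.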



-- 
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
CNN/TruTV Aired Forensic Imaging Demo -
   http://insession.blogs.cnn.com/2010/03/23/how-computer-evidence-gets-retrieved/

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com


