[ale] Riddle me this awk man

Greg Freemyer greg.freemyer at gmail.com
Mon Feb 21 19:07:13 EST 2011


On Thu, Feb 17, 2011 at 8:41 PM, Greg Freemyer <greg.freemyer at gmail.com> wrote:
> On Thu, Feb 17, 2011 at 8:15 PM, Geoffrey Myers
> <lists at serioustechnology.com> wrote:
>> Greg Freemyer wrote:
>>> It works in cygwin!!!!!!
>>>
>>> That may be a first for me.  A linux bug that does not exist in the
>>> cygwin version.
>>>
>>> It might be an awk vs. gawk thing.  I'll worry with it tomorrow.
>>
>> Then I'd say it's a bug in cygwin.
>
> I'll have to try it in openSUSE tomorrow.  If it matches cygwin, then
> Ubuntu is the loser.
>
> If it matches Ubuntu, then I guess you could say that cygwin fails to
> duplicate the linux bugs present in awk!
>
> I'm not sure I want that good an emulation!
>
> Greg

I just ran this on openSUSE 11.3.

It worked fine, i.e., one output line for each input line.

So it is an Ubuntu issue of some sort that is eating almost 400K lines
of data out of my expected 500+K lines of output.
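
For anyone who wants to put a number on the loss, here is a minimal
sketch in Python (not part of the original awk run) that counts newline
bytes in the input and output files and reports the difference. The
function names and the file paths passed in are placeholders, and it
assumes the script is meant to emit exactly one output line per input
line. Files are read in binary mode so that control characters do not
affect the count.

```python
def count_lines(path, chunk_size=1 << 20):
    """Count newline bytes in a file, reading in binary so odd bytes are harmless."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total


def missing_lines(input_path, output_path):
    """Return how many fewer lines the output has than the input.

    Assumes a one-output-line-per-input-line script, so 0 means nothing was lost.
    """
    return count_lines(input_path) - count_lines(output_path)
```

A result of 0 would match what openSUSE and cygwin produce; a large
positive number would match the Ubuntu behaviour described here.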

In openSUSE awk is a link to /bin/gawk.  In Ubuntu it is a link to mawk.
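
Since the answer depends on which binary "awk" really is, a small
sketch like the following can show what the name resolves to on a given
machine. It is written in Python only for illustration; resolve_awk is
a made-up helper name, and the symlink chain mentioned in the comment
is an assumption about a typical Debian-style alternatives setup, not
something taken from this thread.

```python
import os
import shutil


def resolve_awk(name="awk"):
    """Return the real file behind `name` on PATH, or None if it is not found."""
    path = shutil.which(name)
    if path is None:
        return None
    # Follow any symlink chain (for example awk -> alternatives -> mawk or gawk)
    # down to the actual executable.
    return os.path.realpath(path)


if __name__ == "__main__":
    print(resolve_awk())
```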

I also tried mawk on openSUSE, and it too gave me one line of output
per line of input.

Looking at a diff between the two outputs, it appears there are some
control chars in the input data set.  I can understand Ubuntu
mishandling those lines, but it apparently just goes bonkers.  At
first it just drops 10 or so output lines for each input line with
control chars.

But at line 174130 it just dies.
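
To see exactly which input lines carry control characters (the ^@ in
the diff further down is a NUL byte), a rough sketch along these lines
could list them. It is Python rather than awk so that the scan itself
does not depend on the awk under suspicion. The function name and the
choice to treat tab, LF and CR as harmless are assumptions made for
this example.

```python
# Bytes below 0x20 other than tab, LF and CR, plus DEL (0x7f).
CONTROL = frozenset(b for b in range(0x20) if b not in (0x09, 0x0A, 0x0D)) | {0x7F}


def lines_with_control_bytes(path):
    """Return a list of (line_number, sorted control byte values) for affected lines."""
    hits = []
    with open(path, "rb") as f:
        # Iterating a binary file splits on b"\n", so NUL bytes stay inside the line.
        for lineno, line in enumerate(f, start=1):
            found = sorted({b for b in line if b in CONTROL})
            if found:
                hits.append((lineno, found))
    return hits
```

Comparing the reported line numbers with the lines missing from the
Ubuntu output would show whether the dropped lines really line up with
the control characters or not.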

Here's an intriguing part of the diff between the output file on
openSUSE and the same on Ubuntu that shows the line that finally did
Ubuntu's mawk in.

Remember that awk on openSUSE and cygwin handles this data in a more
or less sane way.

Also, looking visually at the first few hundred of the missing lines,
only a handful of them have control chars in them.

174207,541986c174130
< 15-Aug-2007 14:41:14,0,macb,0,0,0,73800,[PDF Metadata]
(creationdate) User: þÿ^@k^@i^@m File created. Title :
(þÿ^@I^@n^@t^@u^@i^@t^@_^@Q^@B^@O^@B^@_^@I^@n^@t^@e^@r^@n^@a^@l^@.^@p^@d^@f)
Author: [þÿ^@k^@i^@m] Creator: [PFU ScanSnap Manager 4.0.11]
produced by: [Adobe PDF Scan Library 2.1] (file:
/mnt/windows7_mount//Documents and Settings/Administrator/Local
Settings/Temp/_tmpAT/attFEFB.tmp)
< 15-Aug-2007 15:11:02,0,macb,0,0,0,67976,[PDF Metadata]
(creationdate) User: Olde English Manor File created. Title :
(OEM-Rent Schedule-7-26-07.xls) Author: [Olde English Manor] Creator:
[Acrobat PDFMaker 7.0.7 for Excel] produced by: [Acrobat Distiller
7.0.5 /(Windows/] (file: /mnt/windows7_mount//Documents and
Settings/Administrator/Local Settings/Temp/_tmpAT/att4111.tmp)
< 15-Aug-2007 15:11:04,0,macb,0,0,0,67976,[PDF Metadata] (moddate)
User: Olde English Manor File modified. Title : (OEM-Rent
Schedule-7-26-07.xls) Author: [Olde English Manor] Creator: [Acrobat
PDFMaker 7.0.7 for Excel] produced by: [Acrobat Distiller 7.0.5
/(Windows/] (file: /mnt/windows7_mount//Documents and
Settings/Administrator/Local Settings/Temp/_tmpAT/att4111.tmp)
< 15-Aug-2007 15:29:09,0,macb,0,0,0,73800,[PDF Metadata] (moddate)
User: þÿ^@k^@i^@m File modified. Title :
(þÿ^@I^@n^@t^@u^@i^@t^@_^@Q^@B^@O^@B^@_^@I^@n^@t^@e^@r^@n^@a^@l^@.^@p^@d^@f)
Author: [þÿ^@k^@i^@m] Creator: [PFU ScanSnap Manager 4.0.11]
produced by: [Adobe PDF Scan Library 2.1] (file:
/mnt/windows7_mount//Documents and Settings/Administrator/Local
Settings/Temp/_tmpAT/attFEFB.tmp)
< 15-Aug-2007 16:16:22,0,.acb,0,0,0,9630,[Internet Explorer] (Content
viewed/Content saved to drive)
URL:http://z-ecx.images-amazon.com/images/G/01/digital/sitb/js/prototype.1187147005._V28380147_.js
cache stored in: R7MT4AO6/prototype.1187147005._V28380147_[2].js -
HTTP/1.1 200 OK - Content-Length: 39057 - Content-Type:
application/x-javascript (file: /mnt/windows7_mount//Documents and
Settings/Administrator/Local Settings/Temporary Internet
Files/Content.IE5/index.dat)
< 15-Aug-2007 18:25:13,0,macb,0,0,0,67973,[PDF Metadata]
(creationdate) User: admin File created. Title : (Microsoft Word -
OEM-Owners Report-7-31-07.doc) Author: [admin] Creator: [PScript5.dll
Version 5.2.2] produced by: [Acrobat Distiller 7.0.5 /(Windows/]
(file: /mnt/windows7_mount//Documents and Settings/Administrator/Local
Settings/Temp/_tmpAT/att

If someone thinks this is worth pursuing, I can send them the first
10,000 lines of data from the original input file.

I think Ubuntu's mawk drops only 36 lines of that in the output, so
it's a more manageable problem.
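
A reduced test file like that needs to be cut byte-for-byte, so the
control characters survive intact. One possible sketch, in Python with
a hypothetical head_lines helper and placeholder paths:

```python
from itertools import islice


def head_lines(src, dst, n=10000):
    """Copy the first n lines of src to dst unchanged; return the number copied."""
    written = 0
    # Binary mode on both sides so NULs and other odd bytes are preserved.
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for line in islice(fin, n):
            fout.write(line)
            written += 1
    return written
```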

Even though I'm pretty sure there is nothing proprietary in that
dataset, and definitely there is no client data, I still don't want to
see it posted somewhere public like a bugzilla would be.

Greg


