[ale] Dealing with really big log files....
Jeff Hubbs
jeffrey.hubbs at gmail.com
Sun Mar 22 12:55:19 EDT 2009
Just some thoughts thrown out...
A 114GiB log file certainly will compress like mad, either via gzip or
bzip2 - the former is faster to compute; the latter generally gives
smaller output. Once you've done that and pulled over the compressed
copy for local use, use rsync -z to keep your local copy synced to the
server's.
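Something along these lines, for instance (paths and host name are
placeholders):

  # one-time: compress a copy server-side, pull it over, unpack it
  gzip -c /var/log/mysql/mysql.log > /tmp/mysql.log.gz
  scp server:/tmp/mysql.log.gz . && gunzip mysql.log.gz

  # afterwards: keep the local copy in step with the server's,
  # compressing the traffic in transit
  rsync -avz --partial server:/var/log/mysql/mysql.log ./mysql.log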
You might want to consider loading the whole schmeer into an RDBMS
locally for further analysis.
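If you go that route, sqlite3 is probably the lowest-friction choice.
A rough sketch, assuming you've already parsed the log into
pipe-separated timestamp/query records (file and column names are
made up):

  sqlite3 querylog.db 'CREATE TABLE entries(ts TEXT, thread INTEGER, cmd TEXT, arg TEXT);'
  sqlite3 querylog.db '.import parsed.psv entries'
  sqlite3 querylog.db "SELECT count(*) FROM entries WHERE ts LIKE '090322%';"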
Kenneth Ratliff wrote:
>
> On Mar 22, 2009, at 10:15 AM, Greg Freemyer wrote:
>
>
>> If you have the disk space and few hours to let it run, I would just
>> "split" that file into big chinks. Maybe a million lines each.
>>
>
> Well, I could just sed the range of lines I want out in about the same
> amount of time, and keep the result in a single log file, which is my
> preference. I've got about 400 gigs of space left on the disk, so I
> have some room. I don't really care about the data that comes before;
> that should have been vaporized to the ether long ago. I just need to
> isolate the section of the log I do want so I can parse it and give an
> answer to a customer.
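> Something along these lines, with line numbers and output path made up:
>
>   sed -n '200000000,250000000p' mysql.log > /data/november_chunk.log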
>
>
>> I'd recommend the source and destination of your split command be on
>> different physical drives if you can manage it. Even if that means
>> connecting up an external USB drive to hold the split files.
>>
>
> Not a machine I have physical access to, sadly. I'd love to have a
> local copy to play with and leave the original intact on the server,
> but pulling 114 gigs across a transatlantic link is not really an
> option at the moment.
>
>
>> If you don't have the disk space, you could try something like:
>>
>> head -2000000 my_log_file | tail -50000 > /tmp/my_chunk_of_interest
>>
>> Or grep has an option to grab lines before and after a line that has
>> the pattern in it.
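>> For example (pattern and context counts made up):
>>
>>   grep -B 100 -A 100 'some pattern' my_log_file > /tmp/my_chunk_of_interest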
>>
>> Hopefully one of those 3 will work for you.
>>
>
> mysql's log file is very annoying in that it doesn't lend itself to
> easy grepping by line count. It doesn't time stamp every entry; it's
> more of a heartbeat thing (every second or every couple of seconds, it
> injects the date and time in front of the process it's currently
> running). There's no set number of lines between heartbeats, so one
> heartbeat might cover a 3-line select query, while the next might
> cover 20 different queries, including a 20-line update.
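> Since those timestamps get injected at the front of the entry,
> something like this should at least find the line numbers to cut
> between (the date format is from memory, so the pattern may need
> adjusting):
>
>   grep -n '^090320' mysql.log | head -1
>   grep -n '^090321' mysql.log | head -1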
>
> I do have a script that will step through the log file and parse out
> what updates were made to what database and what table at what time,
> but it craps out when run against the entire log file, so I'm mostly
> just trying to pare the log file down to a size where it'll work with
> my other tools :)
>
>
>> FYI: I work with large binary data sets all the time, and we use split
>> to keep each chunk to 2 GB. Not strictly needed anymore, but if you
>> get a read error etc., it is just the one 2 GB chunk you have to
>> retrieve from backup. It also affords you the ability to copy the
>> data to a FAT32 filesystem for portability.
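>> E.g. (file names and mount point made up):
>>
>>   split -b 2G big_dataset.bin /mnt/fat32/big_dataset.part_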
>>
>
> Normally, we rotate logs nightly and keep about a week's worth, so
> neither total space nor individual file size is usually an issue. In
> this case, logrotate busted for mysql sometime back in November and
> the beast just kept eating.