[ale] Speed un-tar?

Ed Cashin ecashin at noserose.net
Wed Jul 30 22:31:50 EDT 2014


I had been wanting to do something like this for a while, to try out the
multiprocessing library, so here's an example for fun:

  https://gist.github.com/ecashin/96dc9c3183e6e98db2fd#file-ptars-py
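
For readers skimming the thread, here is a minimal sketch of the approach
described in the quoted message below: a fixed pool of worker processes from
multiprocessing, each reading a tar.gz in-process with tarfile instead of
launching a new tar process per file.  It is an illustration only, not the
contents of the gist above.  The names read_members, process_tar and
process_all, the default of 4 workers, and the byte count standing in for the
real processing are all assumptions made for the example (written for
Python 3).

```python
import multiprocessing
import tarfile


def read_members(tar_path):
    """Return the concatenated bytes of every regular file in one tar.gz,
    roughly what `tar xzOf` would write to stdout."""
    chunks = []
    with tarfile.open(tar_path, "r:gz") as tar:
        for member in tar:
            if member.isfile():
                chunks.append(tar.extractfile(member).read())
    return b"".join(chunks)


def process_tar(tar_path):
    """Hypothetical per-archive work: here it only counts the bytes."""
    data = read_members(tar_path)
    return tar_path, len(data)


def process_all(tar_paths, workers=4):
    """Fan the archives out over a fixed number of worker processes."""
    pool = multiprocessing.Pool(processes=workers)
    try:
        # Results arrive in completion order, so key them by path.
        return dict(pool.imap_unordered(process_tar, tar_paths))
    finally:
        pool.close()
        pool.join()
```

In a script, process_all() would be called from under an
`if __name__ == "__main__":` guard with the paths of the archives to read,
for example the ones already copied into the tmpfs.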


On Wed, Jul 30, 2014 at 8:57 PM, Ed Cashin <ecashin at noserose.net> wrote:

> The fact that the VM's storage is spread over 6 disks makes it sound like
> your best bet is to avoid seek times.
>
> So a tmpfs could conceivably help a lot.  By catting the whole files from
> storage into the tmpfs you avoid making the disk heads go back and forth
> more than necessary, and random access can occur in RAM.
>
> Then the trick is to get the data out of the tar file in RAM as quickly as
> possible.  You could use, e.g., the tarfile library from Python.
>
> If I knew more about how you need to process that data I could speculate
> about what that would look like.  I am guessing that you could throw the
> tar files into the tmpfs while you have room with one process and use
> Python's multiprocessing library,
>
>   https://docs.python.org/2/library/multiprocessing.html
>
> ... to manage multiple workers that would concurrently process the data
> from the tmpfs and then delete the tar files from the tmpfs.
>
> And if Python's not your bag, there are other ways to use tar in a
> program.  Often I have been able to get big speedups by simply doing lots
> of work in one (or 10; a fixed number, anyway) Ruby or Python process
> instead of launching a new process for each file.
>
>
>
> On Wed, Jul 30, 2014 at 10:08 AM, Robert L. Harris <
> robert.l.harris at gmail.com> wrote:
>
>> Maybe if I used smaller chunk loads but not sure what that would get me.
>>
>>
>> On Tue, Jul 29, 2014 at 10:33 PM, Jeff Hubbs <jhubbslist at att.net> wrote:
>>
>> > Do you have enough RAM to read from disk and write to a ramdisk or vice
>> > versa, whichever helps?
>> >
>> > On 7/29/14, 6:44 PM, Jim Kinney wrote:
>> >
>> >> Ugh. Sounds like you'll need to do it in stages: a coarse-grained
>> >> search written to new files, then a fine-grained search on those new
>> >> files.
>> >>
>> >> On Jul 29, 2014 6:08 PM, "Robert L. Harris"
>> >> <robert.l.harris at gmail.com> wrote:
>> >>
>> >>> Unfortunately I can't touch the VM's configuration or the hardware
>> >>> underneath it.  Supposedly I'm spread across a minimum of 6 "fast"
>> >>> disks already.  I can't really go less than 10 files though as I am
>> >>> concerned with information being spread across multiple files.  I was
>> >>> hoping someone knew a tool/util which would rip through the data
>> >>> faster that I had not found yet.
>> >>>
>> >>> Robert
>> >>>
>> >>>
>> >>>
>> >>> On Tue, Jul 29, 2014 at 4:00 PM, Jim Kinney <jim.kinney at gmail.com>
>> >>> wrote:
>> >>>
>> >>>> Unless you can spread that read/write load out over many, many
>> >>>> spindles, you're stuck. Now add in that the VM must access them
>> >>>> through the virtual drive process and you've got another performance
>> >>>> hit.
>> >>>>
>> >>>> You _could_ add extra drives to the VM that are hosted on a decent
>> >>>> array (fiber channel or LA network iSCSI), copy the files to the new
>> >>>> home in a batch and hit the 4G RAM limit.
>> >>>>
>> >>>> If possible, can you add more RAM to that VM?
>> >>>>
>> >>>>
>> >>>> On Tue, Jul 29, 2014 at 5:10 PM, Robert L. Harris
>> >>>> <robert.l.harris at gmail.com> wrote:
>> >>>>
>> >>>>> I'm working on a tool to parse through a lot of data for processing.
>> >>>>> Right now it's taking longer than I wish it would so I'm trying to
>> >>>>> find ways to improve the performance.  Right now it appears the
>> >>>>> biggest bottleneck is IO.  I'm looking at about 2000 directories
>> >>>>> which contain between 1 and 200 files in tar.gz format on a VM with
>> >>>>> 4 Gigs of RAM.  I need to load the data into an array to do some
>> >>>>> pre-processing cleanup so I am currently chopping the files in each
>> >>>>> of the directories into an array of groups of 10 files at a time
>> >>>>> ( seems to be the sweet spot to prevent swap ) and then a straight
>> >>>>> forward loop of which each iteration executes:
>> >>>>>
>> >>>>>    tar xzOf $Loop |
>> >>>>>
>> >>>>> and then pushes it into my array for processing.
>> >>>>>
>> >>>>> I have tried:
>> >>>>>
>> >>>>>   gzcat $Loop | tar xO |
>> >>>>>
>> >>>>> which is actually slower.  Yes, I'm at the point of trying to squeeze
>> >>>>> seconds of time out of a group.  Any thoughts of a method which might
>> >>>>> be quicker?
>> >>>>>
>> >>>>> Robert
>> >>>>>
>> >>>>> --
>> >>>>> :wq!
>> >>>>>
>> >>>>> ---------------------------------------------------------------------------
>> >>>>> Robert L. Harris
>> >>>>>
>> >>>>> DISCLAIMER:
>> >>>>>        These are MY OPINIONS             With Dreams To Be A King,
>> >>>>>         ALONE.  I speak for                      First One Should Be A Man
>> >>>>>         no-one else.                                     - Manowar
>> >>>>> -------------- next part --------------
>> >>>>> An HTML attachment was scrubbed...
>> >>>>> URL: <http://mail.ale.org/pipermail/ale/attachments/20140729/38cb3da3/attachment.html>
>> >>>>> _______________________________________________
>> >>>>> Ale mailing list
>> >>>>> Ale at ale.org
>> >>>>> http://mail.ale.org/mailman/listinfo/ale
>> >>>>> See JOBS, ANNOUNCE and SCHOOLS lists at
>> >>>>> http://mail.ale.org/mailman/listinfo
>> >>>>>
>> >>>>
>> >>>> --
>> >>>> James P. Kinney III
>> >>>>
>> >>>> Every time you stop a school, you will have to build a jail. What you
>> >>>> gain at one end you lose at the other. It's like feeding a dog on his
>> >>>> own tail. It won't fatten the dog.
>> >>>> - Speech 11/23/1900 Mark Twain
>> >>>>
>> >>>> http://heretothereideas.blogspot.com/
>
>
>
> --
>   Ed Cashin <ecashin at noserose.net>
>



-- 
  Ed Cashin <ecashin at noserose.net>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ale.org/pipermail/ale/attachments/20140730/104c95d0/attachment.html>

