[ale] Speed un-tar?

Ed Cashin ecashin at noserose.net
Wed Jul 30 20:57:49 EDT 2014


The fact that the VM's storage is spread over 6 disks makes it sound like
your best bet is to avoid seek times.

So a tmpfs could conceivably help a lot.  By catting the whole files from
storage into the tmpfs you avoid making the disk heads go back and forth
more than necessary, and random access can occur in RAM.
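
A rough sketch of that staging step, assuming a Linux box where a tmpfs is mounted somewhere like /dev/shm (the function name `stage_into_tmpfs` and the paths are hypothetical, not anything from the thread). Each file is read whole, front to back, before the next one is opened, so the disks see sequential reads:

```python
import os
import shutil


def stage_into_tmpfs(src_paths, tmpfs_dir, bufsize=1024 * 1024):
    """Copy each file whole, one after another, into tmpfs_dir.

    Returns the list of staged paths.  File names are assumed to be
    unique within one batch, because only the base name is kept.
    """
    staged = []
    for src in src_paths:
        dst = os.path.join(tmpfs_dir, os.path.basename(src))
        # One sequential read per file: no interleaving between files.
        with open(src, "rb") as fin, open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout, bufsize)
        staged.append(dst)
    return staged
```

With 4 GB of RAM the tmpfs competes with everything else for memory, so a batch would have to be small enough to leave room for the processing itself.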

Then the trick is to get the data out of the tar file in RAM as quickly as
possible.  You could use, e.g., the tarfile library from Python.
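
For illustration, here is a minimal sketch of reading a .tar.gz with the tarfile module inside one Python process, roughly what `tar xzOf` does but without launching tar for every archive. It is written for Python 3, and the helper name `read_members` is made up for this example:

```python
import tarfile


def read_members(tar_path):
    """Yield (member_name, data_bytes) for each regular file in a .tar.gz."""
    with tarfile.open(tar_path, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, symlinks, and so on
            yield member.name, tar.extractfile(member).read()
```

Each member's contents arrive as a bytes object, so they can go straight into whatever array the cleanup step works on.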

If I knew more about how you need to process that data I could speculate
about what that would look like.  I am guessing that you could use one
process to throw the tar files into the tmpfs while you have room, and use
Python's multiprocessing library,

  https://docs.python.org/2/library/multiprocessing.html

... to manage multiple workers that would concurrently process the data
from the tmpfs and then delete the tar files from the tmpfs.
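
A sketch of what the worker side might look like, under assumptions: Python 3, archives already staged as *.tar.gz in one tmpfs directory, and a byte count standing in for the real processing. The names `process_archive` and `process_staged` and the worker count of 4 are all hypothetical:

```python
import glob
import os
import tarfile
from multiprocessing import Pool


def process_archive(tar_path):
    """Read every regular file in one archive, then delete the archive.

    Returns (tar_path, total_bytes); the byte count is only a placeholder
    for whatever the real processing would produce.
    """
    total = 0
    with tarfile.open(tar_path, "r:gz") as tar:
        for member in tar:
            if member.isfile():
                total += len(tar.extractfile(member).read())
    os.remove(tar_path)  # free the tmpfs space as soon as we are done
    return tar_path, total


def process_staged(tmpfs_dir, workers=4):
    """Run process_archive over all staged archives with a fixed pool."""
    paths = sorted(glob.glob(os.path.join(tmpfs_dir, "*.tar.gz")))
    with Pool(processes=workers) as pool:
        return list(pool.imap_unordered(process_archive, paths))
```

A caller would invoke `process_staged("/dev/shm/staging")` (an assumed path) from under an `if __name__ == "__main__":` guard, so that worker processes can import the module safely.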

And if Python's not your bag, there are other ways to use tar in a program.
Often I have been able to get big speedups by simply doing lots of work in
one (or 10; a fixed number, anyway) Ruby or Python processes instead of
launching a new process for each file.



On Wed, Jul 30, 2014 at 10:08 AM, Robert L. Harris
<robert.l.harris at gmail.com> wrote:

> Maybe if I used smaller chunk loads but not sure what that would get me.
>
>
> On Tue, Jul 29, 2014 at 10:33 PM, Jeff Hubbs <jhubbslist at att.net> wrote:
>
> > Do you have enough RAM to read from disk and write to a ramdisk or vice
> > versa, whichever helps?
> >
> > On 7/29/14, 6:44 PM, Jim Kinney wrote:
> >
> >> Ugh. Sounds like you'll need to do it in stages. Coarse grain search
> >> written to new files and a fine grained search on those new files.
> >> On Jul 29, 2014 6:08 PM, "Robert L. Harris" <robert.l.harris at gmail.com>
> >> wrote:
> >>
> >>> Unfortunately I can't touch the VM's configuration or the hardware
> >>> underneath it.  Supposedly I'm spread across a minimum of 6 "fast" disks
> >>> already.  I can't really go less than 10 files though as I am concerned
> >>> with information being spread across multiple files.  I was hoping
> >>> someone knew a tool/util which would rip through the data faster I had
> >>> not found yet.
> >>>
> >>> Robert
> >>>
> >>>
> >>>
> >>> On Tue, Jul 29, 2014 at 4:00 PM, Jim Kinney <jim.kinney at gmail.com>
> >>> wrote:
> >>>
> >>>> Unless you can spread that read/write load out over many, many
> >>>> spindles, you're stuck. Now add in that the VM must access through the
> >>>> virtual drive process and you've got another performance hit.
> >>>>
> >>>> You _could_ add extra drives to the VM that are hosted on a decent array
> >>>> (fiber channel or LA network iSCSI), copy the files to the new home in a
> >>>> batch and hit the 4G RAM limit.
> >>>>
> >>>> If possible, can you add more RAM to that VM?
> >>>>
> >>>>
> >>>> On Tue, Jul 29, 2014 at 5:10 PM, Robert L. Harris
> >>>> <robert.l.harris at gmail.com> wrote:
> >>>>
> >>>>> I'm working on a tool to parse through a lot of data for processing.
> >>>>> Right now it's taking longer than I wish it would so I'm trying to find
> >>>>> ways to improve the performance.  Right now it appears the biggest
> >>>>> bottleneck is IO.  I'm looking at about 2000 directories which contain
> >>>>> between 1 and 200 files in tar.gz format on a VM with 4 Gigs of RAM.  I
> >>>>> need to load the data into an array to do some pre-processing cleanup so
> >>>>> I am currently chopping the files in each of the directories into an
> >>>>> array of groups of 10 files at a time ( seems to be the sweet spot to
> >>>>> prevent swap ) and then a straight forward loop of which each iteration
> >>>>> executes:
> >>>>>
> >>>>>    tar xzOf $Loop |
> >>>>>
> >>>>> and then pushes it into my array for processing.
> >>>>>
> >>>>> I have tried:
> >>>>>
> >>>>>   gzcat $Loop | tar xO |
> >>>>>
> >>>>> which is actually slower.  Yes, I'm at the point of trying to squeeze
> >>>>> seconds of time out of a group.  Any thoughts of a method which might be
> >>>>> quicker?
> >>>>>
> >>>>> Robert
> >>>>>
> >>>>> --
> >>>>> :wq!
> >>>>> ---------------------------------------------------------------------------
> >>>>> Robert L. Harris
> >>>>>
> >>>>> DISCLAIMER:
> >>>>>       These are MY OPINIONS             With Dreams To Be A King,
> >>>>>        ALONE.  I speak for                      First One Should Be A Man
> >>>>>        no-one else.                                     - Manowar
> >>>>> -------------- next part --------------
> >>>>> An HTML attachment was scrubbed...
> >>>>> URL: <http://mail.ale.org/pipermail/ale/attachments/20140729/38cb3da3/attachment.html>
> >>>>> _______________________________________________
> >>>>> Ale mailing list
> >>>>> Ale at ale.org
> >>>>> http://mail.ale.org/mailman/listinfo/ale
> >>>>> See JOBS, ANNOUNCE and SCHOOLS lists at
> >>>>> http://mail.ale.org/mailman/listinfo
> >>>>>
> >>>>>
> >>>>
> >>>> --
> >>>> --
> >>>> James P. Kinney III
> >>>>
> >>>> Every time you stop a school, you will have to build a jail. What you
> >>>> gain at one end you lose at the other. It's like feeding a dog on his
> >>>> own tail. It won't fatten the dog.
> >>>> - Speech 11/23/1900 Mark Twain
> >>>>
> >>>>
> >>>> *http://heretothereideas.blogspot.com/
> >>>> <http://heretothereideas.blogspot.com/>*
> >>>>
> >>>
> >>
>
>
>



-- 
  Ed Cashin <ecashin at noserose.net>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.ale.org/pipermail/ale/attachments/20140730/f4503737/attachment.html>