<div dir="ltr">I had been wanting to do something like this for a while, to try out the multiprocessing library, so here's an example for fun:<div><br></div><div> <a href="https://gist.github.com/ecashin/96dc9c3183e6e98db2fd#file-ptars-py">https://gist.github.com/ecashin/96dc9c3183e6e98db2fd#file-ptars-py</a></div>
</div><div class="gmail_extra"><br><br><div class="gmail_quote">On Wed, Jul 30, 2014 at 8:57 PM, Ed Cashin <span dir="ltr"><<a href="mailto:ecashin@noserose.net" target="_blank">ecashin@noserose.net</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">The fact that the VM's storage is spread over 6 disks makes it sound like your best bet is to avoid seek times.<div>
<br></div><div>So a tmpfs could conceivably help a lot. By catting the whole files from storage into the tmpfs you avoid making the disk heads go back and forth more than necessary, and random access can occur in RAM.</div>
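<div><br></div><div>A rough sketch of that staging step, in case it helps. The function name <code>stage_into_ram</code> is made up for illustration, and the assumption is that <code>ram_dir</code> points at a tmpfs mount (often /dev/shm on Linux, but check your own system). <code>shutil.copyfile</code> reads the source front to back, which is the sequential access pattern described above.</div>
<pre>
```python
import os
import shutil


def stage_into_ram(src_path, ram_dir):
    """Copy one file sequentially into a RAM-backed directory.

    ram_dir is assumed to be a tmpfs mount point (hypothetical example:
    /dev/shm). Returns the path of the staged copy.
    """
    dst_path = os.path.join(ram_dir, os.path.basename(src_path))
    # copyfile streams the source from start to end, much like cat would,
    # so the disk is read in one pass with no random access.
    shutil.copyfile(src_path, dst_path)
    return dst_path
```
</pre>
<div>Random access then happens against the staged copy, and the caller removes it when done so the tmpfs does not fill up.</div>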
<div><br></div><div>Then the trick is to get the data out of the tar file in RAM as quickly as possible. You could use, e.g., Python's tarfile library.</div><div><br></div><div>If I knew more about how you need to process that data I could speculate about what that would look like. I am guessing that one process could copy tar files into the tmpfs while there is room, and that you could use Python's multiprocessing library,</div>
<div><br></div><div> <a href="https://docs.python.org/2/library/multiprocessing.html" target="_blank">https://docs.python.org/2/library/multiprocessing.html</a></div><div><br></div><div>... to manage multiple workers that would concurrently process the data from the tmpfs and then delete the tar files from the tmpfs.</div>
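<div><br></div><div>Something like the following is what I have in mind for the workers, as a sketch only. The names <code>process_archive</code> and <code>process_all</code> are hypothetical, the default of 4 workers is arbitrary, and counting bytes stands in for whatever the real processing is. It assumes the tar.gz files have already been staged in the tmpfs.</div>
<pre>
```python
import multiprocessing
import os
import tarfile


def process_archive(path):
    """Read every regular file in one staged tar.gz, then delete the archive."""
    total_bytes = 0
    with tarfile.open(path, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            data = tar.extractfile(member).read()
            # Placeholder for the real processing: just count the bytes.
            total_bytes += len(data)
    os.remove(path)  # free the tmpfs space for the next archive
    return path, total_bytes


def process_all(paths, workers=4):
    """Process the archives concurrently with a fixed number of workers."""
    pool = multiprocessing.Pool(workers)
    try:
        # Results arrive in completion order; collect them by archive path.
        return dict(pool.imap_unordered(process_archive, paths))
    finally:
        pool.close()
        pool.join()
```
</pre>
<div>On platforms that start workers by spawning a fresh interpreter, the call to <code>process_all</code> needs to sit under an <code>if __name__ == "__main__":</code> guard.</div>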
<div><br></div><div>And if Python's not your bag, there are other ways to use tar in a program. Often I have been able to get big speedups by simply doing lots of work in one Ruby or Python process (or 10; a fixed number, anyway) instead of launching a new process for each file.</div>
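<div><br></div><div>To illustrate the one-process idea, here is a sketch that walks a whole batch of tar.gz files inside a single Python process rather than running tar once per file. The generator name <code>iter_member_data</code> is invented for this example. The <code>"r|gz"</code> mode is tarfile's streaming mode, which reads each archive strictly front to back.</div>
<pre>
```python
import tarfile


def iter_member_data(archive_paths):
    """Yield (archive path, member name, bytes) for each regular file, in order."""
    for path in archive_paths:
        # "r|gz" treats the archive as a forward-only gzip stream, so each
        # member must be read before moving on to the next one.
        with tarfile.open(path, "r|gz") as tar:
            for member in tar:
                if member.isfile():
                    yield path, member.name, tar.extractfile(member).read()
```
</pre>
<div>A loop such as <code>for path, name, data in iter_member_data(group):</code> then gets the same bytes that extracting to stdout would produce, without a fork and exec for every archive.</div>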
<div><br></div></div><div class="gmail_extra"><div><div class="h5"><br><br><div class="gmail_quote">On Wed, Jul 30, 2014 at 10:08 AM, Robert L. Harris <span dir="ltr"><<a href="mailto:robert.l.harris@gmail.com" target="_blank">robert.l.harris@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Maybe if I used smaller chunk loads but not sure what that would get me.<br>
<div><div><br>
<br>
On Tue, Jul 29, 2014 at 10:33 PM, Jeff Hubbs <<a href="mailto:jhubbslist@att.net" target="_blank">jhubbslist@att.net</a>> wrote:<br>
<br>
> Do you have enough RAM to read from disk and write to a ramdisk or vice<br>
> versa, whichever helps?<br>
><br>
> On 7/29/14, 6:44 PM, Jim Kinney wrote:<br>
><br>
>> Ugh. Sounds like you'll need to do it in stages: a coarse-grained search written<br>
>> to new files, then a fine-grained search on those new files.<br>
>> On Jul 29, 2014 6:08 PM, "Robert L. Harris" <<a href="mailto:robert.l.harris@gmail.com" target="_blank">robert.l.harris@gmail.com</a>><br>
>> wrote:<br>
>><br>
>>> Unfortunately I can't touch the VM's configuration or the hardware<br>
>>> underneath it. Supposedly I'm spread across a minimum of 6 "fast" disks<br>
>>> already. I can't really go less than 10 files though as I am concerned<br>
>>> with information being spread across multiple files. I was hoping someone<br>
>>> knew a tool/util which would rip through the data faster that I had not<br>
>>> found yet.<br>
>>><br>
>>> Robert<br>
>>><br>
>>><br>
>>><br>
>>> On Tue, Jul 29, 2014 at 4:00 PM, Jim Kinney <<a href="mailto:jim.kinney@gmail.com" target="_blank">jim.kinney@gmail.com</a>><br>
>>> wrote:<br>
>>><br>
>>>> Unless you can spread that read/write load out over many, many spindles,<br>
>>>> you're stuck. Now add in that the VM must access through the virtual drive<br>
>>>> process and you've got another performance hit.<br>
>>>><br>
>>>> You _could_ add extra drives to the VM that are hosted on a decent array<br>
>>>> (fiber channel or LA network iSCSI), copy the files to the new home in a<br>
>>>> batch and hit the 4G RAM limit.<br>
>>>><br>
>>>> If possible, can you add more RAM to that VM?<br>
>>>><br>
>>>><br>
>>>> On Tue, Jul 29, 2014 at 5:10 PM, Robert L. Harris <<a href="mailto:robert.l.harris@gmail.com" target="_blank">robert.l.harris@gmail.com</a>> wrote:<br>
>>>>><br>
>>>>> I'm working on a tool to parse through a lot of data for processing. Right<br>
>>>>> now it's taking longer than I wish it would so I'm trying to find ways to<br>
>>>>> improve the performance. Right now it appears the biggest bottleneck is<br>
>>>>> IO. I'm looking at about 2000 directories which contain between 1 and 200<br>
>>>>> files in tar.gz format on a VM with 4 Gigs of RAM. I need to load the data<br>
>>>>> into an array to do some pre-processing cleanup so I am currently chopping<br>
>>>>> the files in each of the directories into an array of groups of 10 files at<br>
>>>>> a time ( seems to be the sweet spot to prevent swap ) and then a straight<br>
>>>>> forward loop of which each iteration executes:<br>
>>>>><br>
>>>>> tar xzOf $Loop |<br>
>>>>><br>
>>>>> and then pushes it into my array for processing.<br>
>>>>><br>
>>>>> I have tried:<br>
>>>>><br>
>>>>> gzcat $Loop | tar xO |<br>
>>>>><br>
>>>>> which is actually slower. Yes, I'm at the point of trying to squeeze<br>
>>>>> seconds of time out of a group. Any thoughts of a method which might be<br>
>>>>> quicker?<br>
>>>>><br>
>>>>> Robert<br>
>>>>><br>
>>>>><br>
>>>>><br>
>>>>><br>
>>>>><br>
>>>>><br>
>>>>><br>
>>>>> --<br>
>>>>> :wq!<br>
>>>>><br>
>>>>> ---------------------------------------------------------------------------<br>
>>>>> Robert L. Harris<br>
>>>>><br>
>>>>> DISCLAIMER:<br>
>>>>> These are MY OPINIONS With Dreams To Be A King,<br>
>>>>> ALONE. I speak for First One Should Be A Man<br>
>>>>> no-one else. - Manowar<br>
>>>>> -------------- next part --------------<br>
>>>>> An HTML attachment was scrubbed...<br>
>>>>> URL: <<br>
>>>>><br>
>>>>> <a href="http://mail.ale.org/pipermail/ale/attachments/20140729/" target="_blank">http://mail.ale.org/pipermail/ale/attachments/20140729/</a><br>
>>> 38cb3da3/attachment.html<br>
>>><br>
>>>> _______________________________________________<br>
>>>>> Ale mailing list<br>
>>>>> <a href="mailto:Ale@ale.org" target="_blank">Ale@ale.org</a><br>
>>>>> <a href="http://mail.ale.org/mailman/listinfo/ale" target="_blank">http://mail.ale.org/mailman/listinfo/ale</a><br>
>>>>> See JOBS, ANNOUNCE and SCHOOLS lists at<br>
>>>>> <a href="http://mail.ale.org/mailman/listinfo" target="_blank">http://mail.ale.org/mailman/listinfo</a><br>
>>>>><br>
>>>>><br>
>>>><br>
>>>> --<br>
>>>> --<br>
>>>> James P. Kinney III<br>
>>>><br>
>>>> Every time you stop a school, you will have to build a jail. What you<br>
>>>> gain at one end you lose at the other. It's like feeding a dog on his own<br>
>>>> tail. It won't fatten the dog.<br>
>>>> - Speech 11/23/1900 Mark Twain<br>
>>>><br>
>>>><br>
>>>> *<a href="http://heretothereideas.blogspot.com/" target="_blank">http://heretothereideas.blogspot.com/</a><br>
>>>> <<a href="http://heretothereideas.blogspot.com/" target="_blank">http://heretothereideas.blogspot.com/</a>>*<br>
>>>><br>
>>>><br>
>>><br>
</div></div><div>
>>><br>
>><br>
>><br>
><br>
<br>
<br>
<br>
</div><div>--<br>
:wq!<br>
---------------------------------------------------------------------------<br>
Robert L. Harris<br>
<br>
DISCLAIMER:<br>
These are MY OPINIONS With Dreams To Be A King,<br>
ALONE. I speak for First One Should Be A Man<br>
no-one else. - Manowar<br>
</div>
<div><div>
</div></div></blockquote></div><br><br clear="all"><div><br></div></div></div><span class="HOEnZb"><font color="#888888">-- <br><div dir="ltr"> Ed Cashin <<a href="mailto:ecashin@noserose.net" target="_blank">ecashin@noserose.net</a>></div>
</font></span></div>
</blockquote></div><br><br clear="all"><div><br></div>-- <br><div dir="ltr"> Ed Cashin <<a href="mailto:ecashin@noserose.net" target="_blank">ecashin@noserose.net</a>></div>
</div>