<p dir="ltr">Oh yes! De-dupe process can totally eat IO for hours. ZFS supports dedupe at the block level as do several SAN devices. It will gulp time and storage to keep that list of checksum and block data. Add in the lookup for every block write. It will/can save more space and time than it costs.</p>
<p dir="ltr">Oh. Don't even think about block level dedupe on encrypted drives. It's an exercise for the reader on the reason :-)</p>
<div class="gmail_quote">On Oct 18, 2013 5:02 PM, "Jeff Hubbs" <<a href="mailto:jhubbslist@att.net">jhubbslist@att.net</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
When I was running a previous employer's file server (which I built on Gentoo, btw, referencing the other thread), I would pipe find output through xargs to md5sum and then sort, giving me a text file I could visually eyeball to see where the dupes tended to be (something like the pipeline below). In my view it wasn't a big deal until you had, like, ISO images that a dozen or more people had copies of - if that's going on, there needs to be some housecleaning and organization taking place. I suppose if you wanted you could script something that moved the dupes to a common area and generated links in their place, but I'm not sure that wouldn't introduce more problems than it solves.<br>
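A minimal version of that pipeline (the /data path is just an example; -print0/-0 keep filenames with spaces from breaking things):<br>
<pre>
# Checksum every file and sort, so files with identical contents
# end up on adjacent lines
find /data -type f -print0 | xargs -0 md5sum | sort > checksums.txt

# Optional: show only the duplicate groups (GNU uniq; -w32 compares
# just the 32-char md5 field, --all-repeated separates each group)
uniq -w32 --all-repeated=separate checksums.txt
</pre>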
<br>
As for auto-de-duping filesystems - which I suppose involves some sort of abstraction between what the OS thinks are files and what actually goes on disk - I suspect some rather casual disk operations could wind up setting off a whole flurry of r/w activity and plugging up the works for a little while. Fun to experiment with, I'm sure.<br>
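A cheap way to watch for that kind of flurry if you do experiment (ZFS-specific, and "tank" is a placeholder pool name):<br>
<pre>
# Print per-device read/write ops and bandwidth every second
zpool iostat -v tank 1
</pre>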
<br>
On 10/18/13 12:34 PM, Calvin Harrigan wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Good Afternoon,<br>
I'm looking for a little advice/recommendation on file de-duplication software. I have a disk filled with files that most certainly have duplicates. What's the best way to get rid of the duplicates? I'd like to check deeper than just file name/date/size. If possible I'd like to check content (checksum?). Are you aware of anything like that? Linux or Windows is fine. Thanks<br>
<br>
</blockquote>
<br>
_______________________________________________<br>
Ale mailing list<br>
<a href="mailto:Ale@ale.org" target="_blank">Ale@ale.org</a><br>
<a href="http://mail.ale.org/mailman/listinfo/ale" target="_blank">http://mail.ale.org/mailman/listinfo/ale</a><br>
See JOBS, ANNOUNCE and SCHOOLS lists at<br>
<a href="http://mail.ale.org/mailman/listinfo" target="_blank">http://mail.ale.org/mailman/listinfo</a><br>
</blockquote></div>