[ale] File De-duplication

Doug Hall doughalldev at gmail.com
Fri Oct 18 19:57:23 EDT 2013


I'm building a 24TB (~21 TB usable) FreeNAS system for a friend of mine.
ZFS alone is a bit of a memory hog. This machine has 16GB of memory, which
is about the "sweet spot" some knowledgeable users have suggested. However,
the rule of thumb they give is a hefty 5GB of memory per TB of data you're
storing if you enable deduplication - which would put this 21 TB pool at
roughly 105GB of RAM. So instead of that, I just decided to settle for
compression, which is pretty fast and nowhere near as memory- or
processor-intensive.
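For reference, a minimal sketch of that compression-only setup on a ZFS pool (the pool name "tank" is hypothetical); `zdb -S` will simulate dedup and show the ratio and table size you would have paid for, without actually enabling it:

```shell
# Enable fast LZ4 compression instead of dedup ("tank" is a hypothetical pool name)
zfs set compression=lz4 tank

# Check how well compression is doing on the pool
zfs get compressratio tank

# Simulate dedup to estimate the dedup ratio and table cost before committing
zdb -S tank
```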


On Fri, Oct 18, 2013 at 5:32 PM, Jim Kinney <jim.kinney at gmail.com> wrote:

> Oh yes! The de-dupe process can totally eat I/O for hours. ZFS supports
> dedupe at the block level, as do several SAN devices. It gulps time and
> storage to keep that table of checksums and block references, and adds a
> lookup on every block write. It can still save more space and time than
> it costs.
>
> Oh, and don't even think about block-level dedupe on encrypted drives.
> The reason is left as an exercise for the reader :-)
> On Oct 18, 2013 5:02 PM, "Jeff Hubbs" <jhubbslist at att.net> wrote:
>
>> When I was running a previous employer's file server (which I built on
>> Gentoo, btw, referencing the other thread), I would pipe find output to
>> xargs to md5sum to sort, so that I got a text file I could eyeball to
>> see where the dupes tended to be.  In my view it wasn't a big deal until
>> you had, say, ISO images that a dozen or more people had copies of - if
>> that's going on, some housecleaning and organization needs to take
>> place.  I suppose if you wanted you could script something that moved
>> dupes to a common area and generated links in their place, but I'm not
>> sure that wouldn't introduce more problems than it solves.
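For anyone wanting to reproduce that, the pipeline might look something like this (the share path is hypothetical; -print0/-0 keep filenames with spaces intact):

```shell
# Checksum every file, sorted so identical hashes land on adjacent lines
# (/srv/share is a hypothetical path)
find /srv/share -type f -print0 \
    | xargs -0 md5sum \
    | sort > /tmp/checksums.txt

# Checksums that appear more than once mark duplicate candidates
awk '{print $1}' /tmp/checksums.txt | uniq -d
```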
>>
>> As for auto-de-duping filesystems - which I suppose involves some sort
>> of abstraction between what the OS thinks are files and what actually
>> goes on disk - I wonder whether some otherwise-casual disk operations
>> could set off a whole flurry of r/w activity and plug up the works for a
>> little while. Fun to experiment with, I'm sure.
>>
>> On 10/18/13 12:34 PM, Calvin Harrigan wrote:
>>
>>> Good Afternoon,
>>>     I'm looking for a little advice/recommendation on file
>>> de-duplication software. I have a disk filled with files that almost
>>> certainly have duplicates.  What's the best way to get rid of the
>>> duplicates?  I'd like to check deeper than just file name/date/size; if
>>> possible I'd like to compare content (checksums?).  Are you aware of
>>> anything like that?  Linux or Windows is fine.  Thanks
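For what's being asked here, a dedicated tool may be simpler than a hand-rolled pipeline; fdupes, for example, compares files by size, then by MD5 checksum, then byte-by-byte. The path below is hypothetical:

```shell
# List groups of duplicate files, recursing into subdirectories
fdupes -r /mnt/data

# Same, but also report the size of the files in each duplicate group
fdupes -rS /mnt/data
```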
>>> _______________________________________________
>>> Ale mailing list
>>> Ale at ale.org
>>> http://mail.ale.org/mailman/listinfo/ale
>>> See JOBS, ANNOUNCE and SCHOOLS lists at
>>> http://mail.ale.org/mailman/listinfo
>>>
>>>