[ale] File De-duplication

Doug Hall doughalldev at gmail.com
Fri Oct 18 19:58:05 EDT 2013




On Fri, Oct 18, 2013 at 6:57 PM, Doug Hall <doughalldev at gmail.com> wrote:

> I'm building a 24TB (~21TB usable space) FreeNAS system for a friend
> of mine. ZFS alone is a bit of a memory hog. This machine has 16GB of
> memory, which is about the "sweet spot", as some knowledgeable users
> have posted. However, as a rule of thumb they recommend a hefty 5GB of
> memory per TB of stored data if you enable deduplication - for ~21TB,
> that works out to on the order of 100GB. So instead of that, I settled
> for compression, which is pretty fast and not nearly as memory- or
> processor-intensive.
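>
> Something like this turns it on - a minimal sketch, where the
> pool/dataset name "tank/data" is made up and lz4 assumes a ZFS
> recent enough to support it:
>
>     # Enable LZ4 compression on a dataset
>     zfs set compression=lz4 tank/data
>
>     # After some data has been written, check how well it compresses
>     zfs get compressratio tank/data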
>
>
> On Fri, Oct 18, 2013 at 5:32 PM, Jim Kinney <jim.kinney at gmail.com> wrote:
>
>> Oh yes! The de-dupe process can totally eat IO for hours. ZFS
>> supports dedupe at the block level, as do several SAN devices. It
>> gulps time and storage to keep that table of checksums and block
>> references, and adds a lookup for every block write. It can save
>> more space and time than it costs.
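>>
>> On ZFS you can estimate whether it's worth it before committing.
>> A rough sketch, assuming a hypothetical pool/dataset named
>> tank/data:
>>
>>     # Simulate dedup on an existing pool: prints a DDT histogram
>>     # and an estimated dedup ratio without changing anything
>>     zdb -S tank
>>
>>     # If the estimated ratio looks worthwhile, enable per dataset
>>     zfs set dedup=on tank/data
>>
>>     # Once dedup is in use, dedup table stats show up here
>>     zpool status -D tank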
>>
>> Oh, and don't even think about block-level dedupe on encrypted
>> drives. The reason is an exercise for the reader :-)
>> On Oct 18, 2013 5:02 PM, "Jeff Hubbs" <jhubbslist at att.net> wrote:
>>
>>> When I was running a previous employer's file server (which I built
>>> on Gentoo, btw, referencing the other thread), I would pipe find
>>> output through xargs to md5sum and then to sort, giving me a text
>>> file I could eyeball to see where the dupes tended to be - something
>>> like the pipeline sketched below.  In my view it wasn't a big deal
>>> until you had, say, ISO images that a dozen or more people had
>>> copies of; if that's going on, some housecleaning and organization
>>> needs to take place.  I suppose if you wanted you could script
>>> something that moved the dupes to a common area and generated links
>>> in their place, but I'm not sure that wouldn't introduce more
>>> problems than it solves.
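>>>
>>> From memory, the pipeline was something like this (the share path
>>> is made up, and GNU coreutils is assumed):
>>>
>>>     # Checksum every file; sorting puts identical hashes on
>>>     # adjacent lines so duplicates are easy to spot
>>>     find /srv/share -type f -print0 \
>>>         | xargs -0 md5sum \
>>>         | sort > /tmp/sums.txt
>>>
>>>     # Print only the duplicate groups; the md5 digest is the
>>>     # first 32 characters of each line
>>>     uniq -w 32 --all-repeated=separate /tmp/sums.txt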
>>>
>>> As for auto-de-duping filesystems - which I suppose involves some sort
>>> of abstraction between what the OS thinks are files and what actually goes
>>> on disk - I wonder if there wouldn't wind up being some rather casual disk
>>> operations that could set off a whole flurry of r/w activity and plug up
>>> the works for a little while. Fun to experiment with, I'm sure.
>>>
>>> On 10/18/13 12:34 PM, Calvin Harrigan wrote:
>>>
>>>> Good Afternoon,
>>>>     I'm looking for a little advice/recommendation on file
>>>> de-duplication software. I have a disk filled with files that most
>>>> certainly have duplicates.  What's the best way to get rid of the
>>>> duplicates?  I'd like to check deeper than just file
>>>> name/date/size; if possible I'd like to check content (checksum?).
>>>> Are you aware of anything like that?  Linux or Windows is fine.
>>>> Thanks
>>
>> _______________________________________________
>> Ale mailing list
>> Ale at ale.org
>> http://mail.ale.org/mailman/listinfo/ale
>> See JOBS, ANNOUNCE and SCHOOLS lists at
>> http://mail.ale.org/mailman/listinfo
>>
>>
>

