[ale] File De-duplication

leam hall leamhall at gmail.com
Fri Oct 18 13:14:45 EDT 2013


Hmm....

This <pseudo-code> is off the top of my head, so there are probably some
serious issues with it.

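# bash; SUMS is a placeholder for <file_of_sums>, keep it outside /my_dir
SUMS=./file_of_sums
touch "$SUMS"
find /my_dir -type f -print0 | while IFS= read -r -d '' file
do
   # read from stdin so md5sum prints "<hash>  -"; keep only the hash
   MD5=$(md5sum < "$file" | cut -d' ' -f1)
   [ -n "$MD5" ] || continue      # skip files that could not be read
   if grep -qxF "$MD5" "$SUMS"
   then
      rm -- "$file"               # hash already recorded: duplicate content
   else
      echo "$MD5" >> "$SUMS"
   fi
done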
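
Since the loop above deletes as it goes, a report-only pass may be a safer
first step. Here is a rough sketch of that, assuming bash 4+ (for associative
arrays) and GNU find/sort/md5sum. The function name list_dupes is made up for
illustration. It prints "duplicate<TAB>first-seen file" and removes nothing.

```shell
#!/usr/bin/env bash
# list_dupes DIR (hypothetical helper name)
# Prints one line per duplicate: "<duplicate>\t<first file with same content>".
# Nothing is deleted. Assumes bash 4+ and GNU find, sort, md5sum.
list_dupes() {
   local dir=$1 file sum
   declare -A seen               # md5 hash -> first file seen with that hash
   while IFS= read -r -d '' file
   do
      # Hash via stdin so the output is always "<hash>  -"
      sum=$(md5sum < "$file" | cut -d' ' -f1)
      [[ -n $sum ]] || continue  # unreadable file: skip it
      if [[ -n ${seen[$sum]:-} ]]
      then
         printf '%s\t%s\n' "$file" "${seen[$sum]}"
      else
         seen[$sum]=$file
      fi
   done < <(find "$dir" -type f -print0 | sort -z)
}
```

Something like "list_dupes /my_dir > dupes.txt" would give a list to review
before removing anything. Sorting the file names just makes it predictable
which copy counts as the original.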





On Fri, Oct 18, 2013 at 12:59 PM, JD <jdp at algoloma.com> wrote:

> Slashdot had a question about this 1-2 yrs ago.  Lots of people suggested
> scripting it, others pointed out some C code on sourceforge.
>
> I had a few hrs free that day and wrote some Perl (200+ LOC). Use it all the
> time, but I'd probably go with the C tool for any very large datasets. Mine
> doesn't automatically remove anything and is far from perfect, that is
> certain. It is relatively fast on most types of files, however.
>
> On 10/18/2013 12:34 PM, Calvin Harrigan wrote:
> > Good Afternoon,
> >     I'm looking for a little advice/recommendation on file de-duplication
> > software. I have a disk filled with files that most certainly have
> > duplicates.  What's the best way to get rid of the duplicates?  I'd like to
> > check deeper than just file name/date/size.  If possible I'd like to check
> > content (checksum?).  Are you aware of anything like that?  Linux or
> > Windows is fine.  Thanks



-- 
Mind on a Mission <http://leamhall.blogspot.com/>