[mirror-admin] fullfilelist (was Re: Please use --delay-updates)

Carlos Carvalho carlos at fisica.ufpr.br
Fri Jun 4 17:10:58 EDT 2010


Mike McGrath (mmcgrath at redhat.com) wrote on 21 April 2010 15:58:
 >On Tue, 20 Apr 2010, Carlos Carvalho wrote:
 >
 >> Mike McGrath (mmcgrath at redhat.com) wrote on 20 April 2010 14:52:
 >>  >On Tue, 20 Apr 2010, Carlos Carvalho wrote:
 >>  >
 >>  >> Mike McGrath (mmcgrath at redhat.com) wrote on 20 April 2010 09:48:
 >>  >>  >On Fri, 16 Apr 2010, Carlos Carvalho wrote:
 >>  >>  >
 >>  >>  >> Chuck Anderson (cra at wpi.edu) wrote on 16 April 2010 08:41:
 >>  >>  >>  >Each time you run rsync against your upstream mirror, it scans the
 >>  >>  >>  >entire filesystem to build a filelist.  This could take anywhere from
 >>  >>  >>  >5 to 20 minutes or more
 >>  >>  >>
 >>  >>  >> More... :-(
 >>  >>  >>
 >>  >>  >>  >and has been a factor in overloading the master mirrors in the past.
 >>  >>  >>
 >>  >>  >> I'd say nowadays too... The table below shows the time we take just to
 >>  >>  >> get the file list from sync.fedoraproject over the last few days. We
 >>  >>  >> mirror everything starting from release 11. It shows clearly that the
 >>  >>  >> machine suffers significantly from disk scanning. The file list is
 >>  >>  >> only about 22MB. Times are in UTC-3.
 >>  >>  >>
 >>  >>  >> If fullfilelist was done properly we could completely avoid this
 >>  >>  >> scanning...
 >>  >>  >>
 >>  >>  >
 >>  >>  >Can you expand more on this, how can we do fullfilelist properly?
 >>  >>
 >>  >> Including timestamp and size (and type of object).
 >>  >>
 >>  >> The current version only gives the names. Downstream mirrors can use it
 >>  >> to see what has been removed and created but cannot know what has been
 >>  >> modified. They're thus forced to request a full disk scan. If you put
 >>  >> the necessary info in fullfilelist mirrors can rsync it, see
 >>  >> *everything* that must be updated and directly request only what's
 >>  >> necessary with rsync --files-from. This way no disk scanning would be
 >>  >> necessary upstream.
 >>  >>
 >>  >> The format I propose is the one generated by rsync itself:
 >>  >>
 >>  >> % cd /path/to/repository
 >>  >> % rsync -r . > /path/to/fullfilelist
 >>  >>
 >>  >> If you want fullfilelist to include itself it's of course necessary to
 >>  >> adjust it afterwards but that's easy. Note also that "self-inclusion"
 >>  >> is not necessary because mirrors would pull it always.
 >>  >>
 >>  >> It's possible to maintain this list without scanning the repo; it can
 >>  >> be done by the procedure that updates the master. However even if it's
 >>  >> done by scanning, its cost will be compensated by the scans that the
 >>  >> mirrors will not inflict on the master. Even if it's only
 >>  >> fedora.c3sl.ufpr.br that avoids it :-)
 >>  >>
 >>  >
 >>  >I take it once you pull that fullfilelist down, you'll do a diff against
 >>  >the fullfilelist you currently have to generate a final list or is there a
 >>  >step in there I'm not following?
 >>
 >> That's the general idea, yes.
 >>
 >
 >The wheels are in motion; it's a matter of working with releng to find
 >exactly where that command should go and how it should get triggered.  It
 >takes 45 minutes or so to run and it's explicitly tied to pushes, so we
 >can't just throw it in a cron job.

Any news?
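
For reference, the diff step discussed above could look roughly like the
sketch below. It assumes each line of fullfilelist has the layout that
"rsync -r ." prints (permissions, size, date, time, path); the function
name changed_paths, the file names and the module name in the usage
comment are placeholders of mine, not anything that exists on the master.

```shell
# Sketch only: print the paths that are new or changed between two
# listings produced by `rsync -r . > fullfilelist`.
# Assumed line layout: permissions, size, date, time, path.
changed_paths() {
    # $1 = previous fullfilelist, $2 = newly fetched fullfilelist
    awk '
        FILENAME == ARGV[1] { old[$0] = 1; next }       # remember old entries
        ($0 in old) || NF < 5 || $1 ~ /^d/ { next }     # unchanged, blank, or a directory
        {
            line = $0
            # drop the four leading fields, keeping spaces inside the path
            for (i = 0; i < 4; i++) sub(/^ *[^ ]+ +/, "", line)
            if ($1 ~ /^l/) sub(/ -> .*$/, "", line)     # symlink: drop the target
            print line
        }
    ' "$1" "$2"
}

# Hypothetical usage; the module name and paths are placeholders:
#   changed_paths fullfilelist.old fullfilelist.new > to-fetch
#   rsync -a --files-from=to-fetch upstream::module/ /path/to/repository/
```

This only covers files that were created or modified. Removed files would
still have to be found by comparing the names in the two lists, as mirrors
can already do with the current fullfilelist.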
