[mirror-admin] fullfilelist (was Re: Please use --delay-updates)

Carlos Carvalho carlos at fisica.ufpr.br
Wed Apr 21 20:09:27 EDT 2010


Mike McGrath (mmcgrath at redhat.com) wrote on 21 April 2010 15:58:
 >On Tue, 20 Apr 2010, Carlos Carvalho wrote:
 >
 >> Mike McGrath (mmcgrath at redhat.com) wrote on 20 April 2010 14:52:
 >>  >On Tue, 20 Apr 2010, Carlos Carvalho wrote:
 >>  >
 >>  >> Mike McGrath (mmcgrath at redhat.com) wrote on 20 April 2010 09:48:
 >>  >>  >On Fri, 16 Apr 2010, Carlos Carvalho wrote:
 >>  >>  >
 >>  >>  >> Chuck Anderson (cra at wpi.edu) wrote on 16 April 2010 08:41:
 >>  >>  >>  >Each time you run rsync against your upstream mirror, it scans the
 >>  >>  >>  >entire filesystem to build a filelist.  This could take anywhere from
 >>  >>  >>  >5 to 20 minutes or more
 >>  >>  >>
 >>  >>  >> More... :-(
 >>  >>  >>
 >>  >>  >>  >and has been a factor in overloading the master mirrors in the past.
 >>  >>  >>
 >>  >>  >> I'd say nowadays too... The table below shows the time we take just to
 >>  >>  >> get the file list from sync.fedoraproject, for the last days. We
 >>  >>  >> mirror everything starting from release 11. It shows clearly that the
 >>  >>  >> machine suffers significantly from disk scanning. The file list is
 >>  >>  >> only about 22MB. Times are in UTC-3.
 >>  >>  >>
 >>  >>  >> If fullfilelist was done properly we could completely avoid this
 >>  >>  >> scanning...
 >>  >>  >>
 >>  >>  >
 >>  >>  >Can you expand more on this, how can we do fullfilelist properly?
 >>  >>
 >>  >> Including timestamp and size (and type of object).
 >>  >>
 >>  >> The current version only gives the names. Downstream mirrors can use it
 >>  >> to see what has been removed and created but cannot know what has been
 >>  >> modified. They're thus forced to request a full disk scan. If you put
 >>  >> the necessary info in fullfilelist mirrors can rsync it, see
 >>  >> *everything* that must be updated and directly request only what's
 >>  >> necessary with rsync --files-from. This way no disk scanning would be
 >>  >> necessary upstream.
 >>  >>
 >>  >> The format I propose is the one generated by rsync itself:
 >>  >>
 >>  >> % cd /path/to/repository
 >>  >> % rsync -r . > /path/to/fullfilelist
 >>  >>
 >>  >> If you want fullfilelist to include itself it's of course necessary to
 >>  >> adjust it afterwards but that's easy. Note also that "self-inclusion"
 >>  >> is not necessary because mirrors would pull it always.
 >>  >>
 >>  >> It's possible to maintain this list without scanning the repo; it can
 >>  >> be done by the procedure that updates the master. However even if it's
 >>  >> done by scanning, its cost will be compensated by the scans that the
 >>  >> mirrors will not inflict on the master. Even if it's only
 >>  >> fedora.c3sl.ufpr.br that avoids it :-)
 >>  >>
 >>  >
 >>  >I take it once you pull that fullfilelist down, you'll do a diff against
 >>  >the fullfilelist you currently have to generate a final list or is there a
 >>  >step in there I'm not following?
 >>
 >> That's the general idea, yes.
 >>
 >
 >The wheels are in motion

Good!

 >matter of working with releng to find exactly
 >where that command should go and how it should get triggered.  It takes 45
 >minutes or so to run and it's explicitly tied to pushes so we can't just
 >throw it in a cron job.

Note that it's possible to avoid the full scan if you know what's
changed. One possibility is to give the changed names to rsync and
tell it to scan just those, then merge the output with the previous
filelist. Details depend on the internal update process that builds
the master repo.
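
As a rough illustration of the merge step, here is a small Python sketch.
It assumes the filelist is in the "rsync -r ." listing format proposed
earlier, i.e. five whitespace-separated fields per line (permissions,
size, date, time, name), and that the update procedure already knows
every name it added, modified or removed. The names entry_path and
merge_listing are hypothetical; they are not part of rsync or of any
existing tooling, and how the partial listing is produced is left open.

```python
def entry_path(line):
    """Return the path of one listing line, or None if it is not an entry."""
    parts = line.split(None, 4)  # permissions, size, date, time, name
    if len(parts) != 5:
        return None
    perms, name = parts[0], parts[4]
    if perms.startswith("l"):
        # Assumption: symlinks, if listed, appear as "name -> target".
        name = name.split(" -> ", 1)[0]
    return name


def merge_listing(previous, partial, changed):
    """Merge a partial rescan into the previous full listing.

    previous: text of the last full filelist
    partial:  listing text covering only the changed names
    changed:  every name that was added, modified or removed
    """
    merged = {}
    for line in previous.splitlines():
        path = entry_path(line)
        if path is not None:
            merged[path] = line
    # Forget the old state of everything that changed; names that no
    # longer exist simply do not come back from the partial listing.
    for path in changed:
        merged.pop(path, None)
    for line in partial.splitlines():
        path = entry_path(line)
        if path is not None:
            merged[path] = line
    # Plain string sort; rsync's own ordering may differ.
    return "".join(merged[p] + "\n" for p in sorted(merged))
```

The result can replace the previous fullfilelist. Since only the changed
names are looked at, the cost is proportional to the size of the push
and not to the size of the repository.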

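On the mirror side, the diff step discussed earlier in the thread could
look roughly like the following. It is only a sketch under the same
format assumption (five whitespace-separated fields per line);
parse_listing and plan_update are made-up names used for illustration.

```python
def parse_listing(text):
    """Map each path to its (permissions, size, date, time, target) tuple."""
    entries = {}
    for line in text.splitlines():
        parts = line.split(None, 4)
        if len(parts) != 5:
            continue  # blank or unrecognised line
        perms, size, date, time, name = parts
        target = ""
        if perms.startswith("l"):
            # Assumption: symlinks, if listed, appear as "name -> target".
            name, _, target = name.partition(" -> ")
        entries[name] = (perms, size, date, time, target)
    return entries


def plan_update(old_text, new_text):
    """Compare two filelists; return (names to fetch, names to delete)."""
    old = parse_listing(old_text)
    new = parse_listing(new_text)
    # Fetch anything that is new or whose type, size or timestamp differs.
    fetch = sorted(p for p, meta in new.items() if old.get(p) != meta)
    # Anything that vanished from the list is removed locally.
    delete = sorted(p for p in old if p not in new)
    return fetch, delete
```

The fetch list, written one name per line, is what would be handed to
rsync --files-from; the deletions are done locally without contacting
the master at all.
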