[mirror-admin] fullfilelist (was Re: Please use --delay-updates)

Mike McGrath mmcgrath at redhat.com
Wed Apr 21 16:58:20 EDT 2010


On Tue, 20 Apr 2010, Carlos Carvalho wrote:

> Mike McGrath (mmcgrath at redhat.com) wrote on 20 April 2010 14:52:
>  >On Tue, 20 Apr 2010, Carlos Carvalho wrote:
>  >
>  >> Mike McGrath (mmcgrath at redhat.com) wrote on 20 April 2010 09:48:
>  >>  >On Fri, 16 Apr 2010, Carlos Carvalho wrote:
>  >>  >
>  >>  >> Chuck Anderson (cra at wpi.edu) wrote on 16 April 2010 08:41:
>  >>  >>  >Each time you run rsync against your upstream mirror, it scans the
>  >>  >>  >entire filesystem to build a filelist.  This could take anywhere from
>  >>  >>  >5 to 20 minutes or more
>  >>  >>
>  >>  >> More... :-(
>  >>  >>
>  >>  >>  >and has been a factor in overloading the master mirrors in the past.
>  >>  >>
>  >>  >> I'd say nowadays too... The table below shows the time we take just to
>  >>  >> get the file list from sync.fedoraproject over the last few days. We
>  >>  >> mirror everything starting from release 11. It shows clearly that the
>  >>  >> machine suffers significantly from disk scanning. The file list is
>  >>  >> only about 22MB. Times are in UTC-3.
>  >>  >>
>  >>  >> If fullfilelist was done properly we could completely avoid this
>  >>  >> scanning...
>  >>  >>
>  >>  >
>  >>  >Can you expand more on this, how can we do fullfilelist properly?
>  >>
>  >> Including timestamp and size (and type of object).
>  >>
>  >> The current version only gives the names. Downstream mirrors can use it
>  >> to see what has been removed and created but cannot know what has been
>  >> modified. They're thus forced to request a full disk scan. If you put
>  >> the necessary info in fullfilelist mirrors can rsync it, see
>  >> *everything* that must be updated and directly request only what's
>  >> necessary with rsync --files-from. This way no disk scanning would be
>  >> necessary upstream.
>  >>
>  >> The format I propose is the one generated by rsync itself:
>  >>
>  >> % cd /path/to/repository
>  >> % rsync -r . > /path/to/fullfilelist
>  >>
>  >> If you want fullfilelist to include itself it's of course necessary to
>  >> adjust it afterwards but that's easy. Note also that "self-inclusion"
>  >> is not necessary because mirrors would pull it always.
>  >>
>  >> It's possible to maintain this list without scanning the repo; it can
>  >> be done by the procedure that updates the master. However even if it's
>  >> done by scanning, its cost will be compensated by the scans that the
>  >> mirrors will not inflict on the master. Even if it's only
>  >> fedora.c3sl.ufpr.br that avoids it :-)
>  >>
>  >
>  >I take it that once you pull that fullfilelist down, you'll do a diff against
>  >the fullfilelist you currently have to generate a final list, or is there a
>  >step in there I'm not following?
>
> That's the general idea, yes.
>
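
A minimal sketch of that diff step, for illustration only. It assumes both
copies of fullfilelist are in the listing format proposed above, one entry
per line as "type/permissions size date time name". The function names and
the sample paths are made up for the example and are not part of any
existing tool.

```python
def parse_listing(text):
    """Map each path in an rsync-style listing to its metadata tuple."""
    entries = {}
    for line in text.splitlines():
        # Assumed layout: perms, size, date, time, then the name.
        # The name is kept whole because it may contain spaces.
        parts = line.split(None, 4)
        if len(parts) < 5:
            continue  # skip blank or unrecognised lines
        perms, size, date, time, name = parts
        target = None
        # Assumption: symlink entries, if listed, look like "name -> target".
        if perms.startswith("l") and " -> " in name:
            name, target = name.split(" -> ", 1)
        # Some rsync versions print sizes with thousands separators.
        entries[name] = (perms, size.replace(",", ""), date, time, target)
    return entries


def diff_listings(old_text, new_text):
    """Return (to_fetch, to_delete) given the old and new listings."""
    old = parse_listing(old_text)
    new = parse_listing(new_text)
    # New or modified: absent from the old list, or metadata differs.
    to_fetch = sorted(p for p, meta in new.items() if old.get(p) != meta)
    # Removed upstream: present in the old list only.
    to_delete = sorted(p for p in old if p not in new)
    return to_fetch, to_delete


# Made-up sample listings, not real mirror contents.
OLD = """\
-rw-r--r--        1024 2010/04/19 10:00:00 releases/12/README
-rw-r--r--        3000 2010/04/19 10:00:00 updates/12/repomd.xml
-rw-r--r--        2048 2010/04/19 10:00:00 updates/12/old.rpm
"""
NEW = """\
-rw-r--r--        1024 2010/04/19 10:00:00 releases/12/README
-rw-r--r--        3100 2010/04/21 09:30:00 updates/12/repomd.xml
-rw-r--r--        4096 2010/04/21 09:30:00 updates/12/new.rpm
"""

if __name__ == "__main__":
    to_fetch, to_delete = diff_listings(OLD, NEW)
    print("\n".join(to_fetch))
```

Written one path per line, to_fetch is the kind of list that could be
handed to rsync --files-from, and to_delete is what the mirror would remove
locally. Directory entries are not special-cased here; a real script would
need to decide how to treat them.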

The wheels are in motion; it's a matter of working with releng to find exactly
where that command should go and how it should get triggered.  It takes 45
minutes or so to run and it's explicitly tied to pushes, so we can't just
throw it in a cron job.

	-Mike
