[mirror-admin] fullfilelist (was Re: Please use --delay-updates)
Angel Marin
anmar at anmar.eu.org
Tue Apr 20 14:26:53 EDT 2010
On 20/04/10 17:45, Carlos Carvalho wrote:
> Mike McGrath (mmcgrath at redhat.com) wrote on 20 April 2010 09:48:
> >On Fri, 16 Apr 2010, Carlos Carvalho wrote:
> >
> >> Chuck Anderson (cra at wpi.edu) wrote on 16 April 2010 08:41:
> >> >Each time you run rsync against your upstream mirror, it scans the
> >> >entire filesystem to build a filelist. This could take anywhere from
> >> >5 to 20 minutes or more
> >>
> >> More... :-(
> >>
> >> >and has been a factor in overloading the master mirrors in the past.
> >>
> >> I'd say nowadays too... The table below shows the time we take just to
> >> get the file list from sync.fedoraproject, for the last days. We
> >> mirror everything starting from release 11. It shows clearly that the
> >> machine suffers significantly from disk scanning. The file list is
> >> only about 22MB. Times are in UTC-3.
> >>
> >> If fullfilelist was done properly we could completely avoid this
> >> scanning...
> >>
> >
> >Can you expand more on this, how can we do fullfilelist properly?
>
> Including timestamp and size (and type of object).
>
> The current version only gives the names. Downstream mirrors can use it
> to see what has been removed and created but cannot know what has been
> modified. They're thus forced to request a full disk scan. If you put
> the necessary info in fullfilelist, mirrors can rsync it, see
> *everything* that must be updated and directly request only what's
> necessary with rsync --files-from. This way no disk scanning would be
> necessary upstream.
>
> The format I propose is the one generated by rsync itself:
>
> % cd /path/to/repository
> % rsync -r . > /path/to/fullfilelist
Wouldn't it make sense to generate that list with the timezone forced
to, for example, UTC for that rsync run?
Otherwise the reported dates would depend on a particular server's
configuration, which might change in the future and break parsers ...
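Something along these lines is what I have in mind. It is only a sketch, assuming rsync formats the listing times in the local timezone of its own process, so that setting TZ for that one command is enough. The function name make_fullfilelist and its two arguments are made up for illustration.

```shell
# Hypothetical helper: write an rsync-format listing of the repository
# in $1 to the file $2, with timestamps rendered in UTC.
make_fullfilelist() {
    repo=$1
    out=$2
    # TZ=UTC is set for this rsync process only (assumption: rsync uses
    # the process timezone when printing modification times).
    # The subshell keeps the cd from leaking into the caller, and the
    # redirection is opened before the cd, so a relative $out still works.
    ( cd "$repo" && TZ=UTC rsync -r . ) > "$out"
}
```

It would be called as: make_fullfilelist /path/to/repository /path/to/fullfilelist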
> If you want fullfilelist to include itself it's of course necessary to
> adjust it afterwards but that's easy. Note also that "self-inclusion"
> is not necessary because mirrors would pull it always.
>
> It's possible to maintain this list without scanning the repo; it can
> be done by the procedure that updates the master. However even if it's
> done by scanning, its cost will be compensated by the scans that the
> mirrors will not inflict on the master. Even if it's only
> fedora.c3sl.ufpr.br that avoids it :-)
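On the mirror side, the comparison described above could look roughly like the sketch below. It assumes every line of fullfilelist has the layout "permissions size date time name" as printed by rsync, and that the mirror keeps the previous copy of the list around. The function name changed_files is made up for illustration, and symlinks and removed files are not handled.

```shell
# Hypothetical helper: print the paths that are new or modified in the
# listing $2 compared with the older listing $1.
changed_files() {
    old=$1
    new=$2
    awk '
        # Drop the first four fields and return the remainder as the
        # name, so that names containing spaces survive intact.
        function name_of(line) {
            sub(/^[^ ]+ +[^ ]+ +[^ ]+ +[^ ]+ /, "", line)
            return line
        }
        # Old listing: remember "size date time" for every name.
        FILENAME == ARGV[1] { seen[name_of($0)] = $2 " " $3 " " $4; next }
        # New listing: directories are never requested on their own.
        /^d/ { next }
        # New listing: print names that are unknown or whose size or
        # timestamp differs from the old listing.
        {
            n = name_of($0)
            if (!(n in seen) || seen[n] != $2 " " $3 " " $4) print n
        }
    ' "$old" "$new"
}
```

The output is one path per line, which is the form rsync --files-from expects, so only those files would be requested from upstream.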