[mirror-admin] fullfilelist (was Re: Please use --delay-updates)
Brian Long
brilong at cisco.com
Wed Apr 21 09:38:44 EDT 2010
On 04/20/2010 06:50 PM, Carlos Carvalho wrote:
> Mike McGrath (mmcgrath at redhat.com) wrote on 20 April 2010 14:52:
> >On Tue, 20 Apr 2010, Carlos Carvalho wrote:
> >
> >> Mike McGrath (mmcgrath at redhat.com) wrote on 20 April 2010 09:48:
> >> >On Fri, 16 Apr 2010, Carlos Carvalho wrote:
> >> >
> >> >> Chuck Anderson (cra at wpi.edu) wrote on 16 April 2010 08:41:
> >> >> >Each time you run rsync against your upstream mirror, it scans the
> >> >> >entire filesystem to build a filelist. This could take anywhere from
> >> >> >5 to 20 minutes or more
> >> >>
> >> >> More... :-(
> >> >>
> >> >> >and has been a factor in overloading the master mirrors in the past.
> >> >>
> >> >> I'd say nowadays too... The table below shows the time we take just to
> >> >> get the file list from sync.fedoraproject, for the last days. We
> >> >> mirror everything starting from release 11. It shows clearly that the
> >> >> machine suffers significantly from disk scanning. The file list is
> >> >> only about 22MB. Times are in UTC-3.
> >> >>
> >> >> If fullfilelist was done properly we could completely avoid this
> >> >> scanning...
> >> >>
> >> >
> >> >Can you expand more on this, how can we do fullfilelist properly?
> >>
> >> Including timestamp and size (and type of object).
> >>
> >> The current version only gives the names. Downstream mirrors can use it
> >> to see what has been removed and created but cannot know what has been
> >> modified. They're thus forced to request a full disk scan. If you put
> >> the necessary info in fullfilelist mirrors can rsync it, see
> >> *everything* that must be updated and directly request only what's
> >> necessary with rsync --files-from. This way no disk scanning would be
> >> necessary upstream.
> >>
> >> The format I propose is the one generated by rsync itself:
> >>
> >> % cd /path/to/repository
> >> % rsync -r . > /path/to/fullfilelist
> >>
> >> If you want fullfilelist to include itself it's of course necessary to
> >> adjust it afterwards but that's easy. Note also that "self-inclusion"
> >> is not necessary because mirrors would pull it always.
> >>
> >> It's possible to maintain this list without scanning the repo; it can
> >> be done by the procedure that updates the master. However even if it's
> >> done by scanning, its cost will be compensated by the scans that the
> >> mirrors will not inflict on the master. Even if it's only
> >> fedora.c3sl.ufpr.br that avoids it :-)
> >>
> >
> >I take it once you pull that fullfilelist down, you'll do a diff against
> >the fullfilelist you currently have to generate a final list or is there a
> >step in there I'm not following?
>
> That's the general idea, yes.
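
A minimal sketch of the diff step discussed in the quoted exchange, assuming fullfilelist uses the `rsync -r .` listing format proposed above (permissions, size, date, time, path). The function names `parse_listing` and `plan_update` are hypothetical, and directories are skipped as a simplification.

```python
def parse_listing(text):
    """Map path -> (type, size, date, time, link target) for each listed entry.

    Assumes lines shaped like the output of `rsync -r .`:
        -rw-r--r--        1234 2010/04/20 12:00:00 some/path
    """
    entries = {}
    for line in text.splitlines():
        fields = line.split(None, 4)  # maxsplit keeps spaces inside the path
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        perms, size, date, clock, path = fields
        target = None
        if perms.startswith("l") and " -> " in path:
            # Symlinks are listed as "name -> target".
            path, target = path.split(" -> ", 1)
        # Some rsync versions print sizes with thousands separators.
        entries[path] = (perms[0], int(size.replace(",", "")), date, clock, target)
    return entries


def plan_update(old_text, new_text):
    """Compare the previous and the freshly pulled list.

    Returns (to_fetch, to_delete): paths that are new or whose metadata
    changed, and paths that disappeared upstream.
    """
    old = parse_listing(old_text)
    new = parse_listing(new_text)
    to_fetch = sorted(
        path for path, meta in new.items()
        if meta[0] != "d" and old.get(path) != meta  # simplification: skip directories
    )
    to_delete = sorted(path for path in old if path not in new)
    return to_fetch, to_delete
```

Written one path per line, `to_fetch` is the kind of list that could be handed to `rsync --files-from`, as the quoted proposal describes.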

I think this has been brought up in the past, but if you miss a single
version of fullfilelist, how do you get back up to date without a full
disk scan? If fedora.c3sl.ufpr.br syncs every 8 hours, for example, but
fullfilelist changed twice on the upstream mirror in that window,
fedora.c3sl.ufpr.br will be out of date because it missed one of the two
fullfilelist versions. Is that correct, or am I overlooking something?

I'm worried that the tiered mirror system and fullfilelist are not a
good combination, because a tier 2 or tier 3 mirror might miss one or
two of these files and be marked by MirrorManager as out of date. If
there were an email notification system inside MirrorManager that could
alert mirror admins when their system falls out of date, this might help.

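
To make the missed-version scenario above concrete, here is a hedged sketch of one way a mirror could check itself against whatever fullfilelist it pulled most recently, rather than against the previous list. It only applies if the list is a complete snapshot carrying size and mtime, as proposed in the quoted text. The function name `stale_paths` is hypothetical, and the timestamps are assumed to be UTC, which is an assumption because rsync prints local time. The scan it performs is on the downstream mirror's own disk, not on the master.

```python
import calendar
import os
import time


def stale_paths(listing_text, local_root):
    """Return listed regular files that are missing or differ locally.

    Assumes lines shaped like "perms size YYYY/MM/DD HH:MM:SS path" and
    that the listed timestamps are UTC (an assumption for this sketch).
    """
    stale = []
    for line in listing_text.splitlines():
        fields = line.split(None, 4)
        if len(fields) < 5 or not fields[0].startswith("-"):
            continue  # this sketch only checks regular files
        _, size, date, clock, path = fields
        want_size = int(size.replace(",", ""))
        want_mtime = calendar.timegm(
            time.strptime(date + " " + clock, "%Y/%m/%d %H:%M:%S")
        )
        try:
            st = os.stat(os.path.join(local_root, path))
        except FileNotFoundError:
            stale.append(path)  # not present locally at all
            continue
        if st.st_size != want_size or int(st.st_mtime) != want_mtime:
            stale.append(path)  # present, but size or mtime differs
    return sorted(stale)
```

Whether this removes the dependence on intermediate versions in practice is not settled in this thread; it is shown only to illustrate the question.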
/Brian/
--
Brian Long
Corporate Security Programs Org.
Cisco