[mirror-admin] rsync filtering to reduce master mirror load
J.H.
warthog9 at kernel.org
Sat May 7 15:17:09 EDT 2011
On 05/07/2011 11:59 AM, dale at fedoraproject.org wrote:
> On Fri, 10 Apr 2009, J.H. wrote:
>> Matt Domsch wrote:
>>> One of the things that's bothered me for a while is that each mirror
>>> syncs itself to its upstream mirror (either a master, Tier 0, or Tier
>>> 1). But in general, content on the master mirrors changes only a
>>> few times a day (generally one rawhide push, one updates/ push, one
>>> pub/epel/updates/testing push). Most of the content doesn't change.
>>> And running rsync to discover that nothing has changed is expensive on
>>> the upstream server - millions of stat() calls.
>>>
>>> I call these the "null rsyncs".
>>>
>>> In the next version of MirrorManager to roll out (hopefully today if I
>>> finish working the bugs out), the MM database now keeps track of the
>>> "last changed time" of each directory. Using this, it can generate an
>>> rsync FILTER RULES file (rsync --exclude-from=<somefile>), which rsync
>>> then uses to reduce the full directory tree traversal, and limits it
>>> only to those directory paths that have changed.
>>>
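To illustrate the idea of limiting traversal to changed paths, here is a
minimal sketch of how such a filter file could be built from a list of
changed directories. It is not MirrorManager's actual code or output: the
function name emit_filter_rules and the input format (one directory per
line, relative to the module root) are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of MirrorManager): reads changed directory
# paths on stdin, one per line, and prints rsync filter rules that include
# only those directories and exclude everything else.
emit_filter_rules() {
    while IFS= read -r dir; do
        # Normalise: drop one leading and one trailing slash.
        rest=${dir#/}
        rest=${rest%/}
        [ -n "$rest" ] || continue
        prefix=""
        # Include every ancestor so rsync can descend to the changed directory.
        while [ -n "$rest" ]; do
            part=${rest%%/*}
            prefix="$prefix/$part"
            printf '+ %s/\n' "$prefix"
            case $rest in
                */*) rest=${rest#*/} ;;
                *)   rest="" ;;
            esac
        done
        # Include everything below the changed directory itself.
        printf '+ %s/**\n' "$prefix"
    done
    # Anything not included above is skipped, so it is never traversed.
    printf '%s\n' '- *'
}
```

Saved to a file, the output can be handed to rsync with
--exclude-from=<somefile>; the leading "+ " and "- " on each line are
honoured there. Repeated ancestor lines for sibling directories are
harmless, since rsync acts on the first rule that matches.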
>>> For example, this script:
>>>
>>> #!/bin/sh
>>> now=$(date -u +%s)
>>> yesterday=$((now - (24 * 60 * 60)))
>>> wget -O - \
>>>   "http://localhost/mirrormanager/rsyncFilter?categories=Fedora%20Linux&since=$yesterday&stripprefix=pub/fedora" \
>>>   2>/dev/null
>>>
>>>
>>> returns an rsync filter rules file that looks like:
>>
>>
>> So this solves the problem for effectively the 'tier 0' or 'tier 1'
>> mirrors, and the few people who are still syncing directly from
>> Fedora. I would love, and I'm sure I'm not alone in this, the ability
>> (maybe through report_mirror) for a tier [01] mirror that completes a
>> sync to report, or otherwise make discoverable, where it is in its
>> update schedule. This would then allow tier [n+1] mirrors to add a
>> small change to your URL above, something like:
>>
>> https://admin.fedoraproject.org/mirrormanager/rsyncFilter?categories=Fedora%20Linux&since=$yesterday&stripprefix=pub/fedora&upstream=<upstream base url like mirrors.kernel.org>
>>
>> And the tier [n+1] mirrors could then get an rsync list specific to
>> the mirror they are syncing from. I would be more than happy to
>> modify my rsync script to take advantage of this, and post it back
>> here, should something like this get implemented.
>>
>> Something like this would really help the larger mirrors, cut rsync
>> times down and likely help keep people better in sync.
>>
>> Just my $0.02
>>
>> - John 'Warthog9' Hawley
>> Chief Kernel.org Administrator
>>
>> --
>
> Hi, I was wondering if any progress was ever made on John's suggestion
> to allow querying for changes of arbitrary upstream mirrors.
>
> I also wonder how this has worked out. Are many mirrors using this
> feature, and has it been helpful to the upstreams? It still seems like a
> good idea.
>
> I'm looking into starting another mirror, and decided to clean up old
> code I used for a mirror before and place it on github. It currently
> uses the rsyncFilter above, which isn't really safe for a tier 2 mirror.
>
> https://github.com/dlbewley/mirror-fedora
The long and short of it was that it was too complicated and too rife
with potential for failure. I know we turned it off completely a long
time ago and generally accepted that a null rsync was safer than missing
content changes.
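
The trade-off above (a filtered sync is cheaper, but a bad or missing
filter silently skips content) can be hedged by falling back to a full
sync whenever the filter cannot be used. A minimal sketch, with sync_mode
as a hypothetical helper name and "file exists and is non-empty" as the
only sanity check, which is an assumption and not how any real mirror
script is known to work:

```shell
#!/bin/sh
# Hypothetical helper: decide whether a fetched filter file may be used.
# Prints "filtered" if the file exists and is non-empty, otherwise "full"
# so the caller falls back to an unfiltered (null) rsync.
sync_mode() {
    filter_file=$1
    if [ -s "$filter_file" ]; then
        echo filtered
    else
        echo full
    fi
}
```

A wrapper script could add --exclude-from="$filter_file" to its rsync
command only in the "filtered" case. This does not detect a filter that
is well-formed but stale, which is the harder failure to guard against.
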
There's other work in progress out there, like Chasmd, that would solve
the problem in roughly the same way, and it's something I know I've been
looking into.
- John 'Warthog9' Hawley