October 31, 2012

rsync is not enough

Ask how one can create a Debian mirror and you will get a dozen different responses. Those who are used to mirroring content, whether it is distribution packages, software in general, documents, etc., will usually come up with one answer: rsync.

Truth is: for Debian's archive, rsync is not enough.

Due to the design of the archive, a single call to rsync leaves the mirror in an inconsistent state during most of the sync process.
The explanation is simple: index files are stored in dists/, package files are stored in pool/. Sort those directories by name (just like rsync does) and the indexes end up being updated before the actual packages are downloaded.
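A typical single invocation looks something like this (the mirror host and the local path are made up for illustration):

    # One rsync run over the whole archive: entries are transferred in
    # lexical order, so dists/ (the indexes) is updated before pool/
    # (the packages those indexes point to).
    rsync -avH --delete \
        rsync://some.mirror.example/debian/ /srv/mirror/debian/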

There are multiple scripts out there that do exactly that, one of them in Ubuntu's wiki. Plenty more if you search the web.

Now, addressing that issue shouldn't be so difficult, right? After all, all the index files are in dists/, so syncing in two stages should be enough. It's not that simple.

With the dists/ directory containing over 8.5GB worth of indexes and, erm, installer files, even a two-stage sync will usually leave the mirror in an inconsistent state for a while.
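In rsync terms, the two-stage idea is usually something like this (host and paths are, again, only illustrative):

    # Stage 1: everything except dists/, so no index is touched yet
    rsync -avH --exclude='/dists/' \
        rsync://some.mirror.example/debian/ /srv/mirror/debian/

    # Stage 2: update dists/ and remove the files the new indexes no longer need
    rsync -avH --delete \
        rsync://some.mirror.example/debian/ /srv/mirror/debian/

The second pass still has to transfer everything that changed under dists/, so the window of inconsistency shrinks but does not go away.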

How about deferring only the bare minimum to the second stage, I hear you ask?
That is the current approach, but it leads to errors when new index files are added and start being used. The fact that people insist on writing their own scripts doesn't help.
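That bare minimum is the set of index files clients actually parse. Very roughly, and without pretending this is what ftpsync really does (the exclude patterns below are only indicative), the recipe becomes:

    # Stage 1: everything except the index files themselves
    rsync -avH \
        --exclude='Packages*' --exclude='Sources*' \
        --exclude='Release*' --exclude='InRelease' \
        rsync://some.mirror.example/debian/ /srv/mirror/debian/

    # Stage 2: a short pass that brings the index files in and only then
    # deletes what the new indexes no longer reference
    rsync -avH --delete --delete-after \
        rsync://some.mirror.example/debian/ /srv/mirror/debian/

The catch is the exclude list: every time the archive gains a new kind of index file, each home-grown copy of this recipe has to learn about it, and that is exactly where those scripts tend to fall behind.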

Hopefully, ideas like moving the installer files out of dists/ and overhauling the repository layout are being considered. An alternative is to make the clients of the mirrors more robust and fault-tolerant, but we would be talking about tens if not hundreds of tools that would need to be improved.

In any case, the one script that is actively maintained, rather portable, and improved from time to time is the ftpsync script. Please, do yourself and your users a favour: don't attempt to reinvent the wheel (and forget about calling rsync just once).

7 comments:

  1. What about just keeping your mirror on LVM, making a snapshot, and having the public server point to the snapshot while you rsync the master copy?
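    In concrete terms that would be something like this (the volume group, sizes and paths below are made up):

        # the master copy lives on /dev/vg0/debian, mounted at /srv/master/debian;
        # the public server serves a read-only snapshot of it
        lvcreate --snapshot --name debian-pub --size 20G /dev/vg0/debian
        mount -o ro /dev/vg0/debian-pub /srv/public/debian

        # update the master copy behind the scenes
        rsync -avH --delete \
            rsync://some.mirror.example/debian/ /srv/master/debian/

        # once the sync has finished, replace the snapshot with a fresh one
        umount /srv/public/debian
        lvremove -f /dev/vg0/debian-pub
        lvcreate --snapshot --name debian-pub --size 20G /dev/vg0/debian
        mount -o ro /dev/vg0/debian-pub /srv/public/debian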

    Replies
    1. Well, my approach is to:

      * download the dists databases and read them
      * download the packages listed in them that are not already present in the mirror
      * if all the files were downloaded successfully:
        * update the dists databases
        * remove the files in the mirror that are no longer listed in the databases
      * else:
        * keep the old databases
        * and also keep the new packages, so next time I don't have to download them again.

      You can check the script, which supports not only Debian repos but also yum, urpmi, Slackware and CPAN, here:

      https://github.com/StyXman/psync
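      The gist of those steps, rendered as a very rough shell sketch just to show the shape (this is not psync itself; the URLs, suite and paths are made up):

          # 1. fetch the new index, keeping it out of the published tree for now
          base=http://some.mirror.example/debian
          dist=dists/stable/main/binary-amd64
          mirror=/srv/mirror/debian
          new=$mirror/.new
          mkdir -p "$new/$dist" "$mirror/$dist"
          wget -q -O "$new/$dist/Packages.gz" "$base/$dist/Packages.gz" || exit 1

          # 2. download every listed package we don't already have
          ok=yes
          for f in $(zcat "$new/$dist/Packages.gz" | awk '/^Filename:/ {print $2}'); do
              [ -e "$mirror/$f" ] && continue
              mkdir -p "$mirror/$(dirname "$f")"
              wget -q -O "$mirror/$f" "$base/$f" || { ok=no; break; }
          done

          # 3. publish the new index only if every download succeeded; otherwise
          #    keep the old one (and the packages that did arrive). Removing files
          #    no longer listed is left out here for brevity.
          [ "$ok" = yes ] && mv "$new/$dist/Packages.gz" "$mirror/$dist/Packages.gz"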

    2. Jim: due to the diversity and the kind of systems used on the mirrors, it is unlikely that any mirror would use such an approach.

      Marcos: it doesn't appear to know about the changes made to the archive format in the last few years. Moreover, it uses HTTP, which is suboptimal for mirroring, and it doesn't mirror other files that are important. Perhaps it could work for a local, partial mirror, but not at all for a mirror that ought to be part of the mirrors network.
      For those reading at home: debmirror also allows partial mirrors to be created, can use rsync, and is maintained and kept up to date with changes to the Debian archive format.

      Marcos, please don't take my comment the wrong way. I believe it is good to have different implementations that explore different approaches. However, I also believe they should eventually converge or allow others to take their place.

    3. "Jim: due to the diversity and the kind of systems used on the mirrors it would be unlikely that any mirror would use such approach."

      That's weird. Even if some mirrors have inferior software and can't play along, it still seems prudent to suggest a non-fragile solution like snapshots for those with modern capabilities.

    4. True, but the key point here is feasibility. If anybody is willing to spend time implementing such a tool and then getting some mirrors to buy the idea and deploy it: good.

      I'm sceptical about people actually using it. As I mentioned, there are still people who only run rsync once...

  2. Well, maybe a quite foolish idea:

    The main repository that is replicated should be intact all the time, right? So what if you rsync it to a new place that is not published, using rsync --link-dest=$published_place, and switch it atomically once all syncs have completed successfully?
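    Something along these lines (paths are made up; the last two commands do the switch with a single rename):

        # build the new tree next to the published one; files that did not
        # change become hard links into the current copy, so it is cheap on disk
        rsync -avH --delete \
            --link-dest=/srv/mirror/debian/ \
            rsync://some.mirror.example/debian/ /srv/mirror/debian-new/

        # /srv/mirror/debian is the symlink the web/rsync server actually
        # exports; replacing it via rename(2) is atomic on the file system
        ln -s debian-new /srv/mirror/debian.tmp
        mv -T /srv/mirror/debian.tmp /srv/mirror/debian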

    Replies
    1. Making the switch atomic is one of the biggest issues. Even if that property can easily be satisfied at the file system level, there is still the problem of a user starting to update while the mirror is in state A and, while that user is still fetching index files, the mirror switching to state B.
      In that case, and in spite of the atomicity, the mirror is in an inconsistent state from the user's point of view.

      The current aim is to avoid needing an atomic update in the first place.
