December 24, 2012

A bashism a week: taking a break

Short notice: due to the holidays, and people rightfully not paying much attention to the online world, there won't be a post from the "a bashism a week" series this Wednesday.

Enjoy the break.

December 19, 2012

A bashism a week: testing for equality

Well known, yet easy to find just about everywhere: using the "test"/"[" commands to test for equality with two equals signs (==).

Contrary to many programming languages, if you want to test for equality in a shell script you must use a single equals sign.

Try to keep this in mind: under a shell that implements what is required by POSIX:2001, you may hit the unexpected branch in the following code:

if [ foo == foo ]; then
    echo expected
else
    echo unexpected
fi
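
For comparison, the portable version of the same snippet uses a single equals sign:

if [ foo = foo ]; then
    echo expected
else
    echo unexpected
fi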

December 18, 2012

Nicer, but stricter

Lately I've been working on making the redirector nicer to the mirrors and to some potential users. More specifically, those behind a caching proxy.

The redirector is now nicer to traditional web proxies by redirecting to objects that are known not to change with a "moved permanently" code (HTTP status code 301.) This applies to files in the pool/ directory and ".pdiff" files, among others.
Previously, a traditional caching web proxy would sometimes end up with multiple copies of the same object, fetched from different mirrors, and the redirection itself would not be cached at all. With this change, that is no longer the case.

Using a caching proxy that is aware of the Debian repository design is still likely to yield better results, however: if my memory serves correctly, apt-cacher can update the Packages, Sources, and similar files with the ".pdiff"s on the server side, Apt-Cacher-NG can apparently use debdelta, and so on.
Check my blog post about one APT caching proxy not being efficient for some comments related to those tools.

Another recent change is that mirrors that can't be used by the redirector will no longer be monitored as often as the other mirrors. For instance, if a mirror doesn't generate a trace file (used for monitoring) then the redirector will gradually limit the rate at which the mirror is checked.
This rate-limiting mechanism applies to different kinds of errors, and should reduce the amount of wasted time and bandwidth while still allowing automatic detection of mirrors that recover.


Projection of a rate-limited mirror over six weeks. The mirror would have to fail on every attempt for that to happen.
N.b. there's a bump in the scale.

The rate limiter grants an initial exception so that temporary errors don't affect the redirector's use of the mirror. After that exception, it is pretty much linear. However, that chart doesn't really convey the effect of the rate limiter on its own, so here it is compared with the normal checking behaviour:


Comparison of the two behaviours over an eight-week period, using a logarithmic scale.
Nice chart colours by LibreOffice.
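
To make the idea a bit more concrete, here is a hypothetical sketch of such a back-off, not the redirector's actual code and with made-up numbers: after an initial grace period, the interval between checks grows linearly with every further consecutive failure.

#!/bin/sh
# Hypothetical back-off sketch (not the redirector's actual code).
# $1: number of consecutive failed checks so far.
failures="${1:-0}"
base=30     # minutes between checks of a healthy mirror (made-up value)
grace=3     # initial exception: this many failures don't slow things down

if [ "$failures" -le "$grace" ]; then
    interval="$base"
else
    # Pretty much linear after the initial exception.
    interval=$(( base * (failures - grace + 1) ))
fi
echo "next check in $interval minutes"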

The code to detect mirrors that don't perform a two-stage sync, which I talked about in a previous post, has not yet been integrated: the current implementation would be too expensive on the mirrors to add as-is.

While tracking down problems exposed to users, I decided to take a stricter approach to which mirrors are used by the redirector. Suffice it to say that the remaining mirrors using the obsolete anonftpsync are going to be ignored entirely. ftpsync has been around for a few years now and is the standard tool.
Whether you are mirroring Debian, Raspbian, Ubuntu, or any other Debian-like packages repository, ftpsync is the right tool to use.

Most of the issues I've been discovering, and sometimes working around, affect direct users of the mirrors and are not related to the http.debian.net redirector. When not detected beforehand, they end up being exposed by the redirector; but like I said, I plan to be stricter in order to increase the redirector's reliability. Once a strict and reliable foundation is built, more workarounds might find their way in to make better use of the available resources.

That's it for now. The road is long, the challenge is great, and being an observer in an uncontrolled environment makes it even more interesting.

December 12, 2012

A bashism a week: $RANDOM numbers

Commonly used to sleep a random amount of time or to create unique temporary file names, $RANDOM is one of those bashisms you are best off avoiding altogether.

It is not uncommon to see scripts generating a "unique" temporary file name with code that goes like: tempf="/tmp/foo.$RANDOM", or tempf="/tmp/foo.$$.$RANDOM".

Under some shells, the "unique" temporary file name from the first example will simply be "/tmp/foo.", because $RANDOM expands to nothing. So much for randomness, right?

Even if you can work around it by setting $RANDOM to the output of cksum after reading some bytes from /dev/urandom, please: don't do that. Use the mktemp command instead.
When creating temporary files there's more to it than just generating a file name. Just don't do it on your own: use mktemp (see the sketch after the tips below). Really, use it; the list of those who weren't using mktemp (or similar) is long enough as it is.

Don't even dare mention the Linux kernel-based protection against symlink attacks: there's no excuse for not using mktemp.

Tip: If you are going to use multiple temporary files, create a temporary directory instead. Use mktemp -d.
Tip: Don't reuse a temporary file's name, even if you unlink/remove it. Generate a new one with mktemp.
Tip: Reusing also means doing things like tmp="$(mktemp)"; some_command > "$tmp.stdout" 2> "$tmp.stderr"
Tip: Even if $RANDOM is not empty, don't use it. It could have been exported as an environment variable. Again, just use mktemp.
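
Putting the tips together, a minimal sketch of the intended usage (some_command is just a placeholder):

#!/bin/sh
# Create a private temporary directory and remove it when the script exits.
tmpdir="$(mktemp -d)" || exit 1
trap 'rm -rf "$tmpdir"' EXIT

# Use fresh names inside the directory instead of reusing a single name.
some_command > "$tmpdir/stdout" 2> "$tmpdir/stderr"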

For the remaining cases where you may want a pseudo-random number, such as sleeping a random number of seconds, you can use something as simple as $$. Use shell arithmetic to adjust it as needed: apply the modulo operator, multiply it, etc.
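
For instance, to sleep up to half a minute (just an illustration, the exact numbers don't matter):

# Sleep between 0 and 29 seconds, derived from the process id.
sleep $(( $$ % 30 ))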

If you think you need something more "random" than the process' id, then you should probably not be using $RANDOM in the first place.

December 05, 2012

Introducing: a bashism a week

No matter how many scripting languages exist, it appears that shell programming is here to stay. In many cases it is fast, it "does the job", and best of all: it is available "everywhere". The shell is used by makefiles, on every call to system(), and whatnot.

However, it is a real pain: implementations differ from the standards, some implementations still in use pre-date them, the standards leave room for undefined behaviour, and bugs in the implementations are anything but unknown. You can't just specify a given shell interpreter and think you've dealt with the problem. Writing shell scripts that are portable across many platforms is a nightmare, if it is possible at all.

Surprisingly, in spite of all that, a great amount of shell scripts appear to work flawlessly in many systems.

The switch from bash to dash as the default shell interpreter in Debian wasn't done without quite some work (more if you count the archived bug reports), and the work ain't over.

Over the following months I will be writing about a different "bashism" every Wednesday, hopefully helping people write slightly more portable shell scripts. The posts will focus on widely-seen bashisms, probably ignoring those that Debian's policy requires to be implemented.

The term "bashism" must be understood as any feature or behaviour not required by SUSv3 (aka POSIX:2001), no matter what its origins are or even if the behaviour is not exhibited by the bash shell.

One of the key points is documenting the script's requirements, starting by specifying the right shell interpreter in the shebang.
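
For instance, something as simple as the following already tells the reader what to expect; the comment is of course just an illustration:

#!/bin/sh
# Requires: a POSIX:2001 shell and the standard utilities, nothing more.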

Let's see what comes out of this experiment.

As a matter of fact, I have a few months worth of posts written already. All posts are going to be published at scheduled dates, just like this very post.

December 03, 2012

Some things you wanted to know about http.debian.net

After quite a bit of very welcome feedback, I've put together a FAQ page in an attempt to answer the most common questions about http.debian.net.

Emails have been accumulating for a few weeks now, but I will get to them. So please be patient if you send me an email, or if you have sent me one.

November 17, 2012

Better routing, less bad apples

Another month, another update to http.debian.net. This time around most of the work was done outside the redirector's code base, as strange as it may sound.

The redirector heavily relies on the mirrors doing at least a couple of things right, for the rest it can and does compensate. When it needs to compensate, certain requests are redirected to automatically-detected good mirrors, thus avoiding mirrors that might work fine for some parts of the day but cause headaches during the rest.

So, part of the work done since the last update was to prod more mirror administrators to upgrade to the latest version of ftpsync. This reduces the number of mirrors for which compensation is needed in order to avoid errors during installations and upgrades. Hopefully, no additional work is needed for the redirector to notice the upgrades. This results in immediate improvements.

However, not all mirrors comply with the bare minimum requirements. As stated in my previous blog post, running rsync once is not enough. When mirrors break these assumptions they lead to the "bad apple" effect. The effects in this case are temporary errors, as experienced by some people. The interesting part of those issues is that the affected population may quickly change given the redirector's use of geo location and the way it creates mirror subsets.

As interesting as the distribution of the effects may be, they are not really welcome. So I put together some code to attempt to detect the bad apples. This resulted in a list of mirrors that have now been disabled in the redirector and whose administrators are going to be contacted so that they comply with the minimum requirements. Given that detection is time-sensitive, there's no 100% guarantee that all of them have been identified so far. The code to detect them will have to be adapted and integrated into the redirector's code base so that this kind of issue can be avoided proactively.

Last but not least, the redirector is now using a database of AS peers for better (re-)routing. This is the next move towards decision-making based more on network location/topology than on geographic location. This first use of a peers database is limited to IPv4 and is based on a recent routing table dump and on feedback provided by interested people. If you are a mirror or network administrator, or you are familiar with the topology of your network, please drop me an email so that the redirector can make better use of your peering agreements.

N.b. in the case of the database, the term peer may also include transit providers. It is used to refer to and establish a relationship between two AS(N)s.


Feedback is, as always, welcome. I read each and every email but it may take me some time to get to it, or reply.

October 31, 2012

rsync is not enough

Ask how can one create a Debian mirror and you will get a dozen different responses. Those who are used to mirroring content, whether it is distribution packages, software in general, documents, etc., will usually come up with one answer: rsync.

Truth is: for Debian's archive, rsync is not enough.

Due to the design of the archive, a single call to rsync leaves the mirror in an inconsistent state during most of the sync process.
The explanation is simple: index files are stored in dists/, package files are stored in pool/. Sort those directories by name (just like rsync does) and the indexes end up being updated before the actual packages are downloaded.

There are multiple scripts out there that do exactly that, one of them in Ubuntu's wiki. Plenty more if you search the web.

Now, addressing that issue shouldn't be so difficult, right? After all, all the index files are in dists/, so syncing in two stages should be enough. It's not that simple.

With the dists/ directory containing over 8.5GB worth of indexes and, erm, installer files, even a two-stage sync will usually leave the mirror in an inconsistent state for a while.
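
For illustration, a naive two-stage sync along those lines might look like the following sketch; it is not ftpsync, and the host and paths are made up:

#!/bin/sh
# Naive two-stage sync sketch -- use ftpsync for a real mirror.
SRC="rsync://mirror.example.org/debian/"    # made-up source
DST="/srv/mirror/debian/"

# Stage 1: everything except dists/, so that the packages arrive
# before the indexes that reference them.
rsync -a --exclude="/dists/" "$SRC" "$DST"

# Stage 2: the indexes in dists/, plus deletion of removed files.
rsync -a --delete-after "$SRC" "$DST"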

How about deferring only the bare minimum to the second stage? I hear you ask.
That is the current approach, but it leads to some errors when new index files are added and used. The fact that people insist on writing their own scripts doesn't help.

Hopefully, some ideas such as moving the installer stuff out of dists/ and overhauling the repository layout are being considered. An alternative is to make the users of the mirrors more robust and fault-tolerant, but we would be talking about tens if not hundreds of tools that would need to be improved.

In any case, the one script that is actively maintained, rather portable, and improved from time to time is ftpsync. Please, do yourself and your users a favour: don't attempt to reinvent the wheel (and forget about calling rsync just once).

October 22, 2012

Where to get checkbashisms from (community service)

Lately I've been spending some time checking the Debian archive for bashisms in preparation for the release of Debian wheezy. This requires running checkbashisms against every /bin/sh script, checking the results by hand to discard false positives, and filing bug reports about the bashisms found.
And of course, fixing and improving checkbashisms; some of that work to be published soon.

It is funny how fixing some parsing errors leads to regressions in the form of false negatives caused by other parsing errors... oh well.

However, while looking around the web for references about checkbashisms, I noticed that somebody created a SourceForge project under that same name. It is a fork of an old version of checkbashisms, and hasn't seen an update in over a year. It even appears that a FreeBSD port is based on it.

If you are looking for the latest checkbashisms, please get it either from the latest version of devscripts, or from devscripts' git repository.

October 13, 2012

Debian mirrors map

Ever wondered what the Earth would look like if you added markers for every mirror that is part of Debian's mirrors network?
Debian mirrors map

(the bigger the shadow of the marker, the larger the number of mirrors in that zone)

Mirrors map generated with Leaflet, using OpenStreetMap tiles; mirrors geolocated using GeoLite data created by MaxMind, available from maxmind.com.

September 18, 2012

More mirrors, slightly better IPv6 support and more

As announced in the Debian Project News of September, more mirrors have recently joined Debian's mirrors network. Two more have been added since that issue of DPN was released, and about four more might be added any day now. Many thanks to those sponsoring them!

The mirrors redirector has also received some improvements, and around 2.5 million requests a week.
Among the improvements, users of Teredo and 6to4 tunnels are now handled as IPv4 users. Not only does this give them access to a wider range of mirrors, it also avoids the tunnel overhead.

Another improvement is the prompt detection of mirrors that drop architectures without prior notification. This is thanks to a new version of ftpsync, the standard and recommended tool to create Debian mirrors suitable for Debian's mirrors network.
As more mirror administrators upgrade to the newer version, or switch to it, http.debian.net and other tools become more reliable: they no longer need to perform some checks and hope the results mean the mirrors actually include the architectures they say they include.

In this regard, http.debian.net is an improvement over debian-installer's mirror selection menu: not only does it choose a mirror that is close to the user and up to date, it also ensures the mirror provides the files for an architecture even after the installer image has been built. debian-installer's mirror list is static and hard-coded into the installation media.

There's more to come, stay tuned.

July 10, 2012

Experimental IPv6 support

For a few hours now, http.debian.net has also been reachable over IPv6. The service is still experimental, as the code makes a few assumptions that are not yet automatically verified, and the IPv6 GeoIP database is sort-of experimental too.

Note that the results of a request over IPv4 and one over IPv6 may differ quite a bit: requests over IPv6 are redirected only to hosts that support it, and a great number of mirrors don't.

July 08, 2012

Updates to http.debian.net

It's been a week with quite some changes to http.debian.net, the Debian mirrors redirector, and it keeps coping very well with the continuously increasing traffic: over 1.5 million requests from APT clients in the last seven days! That's half a million more than the week before.

The good news is, those rare Hash Sum mismatch errors should be mostly gone. Ditto for some other sorts of errors. There is now a second server that takes care of monitoring the mirrors and is ready to handle some of the traffic should the need arise. With this new server, http.debian.net will soon be available over IPv6 too.

Those who are at Debconf12 and followed my advice to use http.debian.net will have noticed that it is redirecting users to the local mirror. So, once again, forget about switching mirrors!

Thanks are also due to Jörg Jaspert and Philipp Kern, for the new server and for the work needed to allow http.debian.net access to Debconf12's local mirror, respectively.

Many thanks again to those who keep providing feedback and have helped the project along the way.

What's next? Even more improvements and fixes for some issues, some of which involve further collaboration and cooperation with the mirror administrators.

Some parts of the Debian mirrors network are fragile and may bite every once in a while, but those are being worked on. If you administer a mirror, please use ftpsync and submit your Debian mirror.
You might be happy to know that I'm working on restricting mirrors to specific Autonomous Systems or countries. More on this later.

June 30, 2012

One week later

It's been over a week since I publicly announced the Debian mirrors redirector.
It was covered in several places, including Linux Magazin and mentioned in the latest Debian Project News.

Thanks to all that, and quite a number of early testers of the wheezy image for the Raspberry Pi, http.debian.net has processed over one million requests from APT clients alone. This figure doesn't count users of apt-cacher/apt-cacher-ng/approx, or those who happen to be mirroring Debian with wget. Oh, and this is only the number of requests since last Sunday, when the logs were rotated.

By the way, if you are going to Debconf12 in Nicaragua, why don't you give http.debian.net a try and forget about switching mirrors?

I'm happy to see how well it was received, and to have got so many emails, blog comments, IRC messages, and other sorts of feedback about it. There are surely some rough edges, but the service is still under development and all the feedback helps.

In some cases there appears to be some confusion as to how exactly it chooses the mirrors and why some mirrors are sometimes not used. I'll try to cover some of those topics in this blog, hoping the explanations clear things up.

P.S. If you've sent me an email about http.d.n and I haven't replied, rest assured that I try to take a quick look at every message, but I can't reply to all of them fast enough.

Thanks to everyone!

June 21, 2012

Introducing http.debian.net, Debian's mirrors redirector

You've been there: you are about to install a package, upgrade, or get the source package, and the mirror fails. It is offline, it is out of date, etc. Whatever the reason, you can't do it.

That's only one of the issues that http.debian.net attempts to address. In a nutshell: it works as an http-only content distribution network. It uses the network and geographic location to select the best online and up-to-date mirrors.

APT's sources.list one-liner:
deb http://http.debian.net/debian stable main

Use /debian-backports for backports, and /debian-archive for archive.debian.org.
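
For example, the backports line for squeeze would look like this (adjust the suite to your release; the line is only an illustration):
deb http://http.debian.net/debian-backports squeeze-backports main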

Originally introduced to the Debian mirrors community back in January, the code on both the server and client sides has improved since. All that is needed is squeeze's APT (or aptitude), but wheezy's and sid's will perform better.

Advantages and more details are discussed at the Debian mirrors redirector's website.

(Thanks to the early testers, everyone who has provided feedback, my friends at PuffinHost, and all the mirrors-related people, obviously including those who sponsor them!)

June 20, 2012

Not your gsoc project

Thanks to Google, for some years now there has been the so-called Google Summer of Code programme. Quite an interesting project it is.

However, how many of the slots assigned to Debian have resulted in successful projects? How many of them were actually completed by the student and not by the mentor or somebody else? Are those projects in fact used or useful at all?

All these questions occurred to me while wondering what would have happened if Joey Hess' git-annex-related Kickstarter project had instead been a GSoC project. His campaign was very successful, by the way. Joey: many congratulations!

I'm not trying to point my finger at the (IMHO many) failed or unsuccessful Debian-related GSoC projects, but I wonder how many successful projects are developed outside the get-paid-in-exchange model. I suspect most are, at least in Debian.

What is done for those who actually "deliver" while working on projects as a hobby? Is it really worth having GSoC slots for Debian?

June 19, 2012

Deferring the execution of a command

More than once I have found myself running a program that calls a given script multiple times, sometimes way too many times. In some cases, for instance, it was only necessary to run the script once a minute, instead of hundreds of times.
Sounds like I want something like dpkg's deferred triggers, right? Pretty much so.

With this need, deferred-run was born. Example:

$ deferred-run -l lock -- echo hello world
$ deferred-run -l lock -- echo hello world
$ sleep 3
$ deferred-run -l lock -- echo hello world
[some seconds later...]
hello world

Despite the multiple calls, echo was only run once.
Best part: it only requires sh and flock.

If you are interested in shell trickery, you should take a look at its code.
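
For the impatient, here is a rough sketch of the idea, not the actual deferred-run code, using nothing but sh and flock:

#!/bin/sh
# Sketch only: coalesce repeated calls into one delayed run of the command.
# Usage: sketch.sh LOCKFILE -- COMMAND [ARGS...]
lock="$1"; shift
[ "$1" = "--" ] && shift

(
    # If another instance already holds the lock, it will run the
    # command on our behalf; nothing left to do here.
    flock -n 9 || exit 0
    # Wait a bit so that calls arriving in the meantime get coalesced.
    sleep 5
    "$@"
) 9> "$lock" &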

May 29, 2012

ldebdiff: what local change did I make to a package?

When fixing or modifying a package, I sometimes change the installed files directly. However, once the changes are okay and it is time to prepare a diff, keeping track of and manually diffing the installed files is time-consuming.

That's when ldebdiff does its thing: it runs diff against the original files of a given package and the ones installed on a system.
For example:

$ ldebdiff acpi-fakekey
--- unpacked/etc/init.d/acpi-fakekey    2012-04-05 05:14:21.000000000 -0500
+++ /etc/init.d/acpi-fakekey    2012-04-12 12:10:55.000000000 -0500
@@ -4,8 +4,8 @@
 
 ### BEGIN INIT INFO
 # Provides:          acpi-fakekey
-# Required-Start:    $local_fs $remote_fs
-# Required-Stop:     $local_fs $remote_fs
+# Required-Start:    $remote_fs
+# Required-Stop:     $remote_fs
 # Default-Start:     2 3 4 5
 # Default-Stop:      
 # Short-Description: Start acpi_fakekey daemon
($remote_fs implies $local_fs)
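
In case you are curious about the mechanics, a rough sketch of the idea, not the actual ldebdiff code, could go like this (it assumes apt-get download can fetch the installed version):

#!/bin/sh
# Sketch only: diff a package's pristine files against the installed ones.
pkg="$1"
dir="$(mktemp -d)" || exit 1
trap 'rm -rf "$dir"' EXIT

# Fetch and unpack the original package.
( cd "$dir" && apt-get download "$pkg" && dpkg-deb -x ./"$pkg"_*.deb unpacked )

# Compare every installed regular file with the pristine copy.
dpkg -L "$pkg" | while read -r f; do
    [ -f "$f" ] && diff -u "$dir/unpacked$f" "$f"
done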

May 23, 2012

Installed-Size: 0

Redefining the meaning of Installed-Size:

Package: gcipher
Version: 1.1-1
Installed-Size: 0
$ dget gcipher=1.1-1
...
$ dpkg -x gcipher_1.1-1_all.deb gcipher
$ du -hs gcipher
188K    gcipher
$ du --apparent-size -hs gcipher
75K     gcipher

May 22, 2012

On mirrors and why there are so few

Yes, I said few.

Even though my blog post about Debian's ever-growing mirrors network made it seem like we have a lot of mirrors, truth be told: there are too few. There are many partial or even complete mirrors out there that are not listed.

If you are the administrator of one of those mirrors, please consider submitting your Debian mirror. It helps us keep track of them and exposes them to more users.

Granted, some of them are kept private due to policies (on remote connections, bandwidth use, etc.): hidden behind a LAN, hoping that the people inside actually use them. As of this time, there's not much that can be done about them. Even if they were listed somewhere, there would be no way for somebody to check them from the outside.

Some other mirror networks require that an IP (or a range) be whitelisted, allowing them to access the mirrors from the outside to perform routine checks. For some reason, I thought Fedora's was one of them but, apparently, they use a different approach: they ask private mirrors to run a tool.

I think that each approach has its pros and cons.

April 30, 2012

An ever-growing mirrors network

As of the time of writing, Debian's mirrors network has around 330 mirrors serving the archive over http, and around 300 serving it over ftp.

This month alone, six more mirrors were added. Another six were added last month.

There are mirrors in 73 countries. All this wouldn't be possible without the help of the sponsors.

However, some interesting questions arise: do people actually use those new mirrors? Is having so many mirrors worth the extra load put on the primary mirrors?

I would personally answer: no, and maybe.

Blogosphere: When was the last time you took a look at the mirrors list and/or tried to find a mirror that served you better?

April 28, 2012

Your APT caching proxy is not that efficient

So you read about an "APT caching proxy" and that it can "save time and network bandwidth."
Sounds like something you should have, right? After all, the trade-off is just some additional disk space. APT already has its own cache at /var/cache/apt/archives/, so part of what the caching proxy would store on disk is already there.

Truth is, for a single client a caching proxy might not provide any benefit and might even cause a slowdown. With at least one of the alternatives it is even worse: every request that goes through the proxy and requires a download from a mirror opens a new connection to the mirror.

It is 2012 and a piece of software that aims to "save time and network bandwidth" can't even use keep-alive connections?

You are doing it wrong...

... when you join a team and then every other team member stops working.

(There's a possibility that you are doing such a great job that there's no need for the others, but think about the case where your work, or your way of working, is killing them.)

March 29, 2012

the bash way is faster, but only with bash

Some bashisms look like mere syntactic sugar at first sight, such as the += concatenation syntax. Usually, they happen to be faster than their more portable counterparts. But only with bash itself.

Take the following script as an example:
#!/bin/sh
# script1-portable.sh
part="$(seq 1 100000)"

for i in $(seq 1 10); do
    seq="${part}"
    seq="${seq}${part}"
done

$ time bash script1-portable.sh
user 0m20.837s

Now, compare to the following script that uses += :
#!/bin/sh
# script1-bash.sh
part="$(seq 1 100000)"

for i in $(seq 1 10); do
    seq="${part}"
    seq+="${part}"
done

$ time bash script1-bash.sh
user 0m14.227s

Yes, it's faster. However, when the first script is run with dash:
$ time dash script1-portable.sh
user    0m0.609s

[[ is another example:
#!/bin/sh
# script2-portable.sh
a="$(seq 1 100000)"; b="$(seq 1 100)"

for i in $(seq 1 10); do
    [ "$a" = "$b" ]
done

$ time bash script2-portable.sh
user    0m9.148s
And the version using the bashism:
#!/bin/sh
# script2-bash.sh
a="$(seq 1 100000)"; b="$(seq 1 100)"

for i in $(seq 1 10); do
    [[ $a = $b ]]
done

$ time bash script2-bash.sh
user    0m4.223s

Then again, the bash way is faster, yet it doesn't compare to dash:
$ time dash script2-portable.sh
user    0m0.588s

tainted foolishness

You enable tainting checks, you hack some code to untaint the input, you run the code and no error message appears: you succeeded.

Did you?

No, of course you didn't: not triggering any tainted data warning (or error) doesn't guarantee that your code is "safe". It doesn't mean you can take untrusted input and handle it correctly. It doesn't mean there are no arbitrary code execution vulnerabilities, no XSS vulns, no SQL injections, you name it.

What does it mean, then? That is left as an exercise for the reader.

I have a blog, again

I'm back with a blog.

After I had been inactive for a while on my previous blog, hosted at my.opera.com, they apparently nuked my account. Blog, pictures, and other files: all gone, with no prior notification. I'm rather sad about such a lack of courtesy.
Even when a friend asked me about my blog some months ago (saying something along the lines of it being unavailable), it never occurred to me that it had been cancelled. It wasn't until I tried to post something the other day that I noticed it was gone, for real.

There doesn't even appear to be a way to recover the account.

Anyway, time for a fresh start (sort of). This time at Blogger. Hello blogosphere, hello world.