Internet mirroring

Many of the documents here talk about downloading materials from the internet and applying some customisation before using them.

Downloads go to /mirror, and edits such as hacking the source or compiling go in /union; those changes are captured in /overlay, which can be backed up, leaving the original mirror out of the backup, since the original internet path is captured in the filesystem layout. The probability of losing /mirror and the resource disappearing off the internet at the same time is quite low, but to hedge further against this, ask a few internet archives to back up the resource of interest.



The new in-kernel overlayfs is expected to be faster than the userspace options, though unionfs-fuse can be NFS-exported.
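As a sketch, the union could be mounted via an fstab entry along these lines; the workdir is an overlayfs requirement, and /overlay-work is an assumed name, not from the layout above:

```
# /etc/fstab fragment: /mirror stays read-only below, changes land in /overlay
overlay  /union  overlay  lowerdir=/mirror,upperdir=/overlay,workdir=/overlay-work  0 0
```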


For CVS, check out directly into the mirror tree:

  1. mkdir -p /mirror/cvs/cvs.example/cvsroot/example;
  2. pushd /mirror/cvs/cvs.example/cvsroot/example;
  3. cvs -z3 -d:pserver:anonymous@cvs.example:/cvsroot/example co -P example;
  4. popd;


For Subversion, use svnsync:

  1. TO=/mirror/http/my.example/svn/trunk/
  2. FROM=http://my.example/svn/trunk/
  3. mkdir -p ${TO}
  4. svnadmin create ${TO}
  5. echo '#!/bin/sh' > ${TO}hooks/pre-revprop-change
  6. chmod 755 ${TO}hooks/pre-revprop-change
  7. svnsync init file://${TO} ${FROM}
  8. svnsync sync file://${TO} ${FROM}
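Step 8 can be re-run at any time to pick up new upstream revisions, since svnsync init records the source URL in the mirror. For example, as a crontab fragment:

```
# crontab fragment: refresh the svn mirror nightly
0 3 * * *  svnsync sync file:///mirror/http/my.example/svn/trunk/
```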


First, a function, for when autoreconf is not aggressive enough.

autoreconfer ()
{
    libtoolize --force;
    aclocal;
    autoconf;
    automake --force-missing --add-missing;
}

Then mirror, check out, and build:
  1. git clone --mirror -v git://git.example/repo.git /mirror/git/git.example/repo.git;
  2. rm -rv /union/git/git.example/repo.git;
  3. git clone /mirror/git/git.example/repo.git /union/git/git.example/repo.git;
  4. pushd /union/git/git.example/repo.git;
  5. autoreconfer;
  6. ./configure --prefix=/union/git/git.example/repo.out;
  7. make install;
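The --mirror clone from step 1 can be refreshed later with git remote update rather than re-cloning. A self-contained demonstration using throwaway local repositories (the paths here are illustrative, not the /mirror layout):

```shell
set -e
# throwaway repositories standing in for git.example and the mirror
SRC=`mktemp -d`
DST="${SRC}.mirror"
git init -q "${SRC}"
git -C "${SRC}" -c user.email=a@example -c user.name=a commit -q --allow-empty -m first
# take the mirror clone, as in step 1
git clone -q --mirror "${SRC}" "${DST}"
# upstream moves on...
git -C "${SRC}" -c user.email=a@example -c user.name=a commit -q --allow-empty -m second
# ...and the mirror catches up without re-cloning
git -C "${DST}" remote update > /dev/null
git -C "${DST}" log --oneline | wc -l    # both commits present
```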

Generic http sources

Now wget can be made to fetch for mirroring: place the right options in a wgetrc to get paths like /mirror/http/, and this layout carries over to all the other protocols, including ftp and git.
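A minimal wgetrc sketch for that layout, using the option names from the wget manual:

```
# ~/.wgetrc fragment: store downloads as /mirror/<protocol>/<host>/<path>
dir_prefix = /mirror
dirstruct = on
protocol_directories = on
```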

Backing up http response headers to mirror with wget

Wget obtains the http headers during download; these can be saved with --save-headers, though that prepends them to the downloaded files. Such files are then less useful to most other programs and are even unrecognisable to wget itself.

The other way to get at the headers is to tell wget to log the server response to a file (-S together with -o logfile), then extract the headers from that. Not seeing interactive logging can be worked around by using a fifo and having the log processor display progress, or else by using tee, or by tailing the logfile.

I tend to save them in a single http xattr, which also preserves any information inferrable from header order.

The headers can then be extracted from the log and attached to each file:



INHEADERS=0
HEADERS=""
while IFS= read -r REPLY; do
 KEY=`echo "${REPLY}" | cut -c1-2`
 if test "${REPLY}" = "Proxy request sent, awaiting response... "; then
  INHEADERS=1
  HEADERS=""
 elif test "${KEY}" = "  " -a "${INHEADERS}" -eq 1; then
  HEADER=`echo "${REPLY}" | cut -c3-`
  HEADERS="${HEADERS}${HEADER}
"
 elif test "${KEY}" = "Sa"; then
  INHEADERS=0
  FILE=`echo "${REPLY}" | cut -b15- | rev | cut -b4- | rev`
  attr -s http -V "${HEADERS}" "${FILE}"
 fi
done < ~/.wget-log

The same is possible with curl: define a wrapper that grabs the download and copies the http headers into the xattr.

  1. curl ()
  2. {
  3. U=/mirror/`echo "${1}" | sed s_://_/_`;
  4. mkdir -p `dirname "${U}"`;
  5. H="`/usr/bin/curl --create-dirs -D /dev/stdout --output "${U}" "${@}"`";
  6. attr -s http -V "${H}" "${U}"
  7. }
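The sed substitution in step 3 maps a URL straight onto the mirror layout by collapsing :// into /; for example:

```shell
URL=http://my.example/svn/trunk/README
U=/mirror/`echo "${URL}" | sed s_://_/_`
echo "${U}"
```

This yields /mirror/http/my.example/svn/trunk/README, the same path wget produces with the wgetrc settings above.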