Internet mirroring

Many of the documents here talk about downloading material from the internet and applying some customisation in order to use it.

I download into /mirror, and edits such as hacking the source or compiling go in /union. These changes are captured in /overlay, which can be backed up, leaving the original mirror out of the backups, since the original internet path is recorded in the directory layout. The probability of losing /mirror at the same time as the resource disappearing off the internet is quite low, but to hedge against this further, ask a few internet archives to back up the resource of interest.
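
A minimal sketch of the backup side, assuming /backup as the destination (the path and naming are my assumptions):

# Back up only /overlay (the local changes); /mirror can be
# re-fetched from the internet, so it is deliberately excluded.
tar -czf /backup/overlay-`date +%Y%m%d`.tar.gz -C / overlay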

Setup

/etc/fstab
What                    Where     How                Options                                                                  Dump  Pass
/dev/mapper/fs-mirror   /mirror   ext4               user_xattr,acl,jqfmt=vfsv1,usrjquota=aquota.user,grpjquota=aquota.group 1     2
/dev/mapper/fs-overlay  /overlay  ext4               user_xattr,acl,jqfmt=vfsv1,usrjquota=aquota.user,grpjquota=aquota.group 1     2
/overlay=RW:/mirror=RO  /union    fuse.unionfs-fuse  allow_other,cow,stats                                                   0     0
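
With these entries in place (and the device-mapper volumes already opened), a sketch of bringing the stack up in order:

mkdir -p /mirror /overlay /union
mount /mirror
mount /overlay
mount /union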

CVS

  1. mkdir -p /mirror/cvs/cvs.example/cvsroot/example;
  2. pushd /mirror/cvs/cvs.example/cvsroot/example;
  3. cvs -d:pserver:anonymous@cvs.example:/cvsroot/example login;
  4. cvs -z3 -d:pserver:anonymous@cvs.example:/cvsroot/example co -P example;
  5. popd;
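
The same steps wrapped in a hypothetical helper following the layout above (HOST, ROOT and MODULE are placeholders of mine):

mirror_cvs () 
{ 
    HOST="$1"; ROOT="$2"; MODULE="$3";
    mkdir -p "/mirror/cvs/${HOST}${ROOT}";
    pushd "/mirror/cvs/${HOST}${ROOT}";
    cvs -d":pserver:anonymous@${HOST}:${ROOT}" login;
    cvs -z3 -d":pserver:anonymous@${HOST}:${ROOT}" co -P "${MODULE}";
    popd;
}

mirror_cvs cvs.example /cvsroot/example example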

subversion

This uses svnsync:

  1. TO=/mirror/http/my.example/svn/trunk/
  2. FROM=http://my.example/svn/trunk/
  3. mkdir -p ${TO}
  4. svnadmin create ${TO}
  5. echo '#!/bin/sh' > ${TO}hooks/pre-revprop-change
  6. chmod 755 ${TO}hooks/pre-revprop-change
  7. svnsync init file://${TO} ${FROM}
  8. svnsync sync file://${TO} ${FROM}
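
Later refreshes only need the destination, since svnsync records the source URL in the mirror's revision 0 properties at init time:

svnsync sync file://${TO}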

GIT

First, a function for when autoreconf is not aggressive enough.

autoreconfer () 
{ 
    libtoolize --force;                       # install libtool support files
    aclocal;                                  # gather m4 macros into aclocal.m4
    autoheader;                               # generate config.h.in
    automake --force-missing --add-missing;   # (re)install missing auxiliary files
    autoconf;                                 # generate configure
}
  1. git clone --mirror -v git://git.example/repo.git /mirror/git/git.example/repo.git;
  2. rm -rfv /union/git/git.example/repo.git;
  3. git clone /mirror/git/git.example/repo.git /union/git/git.example/repo.git;
  4. pushd /union/git/git.example/repo.git;
  5. autoreconfer;
  6. ./configure --prefix=/union/git/git.example/repo.out;
  7. make install;
  8. popd;
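
Refreshing the mirror later is a single fetch, since a --mirror clone updates all refs:

git --git-dir=/mirror/git/git.example/repo.git remote update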

Generic http sources

Now wget can be made to fetch for mirroring: put a few settings in a wgetrc to get paths like /mirror/http/www.example.com/some/example. This pattern then carries over to all the other protocols, including ftp and git.
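
A sketch of such a wgetrc; these are standard wgetrc settings, though the exact selection is my assumption:

# Mirror into /mirror/<protocol>/<host>/<path>
dir_prefix = /mirror
dirstruct = on
protocol_directories = on
timestamping = on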

Backing up http response headers to mirror with wget

Wget obtains http headers during download; these can be saved with --save-headers, though that prepends them to the downloaded files. The files are then less useful to most other programs and are even unrecognisable to wget itself.

The other way to get at the headers is to tell wget to send its logging to a file, then extract the headers from that. Losing the interactive logging can be worked around by using a fifo and having the log processor display progress, or else by using tee, or by tailing the logfile.
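
The tee variant is the simplest, since wget logs to stderr by default (the URL is a placeholder):

wget -S http://www.example.com/some/example 2>&1 | tee ~/.wget-log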

I tend to save them in a single http xattr, which also preserves any information inferable from header order.

The headers can then be pulled out of the log and attached to each saved file, for example:

#!/bin/sh

# Collect the response headers that wget -S wrote into the log and
# attach them to each downloaded file as a single "http" xattr.
# Relies on /bin/sh's echo expanding \r\n escapes (true of dash).

INHEADERS=0

while IFS= read -r REPLY
do
 # The first two characters classify the line.
 KEY=`echo "${REPLY}" | cut -c1-2`
 # A response block is announced by one of these lines.
 if test "${REPLY}" = "HTTP request sent, awaiting response... " \
   -o "${REPLY}" = "Proxy request sent, awaiting response... "
 then
  HEADERS=
  INHEADERS=1
 fi
 # Header lines are indented by two spaces; accumulate them CRLF-separated.
 if test "${KEY}" = "  " -a "${INHEADERS}" -eq 1
 then
  HEADER=`echo "${REPLY}" | cut -c3-`
  HEADERS="${HEADERS}"'\r\n'"${HEADER}"
 fi
 # "Saving to: '...'" names the output file: drop the label and the
 # surrounding UTF-8 quotation marks (3 bytes each), then set the
 # xattr, trimming the leading CRLF off the accumulated headers.
 if test "${KEY}" = "Sa"
 then
  FILE=`echo "${REPLY}" | cut -b15- | rev | cut -b4- | rev`
  HTTP="`echo "${HEADERS}" | tail -c +3`"
  attr -s http -V "${HTTP}" "${FILE}"
 fi
done < ~/.wget-log
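
End to end, with the wgetrc above in place and the script saved as save-headers.sh (a hypothetical name):

wget -S -o ~/.wget-log http://www.example.com/some/example
sh save-headers.sh
attr -g http /mirror/http/www.example.com/some/example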