Many of the documents here talk about downloading internet materials and applying some customisation before using them.
I do the downloads into /var/local/mirror, and edits such as hacking the source or compiling go in /var/local/union. These changes are captured in /var/local/overlay, which can be backed up, while the original mirror need not be, since the original internet path is recorded in the layout. The probability of losing /var/local/mirror and the resource disappearing off the internet at the same time is quite low, but to hedge further, ask a few internet archives to back up the resources of interest.
Another good reason is security and privacy: internet access can be cut off from the lab.
The tradeoff is storing a huge amount of downloads, most of which might never be used; then again, television broadcasts much content that is never watched, so it is not that bad.
What | Where | How | Options | Dump | Pass |
---|---|---|---|---|---|
/dev/mapper/fs-mirror | /var/local/mirror | ext4 | noauto,x-systemd.automount,user_xattr,acl,jqfmt=vfsv1,usrjquota=aquota.user,grpjquota=aquota.group | 1 | 2 |
/dev/mapper/fs-overlay | /var/local/overlay | ext4 | noauto,x-systemd.automount,user_xattr,acl,jqfmt=vfsv1,usrjquota=aquota.user,grpjquota=aquota.group | 1 | 2 |
overlay | /var/local/union | overlay | noauto,x-systemd.automount,lowerdir=/var/local/mirror,upperdir=/var/local/overlay/upper,workdir=/var/local/overlay/work | 0 | 0 |
/var/local/overlay/upper=RW:/var/local/mirror=RO | /var/local/union | fuse.unionfs-fuse | noauto,x-systemd.automount,allow_other,cow,max_files=32768 | 0 | 0 |
The newer in-kernel overlayfs is expected to be faster than the userspace option, though unionfs-fuse can be exported over NFS.
The favourite of the inter-mirror transports, aside from one-way multicasting, is rsync, and it can also transport other content verbatim.
It is worth talking about the options: we are -recursive, -verbose, and take -links, -times, -Executability, -Hard-links and, naturally, -(e)Xtended attributes. I usually don't carry -Acls and other permission data between mirrors.
We report -Progress and usually use -Relative to place the destination by module, even when only a subset of the source is interesting. As this is a written example it includes the do -nothing switch, which is removed to do it for real.
Some additional options can be added, like socket options = IPTOS_THROUGHPUT,SO_RCVBUF=,SO_SNDBUF= with suitably big numbers.
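Putting those switches together, a sketch of the transfer; the upstream host and module path are placeholders, and the command is echoed rather than run:

```shell
# hedged sketch of the inter-mirror rsync described above; "upstream" and the
# module path are placeholder examples.  The switches spell out -recursive,
# -verbose, -links, -times, -Executability, -Hard-links, -Xattrs, -Progress,
# -Relative and the do-nothing dry run: drop -n to do it for real.
RSYNC_OPTS="-rvltEHXPRn"
echo rsync "$RSYNC_OPTS" upstream::mirror/https/www.example.com/ /var/local/mirror/
```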
Of course, if the upstream is entrusted to keep cool URLs, i.e. you run it, then content may be placed directly at the root of the downstream mirror.
For Subversion repositories, svnsync would be the tool.
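A sketch of mirroring with svnsync; the repository URL and mirror path are placeholder examples, and the commands are echoed for illustration:

```shell
# hedged sketch: initialise a local svn mirror once, then synchronise on each
# pass.  The destination repository must first be created with svnadmin create
# and be given a pre-revprop-change hook that permits revprop changes.
SRC=https://svn.example.com/repos/project
DST=file:///var/local/mirror/https/svn.example.com/repos/project
echo svnsync initialize "$DST" "$SRC"
echo svnsync synchronize "$DST"
```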
Some git repos have submodules; pull them into the mirror too. GitHub used, then abandoned, port 9418 (the git:// protocol), not even via IPsec, so use a sed 's_git://github.com/_https://github.com/_g' to redirect those to https.
    thirstydoggie () {
        # collect submodule URLs from every mirrored repository,
        # rewriting the abandoned git:// protocol to https
        URLS=$(for REPO in /var/local/mirror/https/git*/*/*.git
               do
                   git -C "$REPO" show HEAD:.gitmodules |
                       grep '^\s*url' | cut -d "=" -f2- | tr -d " " |
                       sed 's_git://github.com/_https://github.com/_g'
               done)
        # update submodule mirrors that already exist, clone those that do not
        for URL in ${URLS}
        do
            M=/var/local/mirror/$(echo "$URL" | sed s_://_/_)
            if test -e "$M"
            then
                echo git -C "$M" remote --verbose update
            else
                echo git clone --mirror "$URL" "$M"
            fi
        done
    }
If the repo uses submodules, adjust the URLs therein to point at the mirror before doing the submodule pull.
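One way to make that adjustment without editing .gitmodules is git's url.&lt;base&gt;.insteadOf rewriting; a sketch, where the mirror prefix is an assumption based on the layout above and the commands are echoed for illustration:

```shell
# hedged sketch: rewrite github URLs to the local mirror so the submodule
# pull never touches the internet; the mirror prefix is an assumption
MIRROR=/var/local/mirror/https/github.com/
echo git config --global url."$MIRROR".insteadOf https://github.com/
echo git submodule update --init --recursive
```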
First, a function for when autoreconf is not aggressive enough.
    autoreconfer () {
        libtoolize --force
        aclocal
        autoheader
        automake --force-missing --add-missing
        autoconf
    }
Now wget can be made to fetch for mirroring: place something in a wgetrc to get paths like /var/local/mirror/http/www.example.com/some/example, and this pattern follows for all the other protocols, including ftp and git.
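A minimal sketch of such a wgetrc fragment; the option names are from wgetrc(5), and the http-only prefix is an assumption, to be adjusted per protocol:

```
# fragment of ~/.wgetrc (sketch; plain http only - adjust dir_prefix per protocol)
dir_prefix = /var/local/mirror/http
dirstruct = on
timestamping = on
```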
Wget obtains HTTP headers during download; these can be saved with --save-headers, though that prepends them to the downloads. Such files are then less useful to most other programs and are even unrecognisable to wget itself.
The other way to get at the headers is to tell wget to send logging to a file, then extract the headers from that. To work around not seeing interactive logging, use a fifo and have the log processor display progress, or else use tee, or tail the logfile.
I tend to save them in a single http xattr, which also preserves any information inferable from header order.
The headers can then be accessed for purposes such as:
    #!/bin/sh
    # scrape response headers out of a wget logfile and attach them
    # to the saved file as an "http" xattr
    INHEADERS=0
    while IFS= read -r REPLY
    do
        KEY=`echo "${REPLY}" | cut -c1-2`
        # a new response starts a fresh header collection
        if test "${REPLY}" = "Proxy request sent, awaiting response... "
        then
            HEADERS=
            INHEADERS=1
        fi
        # header lines are indented by two spaces in the log
        if test "${KEY}" = "  " -a "${INHEADERS}" -eq 1
        then
            HEADER=`echo "${REPLY}" | cut -c3-`
            HEADERS="${HEADERS}"'\r\n'"${HEADER}"
        fi
        # the "Saving to: ..." line names the destination file
        if test "${KEY}" = "Sa"
        then
            FILE=`echo "${REPLY}" | cut -b15- | rev | cut -b4- | rev`
            HTTP="`echo "${HEADERS}" | tail -c +3`"
            attr -s http -V "${HTTP}" "${FILE}"
        fi
    done < ~/.wget-log
The same is possible with curl: grab a download for the mirror and copy the HTTP headers while we are at it.
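A sketch of the curl route, dumping the response headers with -D and attaching them as the same http xattr; the URL and paths are placeholder examples, and the commands are echoed rather than run:

```shell
# hedged sketch: curl equivalent of the wget recipe above; --create-dirs makes
# the mirror path for -o, and -D dumps the response headers to a side file
# which is then moved into the http xattr.  URL and paths are placeholders.
URL=https://www.example.com/some/example
FILE=/var/local/mirror/https/www.example.com/some/example
echo curl --create-dirs -o "$FILE" -D "$FILE.headers" "$URL"
echo attr -s http -V "\$(cat $FILE.headers)" "$FILE"
echo rm "$FILE.headers"
```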
This is especially useful with the F12 developer tools in Firefox and the Chromium browsers, including Google Chrome and the new Microsoft Edge, where some javascript can also be pasted for evasive websites that try to open downloads in new tabs.
For going offline, have a real internal network, not a trust setting that can be overridden.