Many of the documents here talk about downloading internet materials and applying some customisation before using them.
I do the downloads into /var/local/mirror, and edits such as hacking the source or compiling go in /var/local/union. These changes are captured in /var/local/overlay, which can be backed up, while the original mirror need not be, since the original internet path is recorded in the layout. The probability of losing /var/local/mirror and the resource disappearing off the internet at the same time is quite low, but to hedge further, ask a few internet archives to back up the resources of interest.
Another good reason is security and privacy: internet access can be cut off from the lab.
The tradeoff is storing a huge amount of downloads, most of which might never be used; then again, television broadcasts much content that is never watched, so it is not that bad.
What | Where | How | Options | Dump | Pass |
---|---|---|---|---|---|
/dev/mapper/fs-mirror | /var/local/mirror | ext4 | noauto,x-systemd.automount,user_xattr,acl,jqfmt=vfsv1,usrjquota=aquota.user,grpjquota=aquota.group | 1 | 2 |
/dev/mapper/fs-overlay | /var/local/overlay | ext4 | noauto,x-systemd.automount,user_xattr,acl,jqfmt=vfsv1,usrjquota=aquota.user,grpjquota=aquota.group | 1 | 2 |
overlay | /var/local/union | overlay | noauto,x-systemd.automount,lowerdir=/var/local/mirror,upperdir=/var/local/overlay/upper,workdir=/var/local/overlay/work | 0 | 0 |
/var/local/overlay/upper=RW:/var/local/mirror=RO | /var/local/union | fuse.unionfs-fuse | noauto,x-systemd.automount,allow_other,cow,max_files=32768 | 0 | 0 |
The newer in-kernel overlayfs is expected to be faster than the userspace option, though unionfs-fuse can be exported over NFS.
The favourite of the inter-mirror transports, aside from one-way multicasting, is rsync, and it can also transport other content verbatim.
It is worth talking about the options: we are -recursive, -verbose, and take -links, -times, -Executability, -Hard-links and, naturally, -(e)Xtended attributes. I usually don't carry -Acls and other permission data between mirrors.
We report -Progress and usually use -Relative to place the destination by module, even when only a subset of the source is interesting. As this is a written example it includes the do -nothing switch, which is removed to do it for real.
Some additional options can be added, like socket options = IPTOS_THROUGHPUT,SO_RCVBUF=,SO_SNDBUF= with suitably big numbers.
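Putting those switches together, a sketch of the transfer; the upstream host and module path are placeholders, and the command is echoed rather than run:

```shell
# hedged sketch of the inter-mirror rsync described above; "upstream" and the
# module path are placeholder examples.  The switches spell out -recursive,
# -verbose, -links, -times, -Executability, -Hard-links, -Xattrs, -Progress,
# -Relative and the do-nothing dry run: drop -n to do it for real.
RSYNC_OPTS="-rvltEHXPRn"
echo rsync "$RSYNC_OPTS" upstream::mirror/https/www.example.com/ /var/local/mirror/
```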
Of course, if the upstream is entrusted to keep cool URLs, i.e. you run it, then content may be placed directly at the root of the downstream mirror.
For Subversion repositories, svnsync would be the tool.
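A sketch of mirroring with svnsync; the repository URL and mirror path are placeholder examples, and the commands are echoed for illustration:

```shell
# hedged sketch: initialise a local svn mirror once, then synchronise on each
# pass.  The destination repository must first be created with svnadmin create
# and be given a pre-revprop-change hook that permits revprop changes.
SRC=https://svn.example.com/repos/project
DST=file:///var/local/mirror/https/svn.example.com/repos/project
echo svnsync initialize "$DST" "$SRC"
echo svnsync synchronize "$DST"
```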
Some git repos have submodules; pull them into the mirror too. GitHub used, then abandoned, port 9418 (the git:// protocol), not even via IPsec, so use a sed 's_git://github.com/_https://github.com/_g' to redirect those to https.
    thirstydoggie () {
        # collect submodule URLs from every mirrored repository,
        # rewriting the abandoned git:// protocol to https
        URLS=$(for REPO in /var/local/mirror/https/git*/*/*.git
               do
                   git -C "$REPO" show HEAD:.gitmodules |
                       grep '^\s*url' | cut -d "=" -f2- | tr -d " " |
                       sed 's_git://github.com/_https://github.com/_g'
               done)
        # update submodule mirrors that already exist, clone those that do not
        for URL in ${URLS}
        do
            M=/var/local/mirror/$(echo "$URL" | sed s_://_/_)
            if test -e "$M"
            then
                echo git -C "$M" remote --verbose update
            else
                echo git clone --mirror "$URL" "$M"
            fi
        done
    }
If the repo uses submodules, adjust the URLs therein to point at the mirror before doing the submodule pull.
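One way to make that adjustment without editing .gitmodules is git's url.&lt;base&gt;.insteadOf rewriting; a sketch, where the mirror prefix is an assumption based on the layout above and the commands are echoed for illustration:

```shell
# hedged sketch: rewrite github URLs to the local mirror so the submodule
# pull never touches the internet; the mirror prefix is an assumption
MIRROR=/var/local/mirror/https/github.com/
echo git config --global url."$MIRROR".insteadOf https://github.com/
echo git submodule update --init --recursive
```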
First, a function for when autoreconf is not aggressive enough.
    autoreconfer () {
        libtoolize --force
        aclocal
        autoheader
        automake --force-missing --add-missing
        autoconf
    }
Now wget can be made to fetch for mirroring: place something in a wgetrc to get paths like /var/local/mirror/http/www.example.com/some/example, and this pattern follows for all the other protocols, including ftp and git.
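A minimal sketch of such a wgetrc fragment; the option names are from wgetrc(5), and the http-only prefix is an assumption, to be adjusted per protocol:

```
# fragment of ~/.wgetrc (sketch; plain http only - adjust dir_prefix per protocol)
dir_prefix = /var/local/mirror/http
dirstruct = on
timestamping = on
```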
Wget obtains HTTP headers during download; these can be saved with --save-headers, though that prepends them to the downloads. Such files are then less useful to most other programs and are even unrecognisable to wget itself.
The other way to get at the headers is to tell wget to send logging to a file, then extract the headers from that. To work around not seeing interactive logging, use a fifo and have the log processor display progress, or else use tee, or tail the logfile.
I tend to save them in a single http xattr, which also preserves any information inferable from header order.
The headers can then be accessed for purposes such as:
    #!/bin/sh
    # scrape response headers out of a wget logfile and attach them
    # to the saved file as an "http" xattr
    INHEADERS=0
    while IFS= read -r REPLY
    do
        KEY=`echo "${REPLY}" | cut -c1-2`
        # a new response starts a fresh header collection
        if test "${REPLY}" = "Proxy request sent, awaiting response... "
        then
            HEADERS=
            INHEADERS=1
        fi
        # header lines are indented by two spaces in the log
        if test "${KEY}" = "  " -a "${INHEADERS}" -eq 1
        then
            HEADER=`echo "${REPLY}" | cut -c3-`
            HEADERS="${HEADERS}"'\r\n'"${HEADER}"
        fi
        # the "Saving to: ..." line names the destination file
        if test "${KEY}" = "Sa"
        then
            FILE=`echo "${REPLY}" | cut -b15- | rev | cut -b4- | rev`
            HTTP="`echo "${HEADERS}" | tail -c +3`"
            attr -s http -V "${HTTP}" "${FILE}"
        fi
    done < ~/.wget-log
The same is possible with curl: grab a download for the mirror and copy the HTTP headers while we are at it.
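A sketch of the curl route, dumping the response headers with -D and attaching them as the same http xattr; the URL and paths are placeholder examples, and the commands are echoed rather than run:

```shell
# hedged sketch: curl equivalent of the wget recipe above; --create-dirs makes
# the mirror path for -o, and -D dumps the response headers to a side file
# which is then moved into the http xattr.  URL and paths are placeholders.
URL=https://www.example.com/some/example
FILE=/var/local/mirror/https/www.example.com/some/example
echo curl --create-dirs -o "$FILE" -D "$FILE.headers" "$URL"
echo attr -s http -V "\$(cat $FILE.headers)" "$FILE"
echo rm "$FILE.headers"
```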
This is especially useful with the F12 developer tools in Firefox and the Chromium browsers, including Google Chrome and the new Microsoft Edge, where some javascript can also be pasted for evasive websites that try to open downloads in new tabs.
For going offline, have a real internal network, not a trust setting that can be overridden.