In my opinion, the biggest problem of the tar format ('ustar') is that it does not store checksums of the files it contains. So, in order to verify the contents of a tar archive, you either need to keep the original data on the hard drive and compare the archive contents against that data using tar's -d switch, or keep the MD5 sums of the files in a separate document and use an external program to check them against the calculated MD5 sums of the archived files. In this short post I introduce a method of creating tar archives while recording the md5sums of the files at the same time, and a utility, veritar, which can compare those md5 sums with the checksums of the contents of the archive in-place, without the need to extract.
Creation of the TAR archive and the MD5 sums file
In the following example it is assumed that the files to back up reside in the myfiles/ subdirectory, the name of the tar archive will be mybackup.tar and the name of the file containing the md5sums will be mybackup.md5.
$ tar -cvpf mybackup.tar myfiles/ \
    | xargs -I '{}' sh -c "test -f '{}' && md5sum '{}'" \
    | tee mybackup.md5
Some notes:
- You can use any tar switch for the creation of the archive except -C. If you need to change to another directory, do it using cd or else no md5 sums will be recorded.
- Make sure that you include the -v (--verbose) switch when invoking tar, as the paths need to be printed to stdout in order to be processed by xargs.
- In the xargs statement, the -I '{}' part indicates that the '{}' string will be replaced by the path that is passed to xargs through the pipe.
- The sh -c "test -f '{}' && md5sum '{}'" part does two things: it tests if the path ('{}') is a file and calculates the md5 sum for it.
- In the last part, tee is used in order to print the md5sum to stdout and also to the mybackup.md5 file.
When this operation ends, you will end up with two files: mybackup.tar and mybackup.md5.
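For readers who prefer to avoid the shell pipeline, the same two files can be produced from Python. The following is a minimal sketch, not part of veritar: the function names md5_of_file and create_archive_with_md5 are hypothetical, and it assumes it is run from the directory that contains myfiles/ so that the recorded paths match the paths stored in the archive. Like the test -f variant above, it records checksums for regular files only.

```python
import hashlib
import os
import tarfile


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def create_archive_with_md5(source_dir, tar_path, md5_path):
    """Archive source_dir and write md5sum-style lines for its regular files.

    Hypothetical helper for illustration; not part of veritar.
    """
    # Plain (uncompressed) tar, like "tar -cf".
    with tarfile.open(tar_path, "w") as archive:
        archive.add(source_dir)

    with open(md5_path, "w") as md5_file:
        for root, dirs, files in os.walk(source_dir):
            dirs.sort()  # walk subdirectories in a stable order
            for name in sorted(files):
                path = os.path.join(root, name)
                # Mirror "test -f": skip anything that is not a regular file.
                if os.path.isfile(path):
                    # md5sum's text-mode format: digest, two spaces, path.
                    md5_file.write(f"{md5_of_file(path)}  {path}\n")
```

Under those assumptions, calling create_archive_with_md5("myfiles/", "mybackup.tar", "mybackup.md5") should leave the same pair of files as the pipeline. Because no shell is involved, quoting characters in file names are not a concern here.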
Special thanks to:
* Anvil for the suggestion to use bash -c "...test goes here..." stuff.
* Giorgos Keramidas for the improvement he suggested, so that the md5 sum calculation is not limited to regular files only:
sh -c "test -d '{}' || md5sum '{}'"
VeriTAR will verify the md5 sums of regular files only, so whichever test you use when creating the TAR archive, it is still fine.
VeriTAR – Tar archive verification
VeriTAR [Veri(fy)TAR] is a command-line utility that verifies the md5 sums of files within a tar archive. Due to limitations of the tar ('ustar') format, the md5 sums are retrieved from a separate file and are checked against the md5 sums of the files within the tar archive. The process takes place without actually extracting the files.
It works with corrupted tar archives. The program carries on to the next file within the archive, skipping the damaged parts. At the moment, this relies on the internal functions of Python’s tarfile module.
VeriTAR is written in Python.
Works with compressed TAR archives (gzip or bz2).
Provided that you have used the method above (or any other method) in order to create a file with the md5 sums together with the tar archive, you can easily verify the contents of the archive with veritar.
$ veritar mybackup.tar mybackup.md5
Please note that veritar’s output and command-line switches need some work, but for now it does the job.
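To illustrate the idea of checking archive members in place, here is a minimal sketch using only Python's standard tarfile and hashlib modules. It is not veritar's actual code: the function names read_md5_file and verify_archive and the status strings are assumptions made for illustration, and unlike veritar it makes no attempt to recover from damaged archives.

```python
import hashlib
import tarfile


def read_md5_file(md5_path):
    """Map each recorded path to its expected digest (md5sum-style lines)."""
    expected = {}
    with open(md5_path) as md5_file:
        for line in md5_file:
            line = line.rstrip("\n")
            if not line:
                continue
            # md5sum prints "<digest>  <path>"; split at the first separator only,
            # so paths that themselves contain two spaces stay intact.
            digest, _, path = line.partition("  ")
            expected[path] = digest
    return expected


def verify_archive(tar_path, md5_path, chunk_size=1024 * 1024):
    """Return (path, status) tuples for the regular files in the archive."""
    expected = read_md5_file(md5_path)
    results = []
    # Mode "r:*" lets tarfile detect gzip or bz2 compression by itself.
    with tarfile.open(tar_path, "r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # directories, links and devices carry no data to hash
            digest = hashlib.md5()
            # extractfile() returns a file-like object; nothing is written to disk.
            stream = archive.extractfile(member)
            for chunk in iter(lambda: stream.read(chunk_size), b""):
                digest.update(chunk)
            if member.name not in expected:
                status = "MISSING"
            elif expected[member.name] == digest.hexdigest():
                status = "OK"
            else:
                status = "FAILED"
            results.append((member.name, status))
    return results
```

The sketch reads each member in chunks so that large files do not have to fit in memory. It only matches paths that are recorded in the md5 file exactly as they are stored in the archive.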
Veritar is released under the Apache License version 2.
It is completely unsupported, but you can still get community support at our software forums. This is also the place where you can inform me about any bugs.
Known issues
- Multi-volume tar archives are not supported at the moment
- Tar archives in which the metadata of the first archived file has been corrupted cannot be processed due to a limitation in the tarfile Python module at the time of writing
- Although in principle the checksums of any algorithm (md5, sha1, crc32) could be used, the current alpha version is not very flexible.
- It may crash on damaged archives on older Python versions.
VeriTAR – Verify checksums of files within a TAR archive by George Notaras is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Copyright © 2007 - Some Rights Reserved
VeriTAR has been featured at http://www.linuxlinks.com
Steve, thanks, I appreciate it.
The command has problems with filenames with single quotes in them. I use:
tar -cvpf mybackup.tar myfiles/ \
| tr '\n' '\0' \
| xargs --null -I '{}' sh -c "test -f '{}' && md5sum '{}'" \
| tee mybackup.md5
Hello George,
Thank you for your work on veritar. I have found a case in which it is not working – file names with two consecutive blanks – and I have a simple patch to correct it, see below.
Alexandru
diff VeriTAR-orig.py VeriTAR.py
85c85,87
md5index=line.find(” “)
> csum=line[0:md5index]
> name=line[md5index:]
Sorry, the diff was trashed by HTML; I am trying again surrounded with code tags.
Alexandru
No luck again :( , anyway I have replaced line 85 in VeriTAR.py (csum, name = line.split("  ") ) with the next 3 lines
md5index=line.find("  ")
csum=line[0:md5index]
name=line[md5index:]
Alexandru
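The same idea can be written a little more compactly. This is a minimal sketch, not taken from VeriTAR itself; the helper name parse_md5_line is hypothetical, and it assumes md5sum's usual text-mode output of digest, two spaces, then the file name. Splitting only at the first double space keeps any later runs of blanks inside the name.

```python
def parse_md5_line(line):
    """Split an md5sum-style line into (checksum, name) at the first double space.

    Hypothetical helper for illustration; not part of VeriTAR.
    """
    csum, separator, name = line.rstrip("\n").partition("  ")
    if not separator:
        raise ValueError(f"not an md5sum-style line: {line!r}")
    return csum, name


# A file name that itself contains two consecutive blanks.
line = "d41d8cd98f00b204e9800998ecf8427e  my  file.txt\n"
print(parse_md5_line(line))
```

Unlike slicing at the index returned by find, partition also drops the separator, so the name comes back without leading blanks.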
Hello,
I have been trying to use the veritar.py in a bash script and it seems to work, but I always get the following messages in the error file:
Could you please advise me what I should do to fix this?
Thank you.
Best,
Monica
Hello Monica,
Thanks for your feedback. I’m sorry for the late reply, but your comment for some reason had been caught by the spam filter.
As far as I can remember, I had developed this script using Python 2.4. Maybe some things have changed since then in the tarfile module of the Standard Library. Time permitting I’ll try to take a look at it again. BTW, there is a 0.4.0 (refactored) release over at GitHub: https://github.com/gnotaras/veritar/releases I have no idea if it’s going to work though. The point is that I’ve stopped using or developing this script, so I highly recommend using an alternative utility that is actively maintained to get this job done.
Best Regards,
George