A few days ago I decided to revise my data backup methods, so that I would be able to easily recover as much data as possible after a partial corruption of the medium, that is, of the DVD on which the data is stored. I should clarify that by corruption I do not mean mechanical damage to the medium. After some research on the web and some questions on mailing lists and IRC channels, the quest ended with two formats to choose from: tar and cpio.
What matters most to me when a backup is partially corrupted is being able to easily extract the archived files that are still intact. In order to make a final decision about which format to use, I performed the following tests:
- Tests using tar:
  - Random 1-byte corruption.
  - Partial corruption of one of the archived files' metadata.
- Tests with cpio:
  - Random 1-byte corruption.
  - Total corruption of one of the archived files' metadata (same result with partial header corruption).
Information about the two formats was found at the following web pages:
- CPIO specification (New ASCII format with CRC added)
- TAR specification (USTAR format)
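To make the offsets used in the scripts below easier to follow, here is a rough reference of the two header layouts as I read them from the specifications above. It is only a sketch of the fields this article relies on, not a complete description of either format:

# USTAR: every header occupies one 512-byte block; the defined fields use
# the first 500 bytes of it (all offsets/sizes in bytes).
USTAR_BLOCK = 512
USTAR_NAME = (0, 100)        # pathname
USTAR_SIZE = (124, 12)       # file size, octal ASCII
USTAR_CHKSUM = (148, 8)      # header checksum, octal ASCII
USTAR_MAGIC_OFFSET = 257     # the "ustar" magic string

# New ASCII cpio ("newc", magic 070701; with CRC, magic 070702): a 6-byte
# magic followed by 13 fields of 8 hex digits, then the pathname, with
# padding to a multiple of 4 bytes.
CPIO_MAGIC_CRC = "070702"
CPIO_HEADER_LEN = 6 + 13 * 8  # 110 bytes before the pathname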
The following tests assume the directory and file structure outlined below:
WORKING_DIR/
    bak/
        1.pdf
        2.pdf
        3.pdf
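For anyone who wants to reproduce the tests, any three files will do. The following sketch recreates the layout above with throwaway random content (a recent Python is assumed; the sizes are arbitrary, the original files were real PDFs):

import os

# Create bak/ with three dummy "PDFs" made of random bytes.
os.makedirs("bak", exist_ok=True)
for name in ("1.pdf", "2.pdf", "3.pdf"):
    with open(os.path.join("bak", name), "wb") as f:
        f.write(os.urandom(4 * 1024 * 1024))  # 4 MiB each, size is arbitrary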
Before continuing I would like to thank the folks on the Linux-Greek-Users mailing list for their advice and ideas. I initially posted the following material to the LGU list.
TAR Tests
Testing corruption of tar archives.
Random 1-byte corruption of the tar archive
In this test one random byte of the archive was replaced by a zero (0).
$ md5sum bak/*
11875e4e35a40686d81a37aa448aac2e  bak/1.pdf
30c63be455dbada1ffc985c5465d0723  bak/2.pdf
096dc1c77a2a0f4d9f953abd7264843f  bak/3.pdf

$ tar -cvf bak.tar bak/
bak/
bak/2.pdf
bak/3.pdf
bak/1.pdf

$ tar -dvf bak.tar bak/
bak/
bak/2.pdf
bak/3.pdf
bak/1.pdf
$ python -c 'f=open("bak.tar","r+"); f.seek(12334); f.write("0"); f.close()'
$ tar -dvf bak.tar bak/
bak/
bak/2.pdf
bak/3.pdf
bak/3.pdf: Contents differ
bak/1.pdf

$ mkdir out

$ tar -xvf bak.tar -C out/
bak/
bak/2.pdf
bak/3.pdf
bak/1.pdf

$ md5sum out/bak/*
11875e4e35a40686d81a37aa448aac2e  out/bak/1.pdf
30c63be455dbada1ffc985c5465d0723  out/bak/2.pdf
2d0b2aa54047d6e97b45fbb43f8f1bdc  out/bak/3.pdf
Conclusion: The md5 sums of the original 3.pdf and of the extracted 3.pdf differ. The rest of the files have been extracted accurately.
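For reference, the fixed-offset one-liner used above can be generalized. The following is a minimal sketch (the function is my own, not part of any tool used in these tests) that overwrites one randomly chosen byte of a file with the ASCII character "0":

import os
import random

def corrupt_one_byte(path):
    # Pick a random position in the file and overwrite that byte with "0".
    offset = random.randrange(os.path.getsize(path))
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(b"0")
    return offset  # report where the damage was done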
Partial corruption of one of the archived files' metadata
In this test, 200 of the 500 metadata bytes of the 2nd archived file are destroyed. Note that the 1st archived file is the directory bak/.
$ md5sum bak/*
b0ec395ca8cb79f2ce98397ec0e00981  bak/1.pdf
fbe2f3f799579251682ee6de0e4d828d  bak/2.pdf
afb18f2dbbb43673c641691b458dbcce  bak/3.pdf

$ tar -cvf bak.tar bak/
bak/
bak/2.pdf
bak/3.pdf
bak/1.pdf

$ tar -dvf bak.tar bak/
bak/
bak/2.pdf
bak/3.pdf
bak/1.pdf
In the USTAR format the metadata fields occupy 500 bytes of each 512-byte header block. The tar magic string starts at offset 257 from the beginning of the header. In this test, as already mentioned, 200 bytes are destroyed (the byte range 200-400 of the 2nd file's header):
$ python
Python 2.5.1 (r251:54863, Oct 30 2007, 13:54:11)
[GCC 4.1.2 20070925 (Red Hat 4.1.2-33)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> magic = "ustar \x00"
>>> f = open("bak.tar", "rb+")
>>> magic2_pos = f.read().find(magic, 258)  # skip the 1st header's magic (offset 257), find the 2nd
>>> meta2_start = magic2_pos - 57           # 257 - 57 = 200: byte 200 of the 2nd header
>>> f.seek(meta2_start)
>>> f.write("0"*200)                        # destroy header bytes 200-400
>>> f.close()
>>>
$ tar -dvf bak.tar bak/
bak/
tar: Skipping to next header
bak/3.pdf
bak/1.pdf
tar: Error exit delayed from previous errors
$ mkdir out
$ tar -xvf bak.tar -C out/
bak/
tar: Skipping to next header
bak/3.pdf
bak/1.pdf
tar: Error exit delayed from previous errors

$ md5sum out/bak/*
b0ec395ca8cb79f2ce98397ec0e00981  out/bak/1.pdf
afb18f2dbbb43673c641691b458dbcce  out/bak/3.pdf
Conclusion: Although one of the archived files' metadata has been destroyed, tar has managed to successfully extract the rest of the files, even though they were located after the corrupted part of the archive. The success of the extraction is confirmed by comparing the extracted files' md5 sums with the checksums of the original files.
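tar can resynchronise like this because every 512-byte header block carries its own checksum, so a reader can tell a plausible header from garbage while scanning forward. The sketch below shows the header-level test the USTAR format makes possible; it is my own illustration of the idea, not GNU tar's actual implementation:

def looks_like_ustar_header(block):
    # A plausible USTAR header: 512 bytes, "ustar" magic at offset 257, and
    # a checksum field (offset 148, octal ASCII) equal to the sum of all
    # header bytes with the checksum field itself counted as eight spaces.
    if len(block) != 512 or block[257:262] != b"ustar":
        return False
    try:
        stored = int(block[148:156].split(b"\x00")[0].strip() or b"0", 8)
    except ValueError:
        return False
    computed = sum(block[:148]) + 8 * ord(" ") + sum(block[156:])
    return computed == stored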
CPIO Tests
Testing corruption of cpio archives.
Random 1-byte corruption of the cpio archive
In this test one random byte of the archive was replaced by a zero (0).
$ md5sum bak/*
11875e4e35a40686d81a37aa448aac2e  bak/1.pdf
30c63be455dbada1ffc985c5465d0723  bak/2.pdf
096dc1c77a2a0f4d9f953abd7264843f  bak/3.pdf

$ find bak/ | cpio -v -o -H crc > bak.cpio
bak/
bak/2.pdf
bak/3.pdf
bak/1.pdf
25919 blocks

$ cpio -vi --only-verify-crc < bak.cpio
bak/
bak/2.pdf
bak/3.pdf
bak/1.pdf
25919 blocks
$ python -c 'f=open("bak.cpio","r+"); f.seek(12334); f.write("0"); f.close()'
$ cpio -v -i --only-verify-crc < bak.cpio
bak/
bak/2.pdf
cpio: bak/3.pdf: checksum error (0x2b7dbd48, should be 0x2b7dbda8)
bak/3.pdf
bak/1.pdf
25919 blocks
$ mkdir out2
$ cd out2/
$ cpio -vid < ../bak.cpio
bak
bak/2.pdf
cpio: bak/3.pdf: checksum error (0x2b7dbd48, should be 0x2b7dbda8)
bak/3.pdf
bak/1.pdf
25919 blocks
$ cd ..
$ md5sum out2/bak/*
11875e4e35a40686d81a37aa448aac2e  out2/bak/1.pdf
30c63be455dbada1ffc985c5465d0723  out2/bak/2.pdf
cd9ea8e6298a42f44b59322b31e55958  out2/bak/3.pdf
Conclusion: The md5 sums of the original 3.pdf and of the extracted 3.pdf differ. The rest of the files have been extracted accurately.
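As far as I can tell from the specification, the per-file checksum of the New CRC format is simply the 32-bit sum of the file's data bytes, stored in the last of the 13 header fields. The following simplified sketch (hard links, trailing garbage and other corner cases are ignored) verifies those checksums in place, which is roughly the check that --only-verify-crc performs:

def verify_crc_archive(path):
    # Walk a "-H crc" (magic 070702) archive and compare each entry's stored
    # checksum with the 32-bit sum of its data bytes.
    with open(path, "rb") as f:
        data = f.read()
    pos, results = 0, {}
    while data[pos:pos + 6] == b"070702":
        fields = [int(data[pos + 6 + 8 * i: pos + 6 + 8 * (i + 1)], 16)
                  for i in range(13)]
        filesize, namesize, stored = fields[6], fields[11], fields[12]
        name = data[pos + 110: pos + 110 + namesize - 1].decode()
        if name == "TRAILER!!!":
            break
        body_start = (pos + 110 + namesize + 3) & ~3   # header+name padded to 4
        body = data[body_start: body_start + filesize]
        results[name] = (sum(body) & 0xFFFFFFFF) == stored
        pos = (body_start + filesize + 3) & ~3         # file data padded to 4
    return results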
Corruption of one of the archived files' metadata
In this test the metadata of one of the archived files is destroyed.
$ md5sum bak/*
11875e4e35a40686d81a37aa448aac2e  bak/1.pdf
30c63be455dbada1ffc985c5465d0723  bak/2.pdf
096dc1c77a2a0f4d9f953abd7264843f  bak/3.pdf

$ find bak/ | cpio -v -o -H crc > bak.cpio
bak/
bak/2.pdf
bak/3.pdf
bak/1.pdf
25919 blocks
$ python
Python 2.5.1 (r251:54863, Oct 30 2007, 13:54:11)
[GCC 4.1.2 20070925 (Red Hat 4.1.2-33)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> magic = "070702"
>>> f = open("bak.cpio", "r+")
>>> magic2_pos = f.read().find(magic, 1)  # skip the 1st header's magic, find the 2nd
>>> f.seek(magic2_pos)
>>> metadata_length = 6 + 13*8 + 4        # magic + 13 hex fields + 4 bytes of the pathname
>>> f.write("0"*metadata_length)          # wipe the 2nd entry's entire header
>>> f.close()
>>>
$ cpio -v -i --only-verify-crc < bak.cpio
bak/
cpio: premature end of file
$ mkdir out3
$ cd out3
$ cpio -vid < ../bak.cpio
bak
cpio: premature end of file
Conclusion: Neither verification nor extraction succeeded. cpio (at least Fedora's version) cannot skip to the next healthy header, so the operation ends prematurely. A recovery tool is required in order to recover the healthy files from the archive.
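A recovery tool for this situation does not have to be complicated: since every entry of a New ASCII/CRC archive begins with the magic string 070702, a damaged archive can be scanned for the next occurrence of that string and parsing can resume from there. The following is a rough sketch of such a scan (my own illustration, not an existing tool); it only reports the offsets and names of plausible entries, leaving the actual extraction to the reader:

def scan_for_entries(path):
    # Report every plausible entry header in a (possibly damaged) 070702
    # archive, so the healthy files after a destroyed header can be located.
    with open(path, "rb") as f:
        data = f.read()
    pos = 0
    while True:
        pos = data.find(b"070702", pos)
        if pos == -1:
            break
        try:
            namesize = int(data[pos + 6 + 8 * 11: pos + 6 + 8 * 12], 16)
            name = data[pos + 110: pos + 110 + namesize - 1].decode("ascii", "replace")
            print("possible entry %r at offset %d" % (name, pos))
        except ValueError:
            pass  # false positive: "070702" occurred inside file data
        pos += 6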
Conclusion
Below are the pros and cons of each format (this is not a complete list):
CPIO
+ Per-file CRC checksum. The backed-up data on the DVD can be verified in place without the need for any 3rd-party software.
+ No limit on pathname length.
- When the cpio archive gets partially corrupted, as can happen on a DVD, the cpio program cannot skip the damaged files and move on to the next healthy archived file. Recovery software is needed.
- You have to use the find command's tests in order to include/exclude files in/from the archive.
- It cannot save extended attributes.
TAR
+ Even if some part of the archive gets corrupted, the tar program can skip to the next healthy archived file and extract it. This is very important, as it eliminates the need for 3rd-party recovery software.
+ File and directory inclusions/exclusions are possible both with command-line options and with file/dir lists read from a file.
+ It can save extended attributes, but 3rd-party software may not be able to read such an archive correctly.
- No CRC checksum is saved, so checking the data in place requires two things: having kept the checksums of the archived files, and having an external program that can check those checksums against the archived data (see the sketch after this list). If this is not possible, then the data must also be kept on the hard drive in addition to the backup, so that the two copies can be compared with tar's -d switch.
- Pathname length in the USTAR format is limited: the name field holds at most 100 bytes, with an optional 155-byte prefix for the leading directories.
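To illustrate the workaround described in the last item above, here is a minimal sketch that records an md5 manifest when the backup is created and later checks whatever tar managed to extract against it. It is only an illustration of the idea, not the Veritar utility mentioned below, and the paths used are examples:

import hashlib
import os

def write_manifest(top_dir, manifest_path):
    # Record "md5sum  path" lines for every file under top_dir (e.g. bak/).
    with open(manifest_path, "w") as out:
        for root, _dirs, files in os.walk(top_dir):
            for name in files:
                path = os.path.join(root, name)
                with open(path, "rb") as f:
                    out.write("%s  %s\n" % (hashlib.md5(f.read()).hexdigest(), path))

def check_extracted(extract_dir, manifest_path):
    # Compare the extracted files with the checksums recorded at backup time.
    for line in open(manifest_path):
        digest, path = line.rstrip("\n").split("  ", 1)
        try:
            with open(os.path.join(extract_dir, path), "rb") as f:
                ok = hashlib.md5(f.read()).hexdigest() == digest
        except IOError:
            ok = False  # the file was lost together with the corrupted region
        print("%s  %s" % ("OK " if ok else "BAD", path))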
It is obvious that both formats, and the programs that handle them, are incomplete in some respect; the pros of one are the cons of the other. This came as rather a surprise.
My final choice was the tar format, because I consider the fact that it does not need a 3rd-party program to extract the data from a damaged archive to be a great advantage. I have also created a utility, Veritar, that can verify the md5 sums of the files inside a tar archive against md5 sums kept in a separate file at archive-creation time. More information in my upcoming post about tar crc/md5 verification...