A few days ago I decided to revise my data backup methods, so that I would be able to easily recover as much data as possible after a partial corruption of the medium, that is, of the DVD on which the data is stored. I should clarify that by corruption I do not mean mechanical damage to the medium. After some research on the web and some questions on mailing lists and IRC channels, the quest ended with two formats to choose from: tar and cpio.
What matters most to me when a backup is partially corrupted is being able to easily extract the archived files that are still intact. In order to make a final decision about which format to use, I performed the following tests:
- Tests using tar:
  - Random 1-byte corruption.
  - Partial corruption of one of the archived files' metadata.
- Tests with cpio:
  - Random 1-byte corruption.
  - Total corruption of one of the archived files' metadata (same result with partial header corruption).
Information about the two formats was found at the following web pages:
- CPIO specification (New ASCII format with CRC added)
- TAR specification (USTAR format)
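To make the offsets used in the scripts below easier to follow, here is a rough reference of the two header layouts as I read them from the specifications above. It is only a sketch of the fields this article relies on, not a complete description of either format:

# USTAR: every header occupies one 512-byte block; the defined fields use
# the first 500 bytes of it (all offsets/sizes in bytes).
USTAR_BLOCK = 512
USTAR_NAME = (0, 100)        # pathname
USTAR_SIZE = (124, 12)       # file size, octal ASCII
USTAR_CHKSUM = (148, 8)      # header checksum, octal ASCII
USTAR_MAGIC_OFFSET = 257     # the "ustar" magic string

# New ASCII cpio ("newc", magic 070701; with CRC, magic 070702): a 6-byte
# magic followed by 13 fields of 8 hex digits, then the pathname, with
# padding to a multiple of 4 bytes.
CPIO_MAGIC_CRC = "070702"
CPIO_HEADER_LEN = 6 + 13 * 8  # 110 bytes before the pathname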
The following tests assume the directory and file structure outlined below:
WORKING_DIR/
    bak/
        1.pdf
        2.pdf
        3.pdf
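For anyone who wants to reproduce the tests, any three files will do. The following sketch recreates the layout above with throwaway random content (a recent Python is assumed; the sizes are arbitrary, the original files were real PDFs):

import os

# Create bak/ with three dummy "PDFs" made of random bytes.
os.makedirs("bak", exist_ok=True)
for name in ("1.pdf", "2.pdf", "3.pdf"):
    with open(os.path.join("bak", name), "wb") as f:
        f.write(os.urandom(4 * 1024 * 1024))  # 4 MiB each, size is arbitrary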
Before continuing I would like to thank the folks on the Linux-Greek-Users mailing list for their advice and ideas. I initially posted the following material to the LGU list.
TAR Tests
Testing corruption of tar archives.
Random 1-byte corruption of the tar archive
In this test one random byte of the archive was replaced by a zero (0).
$ md5sum bak/*
11875e4e35a40686d81a37aa448aac2e  bak/1.pdf
30c63be455dbada1ffc985c5465d0723  bak/2.pdf
096dc1c77a2a0f4d9f953abd7264843f  bak/3.pdf

$ tar -cvf bak.tar bak/
bak/
bak/2.pdf
bak/3.pdf
bak/1.pdf

$ tar -dvf bak.tar bak/
bak/
bak/2.pdf
bak/3.pdf
bak/1.pdf
$ python -c 'f=open("bak.tar","r+"); f.seek(12334); f.write("0"); f.close()'
$ tar -dvf bak.tar bak/
bak/
bak/2.pdf
bak/3.pdf
bak/3.pdf: Contents differ
bak/1.pdf

$ mkdir out

$ tar -xvf bak.tar -C out/
bak/
bak/2.pdf
bak/3.pdf
bak/1.pdf

$ md5sum out/bak/*
11875e4e35a40686d81a37aa448aac2e  out/bak/1.pdf
30c63be455dbada1ffc985c5465d0723  out/bak/2.pdf
2d0b2aa54047d6e97b45fbb43f8f1bdc  out/bak/3.pdf
Conclusion: The md5 sums of the original 3.pdf and of the extracted 3.pdf differ. The rest of the files have been extracted accurately.
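For reference, the fixed-offset one-liner used above can be generalized. The following is a minimal sketch (the function is my own, not part of any tool used in these tests) that overwrites one randomly chosen byte of a file with the ASCII character "0":

import os
import random

def corrupt_one_byte(path):
    # Pick a random position in the file and overwrite that byte with "0".
    offset = random.randrange(os.path.getsize(path))
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(b"0")
    return offset  # report where the damage was done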
Partial corruption of one of the archived files' metadata
In this test, 200 of the 500 metadata bytes of the 2nd archived file are destroyed. Note that the 1st archived file is the directory bak/.
$ md5sum bak/*
b0ec395ca8cb79f2ce98397ec0e00981  bak/1.pdf
fbe2f3f799579251682ee6de0e4d828d  bak/2.pdf
afb18f2dbbb43673c641691b458dbcce  bak/3.pdf

$ tar -cvf bak.tar bak/
bak/
bak/2.pdf
bak/3.pdf
bak/1.pdf

$ tar -dvf bak.tar bak/
bak/
bak/2.pdf
bak/3.pdf
bak/1.pdf
In the USTAR format the metadata fields occupy 500 bytes of each 512-byte header block. The tar magic string starts at offset 257 from the beginning of the header. In this test, as already mentioned, 200 bytes are destroyed (the byte range 200-400 of the 2nd file's header):
$ python
Python 2.5.1 (r251:54863, Oct 30 2007, 13:54:11)
[GCC 4.1.2 20070925 (Red Hat 4.1.2-33)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> magic = "ustar \x00"
>>> f = open("bak.tar", "rb+")
>>> magic2_pos = f.read().find(magic, 258)  # skip the 1st header's magic (offset 257), find the 2nd
>>> meta2_start = magic2_pos - 57           # 257 - 57 = 200: byte 200 of the 2nd header
>>> f.seek(meta2_start)
>>> f.write("0"*200)                        # destroy header bytes 200-400
>>> f.close()
>>>
$ tar -dvf bak.tar bak/
bak/
tar: Skipping to next header
bak/3.pdf
bak/1.pdf
tar: Error exit delayed from previous errors
$ mkdir out
$ tar -xvf bak.tar -C out/
bak/
tar: Skipping to next header
bak/3.pdf
bak/1.pdf
tar: Error exit delayed from previous errors

$ md5sum out/bak/*
b0ec395ca8cb79f2ce98397ec0e00981  out/bak/1.pdf
afb18f2dbbb43673c641691b458dbcce  out/bak/3.pdf
Conclusion: Although one of the archived files' metadata has been destroyed, tar has managed to successfully extract the rest of the files, even though they were located after the corrupted part of the archive. The success of the extraction is confirmed by comparing the extracted files' md5 sums with the checksums of the original files.
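tar can resynchronise like this because every 512-byte header block carries its own checksum, so a reader can tell a plausible header from garbage while scanning forward. The sketch below shows the header-level test the USTAR format makes possible; it is my own illustration of the idea, not GNU tar's actual implementation:

def looks_like_ustar_header(block):
    # A plausible USTAR header: 512 bytes, "ustar" magic at offset 257, and
    # a checksum field (offset 148, octal ASCII) equal to the sum of all
    # header bytes with the checksum field itself counted as eight spaces.
    if len(block) != 512 or block[257:262] != b"ustar":
        return False
    try:
        stored = int(block[148:156].split(b"\x00")[0].strip() or b"0", 8)
    except ValueError:
        return False
    computed = sum(block[:148]) + 8 * ord(" ") + sum(block[156:])
    return computed == stored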
CPIO Tests
Testing corruption of cpio archives.
Random 1-byte corruption of the cpio archive
In this test one random byte of the archive was replaced by a zero (0).
$ md5sum bak/*
11875e4e35a40686d81a37aa448aac2e  bak/1.pdf
30c63be455dbada1ffc985c5465d0723  bak/2.pdf
096dc1c77a2a0f4d9f953abd7264843f  bak/3.pdf

$ find bak/ | cpio -v -o -H crc > bak.cpio
bak/
bak/2.pdf
bak/3.pdf
bak/1.pdf
25919 blocks

$ cpio -vi --only-verify-crc < bak.cpio
bak/
bak/2.pdf
bak/3.pdf
bak/1.pdf
25919 blocks
$ python -c 'f=open("bak.cpio","r+"); f.seek(12334); f.write("0"); f.close()'
$ cpio -v -i --only-verify-crc < bak.cpio
bak/
bak/2.pdf
cpio: bak/3.pdf: checksum error (0x2b7dbd48, should be 0x2b7dbda8)
bak/3.pdf
bak/1.pdf
25919 blocks
$ mkdir out2
$ cd out2/
$ cpio -vid < ../bak.cpio
bak
bak/2.pdf
cpio: bak/3.pdf: checksum error (0x2b7dbd48, should be 0x2b7dbda8)
bak/3.pdf
bak/1.pdf
25919 blocks
$ cd ..
$ md5sum out2/bak/*
11875e4e35a40686d81a37aa448aac2e  out2/bak/1.pdf
30c63be455dbada1ffc985c5465d0723  out2/bak/2.pdf
cd9ea8e6298a42f44b59322b31e55958  out2/bak/3.pdf
Conclusion: The md5 sums of the original 3.pdf and of the extracted 3.pdf differ. The rest of the files have been extracted accurately.
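As far as I can tell from the specification, the per-file checksum of the New CRC format is simply the 32-bit sum of the file's data bytes, stored in the last of the 13 header fields. The following simplified sketch (hard links, trailing garbage and other corner cases are ignored) verifies those checksums in place, which is roughly the check that --only-verify-crc performs:

def verify_crc_archive(path):
    # Walk a "-H crc" (magic 070702) archive and compare each entry's stored
    # checksum with the 32-bit sum of its data bytes.
    with open(path, "rb") as f:
        data = f.read()
    pos, results = 0, {}
    while data[pos:pos + 6] == b"070702":
        fields = [int(data[pos + 6 + 8 * i: pos + 6 + 8 * (i + 1)], 16)
                  for i in range(13)]
        filesize, namesize, stored = fields[6], fields[11], fields[12]
        name = data[pos + 110: pos + 110 + namesize - 1].decode()
        if name == "TRAILER!!!":
            break
        body_start = (pos + 110 + namesize + 3) & ~3   # header+name padded to 4
        body = data[body_start: body_start + filesize]
        results[name] = (sum(body) & 0xFFFFFFFF) == stored
        pos = (body_start + filesize + 3) & ~3         # file data padded to 4
    return results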
Corruption of one of the archived files' metadata
In this test the metadata of one of the archived files is destroyed.
$ md5sum bak/*
11875e4e35a40686d81a37aa448aac2e  bak/1.pdf
30c63be455dbada1ffc985c5465d0723  bak/2.pdf
096dc1c77a2a0f4d9f953abd7264843f  bak/3.pdf

$ find bak/ | cpio -v -o -H crc > bak.cpio
bak/
bak/2.pdf
bak/3.pdf
bak/1.pdf
25919 blocks
$ python
Python 2.5.1 (r251:54863, Oct 30 2007, 13:54:11)
[GCC 4.1.2 20070925 (Red Hat 4.1.2-33)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> magic = "070702"
>>> f = open("bak.cpio", "r+")
>>> magic2_pos = f.read().find(magic, 1)  # skip the 1st header's magic, find the 2nd
>>> f.seek(magic2_pos)
>>> metadata_length = 6 + 13*8 + 4        # magic + 13 hex fields + 4 bytes of the pathname
>>> f.write("0"*metadata_length)          # wipe the 2nd entry's entire header
>>> f.close()
>>>
$ cpio -v -i --only-verify-crc < bak.cpio
bak/
cpio: premature end of file
$ mkdir out3
$ cd out3
$ cpio -vid < ../bak.cpio
bak
cpio: premature end of file
Conclusion: Neither verification nor extraction succeeded. cpio (at least Fedora's version) cannot skip to the next healthy header, so the operation ends prematurely. A recovery tool is required in order to recover the healthy files from the archive.
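A recovery tool for this situation does not have to be complicated: since every entry of a New ASCII/CRC archive begins with the magic string 070702, a damaged archive can be scanned for the next occurrence of that string and parsing can resume from there. The following is a rough sketch of such a scan (my own illustration, not an existing tool); it only reports the offsets and names of plausible entries, leaving the actual extraction to the reader:

def scan_for_entries(path):
    # Report every plausible entry header in a (possibly damaged) 070702
    # archive, so the healthy files after a destroyed header can be located.
    with open(path, "rb") as f:
        data = f.read()
    pos = 0
    while True:
        pos = data.find(b"070702", pos)
        if pos == -1:
            break
        try:
            namesize = int(data[pos + 6 + 8 * 11: pos + 6 + 8 * 12], 16)
            name = data[pos + 110: pos + 110 + namesize - 1].decode("ascii", "replace")
            print("possible entry %r at offset %d" % (name, pos))
        except ValueError:
            pass  # false positive: "070702" occurred inside file data
        pos += 6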
Conclusion
Below are the pros and cons of each format (this is not a complete list):
CPIO
+ Per-file CRC checksum. The backed-up data on the DVD can be verified in place without the need for any 3rd-party software.
+ No limit on pathname length.
- When the cpio archive gets partially corrupted, as can happen on a DVD, the cpio program cannot skip the damaged files and move on to the next healthy archived file. Recovery software is needed.
- You have to use the find command's tests in order to include/exclude files in/from the archive.
- It cannot save extended attributes.
TAR
+ Even if some part of the archive gets corrupted, the tar program can skip to the next healthy archived file and extract it. This is very important, as it eliminates the need for 3rd-party recovery software.
+ File and directory inclusions/exclusions are possible both with command-line options and with file/dir lists read from a file.
+ It can save extended attributes, but 3rd-party software may not be able to read such an archive correctly.
- No CRC checksum is saved, so checking the data in place requires two things: having kept the checksums of the archived files, and having an external program that can check those checksums against the archived data (see the sketch after this list). If this is not possible, then the data must also be kept on the hard drive in addition to the backup, so that the two copies can be compared with tar's -d switch.
- Pathname length in the USTAR format is limited: the name field holds at most 100 bytes, with an optional 155-byte prefix for the leading directories.
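To illustrate the workaround described in the last item above, here is a minimal sketch that records an md5 manifest when the backup is created and later checks whatever tar managed to extract against it. It is only an illustration of the idea, not the Veritar utility mentioned below, and the paths used are examples:

import hashlib
import os

def write_manifest(top_dir, manifest_path):
    # Record "md5sum  path" lines for every file under top_dir (e.g. bak/).
    with open(manifest_path, "w") as out:
        for root, _dirs, files in os.walk(top_dir):
            for name in files:
                path = os.path.join(root, name)
                with open(path, "rb") as f:
                    out.write("%s  %s\n" % (hashlib.md5(f.read()).hexdigest(), path))

def check_extracted(extract_dir, manifest_path):
    # Compare the extracted files with the checksums recorded at backup time.
    for line in open(manifest_path):
        digest, path = line.rstrip("\n").split("  ", 1)
        try:
            with open(os.path.join(extract_dir, path), "rb") as f:
                ok = hashlib.md5(f.read()).hexdigest() == digest
        except IOError:
            ok = False  # the file was lost together with the corrupted region
        print("%s  %s" % ("OK " if ok else "BAD", path))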
It is obvious that both formats, and the programs that handle them, are incomplete in some respect; the pros of one are the cons of the other. This came as rather a surprise.
My final choice was the tar format, because I consider the fact that it does not need a 3rd-party program to extract the data from a damaged archive to be a great advantage. I have also created a utility, Veritar, that can verify the md5 sums of the files inside a tar archive against md5 sums kept in a separate file at archive-creation time. More information in my upcoming post about tar crc/md5 verification...