The KB has quite a large collection of offline optical media, such as CD-ROMs, DVDs and audio CDs. We’re currently investigating how to stabilise the contents of these materials using disk imaging. During the initial phase of this work I did a number of tests with various open-source tools. It’s doubtful whether we’ll end up using these same tools in our actual workflows. The main reason for this is the sheer size of the collection, which we estimated at some 15,000 physical carriers, possibly even more. At those volumes we will need a solution that involves a disk robot, and these often require dedicated software (we still need to investigate this in more depth).
Nevertheless, throughout the initial testing phase I was surprised at the number of useful tools that are available in the open source domain. Since this will probably be of interest to others as well, I decided to polish a selection from my rough working notes into a somewhat more digestible form (or so I hope!). I edited my original notes down to the following topics:
- How to figure out the device path of the CD drive
- How to create an ISO image from a CD-ROM or DVD
- How to check the integrity of the created ISO image
- How to extract audio from an audio CD
In addition, there’s a final section that covers my attempts at imaging a multisession / mixed mode CD. This particular exercise wasn’t all that successful, but I included it anyway, as some may find it useful. All of the software mentioned here is open source and available for any modern Linux distribution (I’m using Linux Mint myself). Some of the tools can also be used under Windows via Cygwin.
Find the device path of the CD drive (Linux)
The majority of the tools covered by this blog post need the device path of the CD drive as a command-line argument. Under Linux you can usually find this by inspecting the output of the following command (run this while a CD or DVD is inserted in your drive):
mount
If all goes well, the result will look similar to this:
/dev/sda1 on / type ext4 (rw,errors=remount-ro)
/dev/sr0 on /media/johan/REBELS_0 type iso9660 (ro,nosuid,nodev,uid=1000,gid=1000,iocharset=utf8,mode=0400,dmode=0500,uhelper=udisks2)
So, in this case the path to the CD drive is /dev/sr0 (if you have multiple optical drives you may also see /dev/sr1, and so on).
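For scripting, picking the device out of the mount output can be sketched as below. This is a minimal sketch, not a robust detector: the function name optical_devices is hypothetical, and it assumes that an inserted disc is mounted with file system type iso9660 or udf, as in the example output above.

```shell
# Hypothetical helper: read `mount`-style lines on stdin and print the
# device (first column) of every entry mounted as iso9660 or udf.
# Assumption: an inserted CD or DVD shows up with one of these two types.
optical_devices() {
    awk '/ type (iso9660|udf) / { print $1 }'
}
```

Typical use would be `mount | optical_devices`. An unmounted disc or an audio CD will not show up this way.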
Finding the device path on Windows (Cygwin)
For some reason the mount command doesn’t print any device paths under Cygwin. Instead, try this:
ls /dev
Which produces a list of all devices:
clipboard  dsp   mqueue  random  sda2  sdc1    stdin   ttyS2
conin      fd    null    scd0    sdb   shm     stdout  urandom
conout     full  ptmx    sda     sdb1  sr0     tty     windows
console    kmsg  pty0    sda1    sdc   stderr  ttyS0   zero
In the above output both sr0 and scd0 point to the CD drive, and either of the full paths /dev/sr0 and /dev/scd0 will work (again, with multiple drives you may be looking for /dev/sr1 or /dev/scd1).
In all examples below I assume that the device path is /dev/sr0; substitute your own path if necessary.
Create ISO image of a CD-ROM or DVD
A number of tools allow you to create an (ISO[1]) image from a CD-ROM or DVD. Although generic Unix data copying and recovery tools like dd and ddrescue are often used for this, various people have pointed out that the result may be unreliable because these tools only perform limited error checking. See for example the comments here and here; both recommend using the readom tool, which is part of cdrkit.
First unmount the disk [2]:
umount /dev/sr0
Then run readom as root:
sudo readom retries=4 dev=/dev/sr0 f=mydisk.iso
Here the value of the retries parameter defines the number of attempts that readom will make at trying to recover unreadable sectors. The default value is 128, which can result in huge processing times for CDs that are seriously damaged. The f parameter sets the name of the image file that is created. If all goes well the following output is printed to the screen at the end of the imaging process:
Read speed: 4234 kB/s (CD 24x, DVD 3x).
Write speed: 0 kB/s (CD 0x, DVD 0x).
Capacity: 309104 Blocks = 618208 kBytes = 603 MBytes = 633 prMB
Sectorsize: 2048 Bytes
Copy from SCSI (10,0,0) disk to file 'mydisk.iso'
end: 309104
addr: 309104 cnt: 44
Time total: 259.287sec
Read 618208.00 kB at 2384.3 kB/sec.
Check integrity of ISO image against physical CD-ROM or DVD
You can check the integrity of the created ISO image by computing a checksum on both the ISO file and the physical disk, and then comparing the two:
md5sum mydisk.iso
md5sum /dev/sr0
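To avoid comparing the two hashes by eye, the check can be wrapped in a small function. This is only a sketch: same_md5 is a hypothetical name, and it assumes md5sum from GNU coreutils is available and that both paths are readable by the current user.

```shell
# Hypothetical helper: succeed (exit status 0) if both arguments have the
# same MD5 checksum, fail with 1 if they differ, and with 2 if either
# path cannot be read.
same_md5() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    # Reading via stdin keeps the file name out of md5sum's output,
    # so only the hash itself is compared.
    sum_a=$(md5sum < "$1" | cut -d ' ' -f 1)
    sum_b=$(md5sum < "$2" | cut -d ' ' -f 1)
    [ "$sum_a" = "$sum_b" ]
}
```

For example: `same_md5 mydisk.iso /dev/sr0 && echo "image matches disk"`.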
Note that the aforementioned Aaron Toponce article claims that readom already does a checksum check. If true, the additional check would be overkill (especially given that computing a checksum on a physical CD or DVD is time-consuming). However, I couldn’t find any confirmation of this in either readom’s documentation or its source code (although I found the source hard to read, so I may have simply overlooked it).
Verify ISO image
In theory, there shouldn’t be any need for additional quality checks on an ISO image once its integrity against the physical carrier is confirmed by the checksum. However, since cdrkit includes an isovfy tool that claims to "verify the integrity of an iso9660 image", I decided I might as well give it a try. It works by entering:
isovfy mydisk.iso
Here’s some example output:
Root at extent 13, 2048 bytes
[0 0]
No errors found
The documentation of the tool isn’t very clear about what specific checks it performs. In one of my tests I fed it an ISO image that had its last 50 MB missing (truncated). This did not result in any error or warning message! Most of the reported isovfy errors that I came across in my tests simply reflected the file system on the physical CD not conforming to ISO 9660 (this seems to be pretty common). Based on this it looks like isovfy isn’t very useful after all.
Get information about an ISO image
The Primary Volume Descriptor (PVD) of an ISO 9660 file system contains general information about the CD or DVD. The isoinfo tool (which is also part of cdrkit) is able to print the most important PVD fields to the screen:
isoinfo -d -i mydisk.iso
CD-ROM is in ISO 9660 format
System id:
Volume id: REBELS_0
Volume set id:
Publisher id:
Data preparer id:
Application id: NERO - BURNING ROM
Copyright File id:
Abstract File id:
Bibliographic File id:
Volume set size is: 1
Volume set sequence number is: 1
Logical block size is: 2048
Volume size is: 333151
Joliet with UCS level 3 found
NO Rock Ridge present
You can also run isoinfo directly on the physical carrier:
isoinfo -d -i /dev/sr0
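The PVD fields also offer a cheap way to spot truncated images, which isovfy did not catch in my test. In ISO 9660 the volume size is counted in logical blocks, so multiplying "Volume size" by "Logical block size" gives the minimum number of bytes the image should contain. The sketch below parses the isoinfo -d output for this; the function names are hypothetical, and it assumes GNU stat is available.

```shell
# Hypothetical helper: read `isoinfo -d` output on stdin and print the
# expected image size in bytes (logical block size x volume size).
# Exits with status 1 if either field is missing.
expected_iso_bytes() {
    awk -F': *' '
        /^Logical block size is:/ { block = $2 }
        /^Volume size is:/        { blocks = $2 }
        END {
            if (block == "" || blocks == "") exit 1
            printf "%.0f\n", block * blocks
        }'
}

# Hypothetical wrapper: succeed if the image file is at least as large as
# its PVD says it should be. Returns 2 if the PVD could not be parsed.
iso_is_complete() {
    expected=$(isoinfo -d -i "$1" | expected_iso_bytes) || return 2
    actual=$(stat -c %s "$1") || return 2   # file size in bytes (GNU stat)
    [ "$actual" -ge "$expected" ]
}
```

The comparison uses "at least as large" rather than "equal" because an image may contain data beyond the end of the volume; a file that is smaller than the expected size is the case that signals truncation.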
To get a listing of all files and directories that are part of the filesystem, use this:
isoinfo -f -i mydisk.iso
/AUTORUN.EXE;1
/AUTORUN.INF;1
/DISK0
/LICENSE2.TXT;1
/LICENSEF.TXT;1
/LICENSEU.TXT;1
/SETUP.EXE;1
/DISK0/CONTROLS.CFG;1
/DISK0/DISK0;1
::
::
etc
It looks like all items that are followed by ;1 are files, and those that aren’t are directories. Also, the -l option can be used for a detailed list that includes additional file attributes (size, date, etc.).
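If the ;1 observation holds (in ISO 9660 the number after the semicolon is the file version, which directories don’t carry), the listing can be split into files and directories with two small filters. This is a sketch under that assumption; the function names are hypothetical, and it expects the plain isoinfo -f listing shown above.

```shell
# Hypothetical helpers: read an `isoinfo -f` listing on stdin.
# Entries ending in ";<number>" are treated as files, all others as
# directories.
iso_files() { grep -E ';[0-9]+$'; }
iso_dirs()  { grep -Ev ';[0-9]+$'; }
```

For example, `isoinfo -f -i mydisk.iso | iso_dirs` would print only the directory entries.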
Rip audio CD with cdparanoia
The data structure of an audio CD is fundamentally different from a CD-ROM or DVD, and because of this its content cannot be stored as an ISO image. The most widely used approach is to extract (or "rip") the audio tracks on a CD to separate WAVE files. A complicating factor here is that the way audio is encoded on a CD tends to obscure (small) read errors during playback. As a result, a single linear read will not result in a reliable transfer of the audio data. More details can be found in this excellent article by Alexander Duryee. Duryee recommends a number of extraction tools that overcome this problem using sophisticated verification and correction functionality. One of these tools is the cdparanoia ripper. As an example, either of the following commands can be used to rip a CD in batch mode, with each track stored as a separate WAVE file:
cdparanoia -B -L
cdparanoia -B -l
The -L switch results in the generation of a detailed log file; -l produces a summary log (name: cdparanoia.log) [3]. File names are generated automatically like this:
track01.cdda.wav
track02.cdda.wav
track03.cdda.wav
Here is a link to an example log file. The output may look a little weird at first sight, which is because cdparanoia reports all status and progress information as symbols and smilies, respectively. Their meaning is explained in the documentation.
Extract data from multi-session / mixed mode CDs
Some CDs combine data and audio tracks. Examples are "enhanced" audio CDs that include software or movies as bonus material, as well as many ’90s video games. Even though the data part of such CDs is typically compatible with an ISO 9660 file system, the audio tracks are not. Since there is no good, open and mature file format to describe the contents of a CD precisely, such CDs pose a particular challenge. I’ve only done limited testing on mixed-mode CDs, and I haven’t figured out a satisfactory way to process them. However, since many people appear to be struggling with this, I’ll briefly report my results so far.
This article on the Linux Reviews site contains instructions on how to rip a mixed-mode CD using cdrdao. I followed these instructions in an attempt to make a copy of They Might Be Giants’ "No" album (which contains some video content). First I unmounted the disk:
umount /dev/sr0
Then I ran cdrdao with the following arguments:
cdrdao read-cd --read-raw --datafile no.bin --device /dev/sr0 --driver generic-mmc-raw no.toc
The result of this is a disk image in BIN/TOC format. The .toc file looks like this:
CD_DA

// Track 1
TRACK AUDIO
NO COPY
NO PRE_EMPHASIS
TWO_CHANNEL_AUDIO
ISRC "USIR70200001"
FILE "no.bin" 0 02:10:53

// Track 2
TRACK AUDIO
NO COPY
NO PRE_EMPHASIS
TWO_CHANNEL_AUDIO
ISRC "USIR70200002"
FILE "no.bin" 02:10:53 02:17:34
::
etc
Closer inspection showed that only the audio tracks were copied, not the data track! As a comparison, the example below from the cdrdao documentation shows the expected output:
CD_ROM
TRACK MODE1
DATAFILE "data_1"
ZERO 00:02:00 // post-gap

TRACK AUDIO
SILENCE 00:02:00 // pre-gap
START
FILE "data_2.wav" 0

TRACK AUDIO
FILE "data_3.wav" 0
In particular, I would expect the .toc file to start with CD_ROM, and I would also expect one TRACK MODE1 item for the data part of the disk. It’s not clear to me why my test produced a different result. Interestingly, I was able to make separate images of the data and the audio components of the disk by adding the --session option, which the cdrdao documentation describes as follows:
Used for read-toc and read-cd to specify the session which should be processed on multi session CDs.
Running cdrdao twice, with --session set to 1 and then to 2, resulted in two separate images that contain the audio and the file system data, respectively. This is probably not all that useful; if the data and audio content end up being separated anyway, I would much rather use readom to make an ISO of the data, and then run cdparanoia to rip the audio tracks to WAVE. In any case, this needs further work.
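A quick way to see whether a .toc file includes a data track, without reading it line by line, is to count its TRACK entries. The sketch below is based only on the two .toc examples shown earlier: toc_summary is a hypothetical name, and it assumes that the first non-comment line is the session type and that every track which is not TRACK AUDIO is a data track.

```shell
# Hypothetical helper: read a cdrdao .toc file on stdin and print the
# session type (first non-comment word) plus audio and data track counts.
toc_summary() {
    awk '
        /^[ \t]*\/\// { next }              # skip "// ..." comment lines
        header == "" && NF { header = $1 }  # e.g. CD_DA or CD_ROM
        $1 == "TRACK" { if ($2 == "AUDIO") audio++; else data++ }
        END { printf "%s audio=%d data=%d\n", header, audio, data }'
}
```

Running `toc_summary < no.toc` on a result like mine would report zero data tracks, which flags the missing data track right away.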
The rough, unedited notes on which this blog post is based can be found here (they contain some additional material that I left out here for readability).
Here’s an experimental Python script that verifies if the file size of a CD / DVD ISO 9660 image is consistent with the information in its Primary Volume Descriptor. This can be useful for detecting incomplete (e.g. truncated) ISO images.
[1] Whether the resulting image will conform to ISO 9660 depends on the source medium, as the image is simply a byte-exact copy of the data on the physical carrier’s file system. So for a DVD that uses the UDF format, the ISO image will be UDF as well.
[2] If you don’t do this you will end up with this error: Error trying to open /dev/sr0 exclusively (Device or resource busy)… retrying in 1 second.
[3] Strangely, in my tests a parse error occurred when I specified user-defined file names here. Also, it appeared that the summary log file resulted in more detailed output than the detailed one. This needs a more in-depth look!