This study compares the compression performance of four different image formats: JPEG, WebP, JPEG XR, and HEVC-MSP. The latter two formats were chosen because they are frequently discussed as possible JPEG successors. Two different JPEG encoders are tested, libjpeg-turbo and mozjpeg.
It is our intent to only address compression performance in this study. Other technical, legal, and market factors that might be considered when evaluating codecs are outside the scope of this study.
We chose to test with four algorithms: Y-SSIM, RGB-SSIM, MS-SSIM, and PSNR-HVS-M.
All of these algorithms compare two images and return a number indicating the degree to which the second image is similar to the first. In all cases, no matter what the scale, higher numbers indicate a higher degree of similarity.
It's unclear which algorithm is best in terms of human visual perception, so we tested with four of the most respected algorithms.
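To make the shared contract of these metrics concrete, here is a minimal sketch using plain PSNR. PSNR is *not* one of the four metrics in this study (it is a simpler stand-in chosen only for brevity), but it behaves the same way: it compares two same-sized images and returns a single score where higher means more similar.

```python
import numpy as np

def psnr(reference, distorted, max_value=255.0):
    """Peak signal-to-noise ratio between two same-sized images.
    Like the metrics used in this study, it returns one number
    where higher means the distorted image is closer to the
    reference (identical images give infinity)."""
    ref = np.asarray(reference, dtype=np.float64)
    dis = np.asarray(distorted, dtype=np.float64)
    mse = np.mean((ref - dis) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

# Toy 8-bit "images": a flat gray patch and a copy with one pixel off.
ref = np.full((8, 8), 128, dtype=np.uint8)
noisy = ref.copy()
noisy[0, 0] = 130
print(psnr(ref, ref))    # identical -> inf
print(psnr(ref, noisy))  # small distortion -> high score (~60 dB)
```

The four metrics actually used differ in which channels they examine and how they model perception, but all reduce an image pair to one such score.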
All results should be easily reproducible using publicly available tools. The following software is used to generate results for this study:
- libjpeg-turbo to encode and decode JPEG images. This study uses version 1.3.1.
- mozjpeg to encode JPEG images for mozjpeg results. This study uses version 2.0.
- libwebp to encode and decode WebP images. This study uses version 0.4.0.
- jxrlib to encode and decode JPEG XR images. This study uses git revision ccf11047dbec.
- The TAppEncoderStatic and TAppDecoderStatic programs to encode and decode HEVC-MSP images. Both programs are part of the jctvc-hm software package. No wrapper is needed, as these programs accept and output CCIR 601 full-range Y'CbCr 4:2:0. This study uses r4029 of the SVN-based source code.
- ImageMagick, whose identify tool is used to extract width and height information from images and whose convert tool is used to convert between PNG and PPM formats. This study uses version 6.8.9-5.

PNG test images are converted to CCIR 601 full-range Y'CbCr 4:2:0, which is fed directly into the encoders. Each encoder produces an image in its respective format, and we record the size of the resulting encoded file. HEVC-MSP files are penalized 80 bytes per image because HEVC-MSP is a raw bitstream with no container; this penalty approximates the size of container data. The encoded image is then decoded back to CCIR 601 full-range Y'CbCr 4:2:0, the same format the encoder was given. Quality scores are calculated from the Y'CbCr image fed to the encoder and the decoded Y'CbCr image.
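The per-image measurement loop can be sketched as follows. The encode/decode callables here are stand-ins for the real command-line tools; only the control flow and the 80-byte HEVC-MSP container penalty follow the study's methodology, and all names are illustrative rather than taken from the study's scripts.

```python
def measure(yuv420_frame, encode, decode, metric, codec, quality):
    """Return (penalized size in bytes, quality score) for one image
    at one quality level. `yuv420_frame` is the CCIR 601 full-range
    Y'CbCr 4:2:0 data handed to the encoder."""
    bitstream = encode(yuv420_frame, quality)
    # HEVC-MSP is a raw bitstream with no container, so 80 bytes are
    # added to approximate container overhead.
    size = len(bitstream) + (80 if codec == "hevc-msp" else 0)
    decoded = decode(bitstream)
    return size, metric(yuv420_frame, decoded)

# Identity stand-ins for illustration: a 1000-byte "bitstream" for
# HEVC-MSP is recorded as 1080 bytes.
size, score = measure(b"\x00" * 1000, lambda f, q: f, lambda b: b,
                      lambda a, b: float(a == b), "hevc-msp", 50)
print(size, score)  # 1080 1.0
```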
This is done for the top 75% of quality levels available (encoders typically have about 100
possible quality levels). We clip the bottom 25% of quality levels because these are rarely
used and results can be erratic due to heavily distorted images. We also clip a few quality
levels from the top of the HEVC and JPEG XR quality spectrum because they exceed the highest
quality levels that the other encoders are capable of achieving, thus making comparison
impossible. People are not likely to use these quality settings clipped from the upper end of
the quality spectrum either, since selecting a lower setting will result in a visually
indistinguishable image with a much smaller file size. See the rd_collect.py
script for more information on exactly what is clipped.
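The clipping described above can be sketched as a simple selection over a codec's quality scale. This is a simplified illustration, not the actual rd_collect.py code: the `keep_fraction` and `drop_top` parameters are hypothetical names, and the exact levels dropped per codec live in that script.

```python
def clipped_quality_levels(levels, keep_fraction=0.75, drop_top=0):
    """Return the quality levels retained for comparison: the bottom
    (1 - keep_fraction) of the range is dropped, plus optionally a
    few levels from the top (as done for HEVC and JPEG XR)."""
    levels = sorted(levels)
    start = int(len(levels) * (1.0 - keep_fraction))
    end = len(levels) - drop_top
    return levels[start:end]

# A JPEG-style 1..100 scale keeps levels 26 through 100:
print(clipped_quality_levels(range(1, 101))[:3])  # [26, 27, 28]
```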
After collecting encoded image size and quality metrics for each image at each quality
level, we average the file sizes and quality scores across all images. Quality scores are
weighted by pixel count. See the rd_average.py
script for more information.
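The averaging step amounts to the following. This is a simplified sketch of what rd_average.py computes, not its actual code: plain means for file sizes, pixel-count-weighted means for quality scores.

```python
def average_results(results):
    """results: list of (file_size_bytes, quality_score, pixel_count)
    tuples, one per image at a fixed quality level. File sizes are
    averaged plainly; quality scores are weighted by pixel count."""
    total_pixels = sum(p for _, _, p in results)
    mean_size = sum(s for s, _, _ in results) / len(results)
    mean_quality = sum(q * p for _, q, p in results) / total_pixels
    return mean_size, mean_quality

# The larger (300-pixel) image dominates the quality average:
size, quality = average_results([(1000, 40.0, 100), (3000, 50.0, 300)])
print(size, quality)  # 2000.0 47.5
```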
The following zip archive contains textual data (.out) files with the full results for this study. These can be opened with any text editor or imported into most spreadsheet programs.
The goal for this section is to visualize bits per pixel over the full range of quality options. In each graph, the Y axis represents quality and the X axis represents bits per pixel. There is one graph for each image set.
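The X-axis value is computed in the standard way: total coded bits divided by the number of pixels in the image.

```python
def bits_per_pixel(file_size_bytes, width, height):
    """X-axis value for the graphs: coded bits per image pixel."""
    return file_size_bytes * 8 / (width * height)

# A 24,000-byte file for a 512x512 image:
print(bits_per_pixel(24000, 512, 512))  # ~0.73 bpp
```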
Different people and organizations trust different metrics for measuring image quality, and thus compression performance. We included four metrics in this study because existing research has not established which metric best matches human visual perception.
Encoder developers typically use a particular metric, or set of metrics, to guide development. Encoders will typically perform best according to the metric(s) targeted by its developers. Theoretically, most encoders could be tuned to perform better on any particular metric while producing valid (compatible) image files. The mozjpeg encoder actually includes the option to target different metrics, so we can use it to illustrate this.
The default tuning configuration for mozjpeg targets PSNR-HVS-M; this is the default because it performs reasonably well across most metrics. This setting is used in all of the graphs shown earlier in this study.
In the following four graphs, mozjpeg is configured to tune for SSIM instead of PSNR-HVS-M. Notice how mozjpeg performs much better in the following graphs according to Y-SSIM and RGB-SSIM, though at the expense of PSNR-HVS-M and MS-SSIM.
The raw data for mozjpeg targeting the SSIM metric can be downloaded here.
Three of the metrics we use (Y-SSIM, MS-SSIM, and PSNR-HVS-M) measure image quality by considering only a single channel, e.g., luma, chroma-blue, or chroma-red, disregarding the others. The fourth metric we use (RGB-SSIM) combines the results for the luma and chroma planes, but only after color conversion. An issue with codec tuning based on rate-distortion curves for a single-plane metric is that it can produce sub-optimal results. A hypothetical codec that spends bits to improve luma quality and codes no chroma at all would score well on such a metric, because rate is typically computed from the entire coded image size (luma and chroma). Metrics that take both luma and chroma into account do so by averaging the results for each channel, which is problematic because we are not aware of any evidence that such averaging corresponds well with human perception. How to weight the different channels when averaging is something of a guess, and one which could well exaggerate the overall impact of chroma. This is an area we would like to see researched further.
Despite the above, there are several reasons to continue using luma-only metrics to measure codec performance. First, the human visual system is more sensitive to variations in brightness than in color; this is the motivation behind chroma subsampling. Second, in a 4:2:0 image the luma plane accounts for 2/3 of all pixel data, and due to coding techniques it often accounts for more than 2/3 of all bits in the coded image, so a luma-only approach can only be so wrong. Third, despite being globally decorrelated in an image, the chroma channels often correlate well with luma *locally*. Codecs typically exploit this fact, and a side effect is that improvements in luma coding alone can translate into improvements in chroma quality. Finally, it may be tempting to try to partition an image's size into bits that code luma and bits that code chroma. However, modern video codecs use multi-symbol probability models and adapt the context everywhere, so it is impossible to cleanly separate out a bit used just to code a luma value: its value influences the cost of coding chroma values and vice versa. So we are stuck with total bits per pixel as the most accurate measure of overall rate.
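The 2/3 figure for luma is simple arithmetic: in 4:2:0, each chroma plane is subsampled by 2 in both dimensions, so the two chroma planes together hold half as many samples as the luma plane.

```python
# Sample counts for a 4:2:0 image: chroma is subsampled 2x in each
# dimension, so luma contributes two thirds of all samples.
w, h = 512, 512
luma = w * h                      # 262144 luma samples
chroma = 2 * (w // 2) * (h // 2)  # Cb + Cr = 131072 samples
print(luma / (luma + chroma))     # 0.666... = 2/3
```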
Until research shows how to combine quality scores from the three (or more if you include alpha) image channels into a single useful value, we will continue to use luma-only metrics for any non-subjective testing we do.