Lossy Compressed Image Formats Study

Mozilla Corporation, July 2014

Introduction

This study compares the compression performance of four image formats: JPEG, WebP, JPEG XR, and HEVC-MSP. The latter two were chosen because they are frequently discussed as possible JPEG successors. Two JPEG encoders are tested: libjpeg-turbo and mozjpeg.

This study addresses compression performance only. Other technical, legal, and market factors that might be considered when evaluating codecs are outside its scope.

Quality Metrics

We chose to test with four algorithms:

  1. Y-SSIM: SSIM applied to the luma channel only.
  2. RGB-SSIM: SSIM averaged over the R, G, and B channels after color conversion.
  3. MS-SSIM: multi-scale SSIM, also applied to the luma channel.
  4. PSNR-HVS-M: PSNR adjusted for properties of the human visual system.

All of these algorithms compare two images and return a number indicating how similar the second image is to the first. In every case, regardless of scale, higher numbers indicate greater similarity.

It's unclear which algorithm best matches human visual perception, so we tested with four of the most respected algorithms.
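
As an illustration of what these scores look like in practice, here is a minimal luma-only SSIM comparison using scikit-image. This is an illustrative sketch, not the scoring code used for this study.

    import numpy as np
    from PIL import Image
    from skimage.metrics import structural_similarity

    def y_ssim(path_a, path_b):
        # Compare the luma planes of two same-sized images (PIL's "L"
        # mode is an ITU-R 601 luma approximation). Higher scores mean
        # the second image is more similar to the first.
        y_a = np.asarray(Image.open(path_a).convert("L"))
        y_b = np.asarray(Image.open(path_b).convert("L"))
        return structural_similarity(y_a, y_b, data_range=255)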

Image Sets

  1. Wikipedia: 49 images of varying sizes, downsampled from high-resolution photos on Wikipedia.
  2. Kodak: 24 PNG images from the Kodak Lossless True Color Image Suite.
  3. Tecnick: 100 PNG images from Tecnick's public test images. Images used are the original size RGB color images.

Methodology

All results should be easily reproducible using publicly available tools. Results for this study are generated with the following software: the two JPEG encoders named above (libjpeg-turbo and mozjpeg), encoders and decoders for WebP, JPEG XR, and HEVC-MSP, and the rd_collect.py and rd_average.py scripts referenced below.

PNG test images are converted to CCIR 601 full-range Y'CbCr 4:2:0, which is fed directly into the encoders. Each encoder produces an image in its respective format, and we record the size of the resulting encoded file. HEVC-MSP files are penalized 80 bytes per file because HEVC-MSP is a bare bitstream with no container; this penalty approximates the size of container data. The encoded image is then decoded back to CCIR 601 full-range Y'CbCr 4:2:0, the same format the encoder was given. Quality scores are calculated by comparing the Y'CbCr image fed to the encoder with the decoded Y'CbCr image.
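
The per-image loop can be summarized with the sketch below. The encode_file, decode_file, and compare functions are hypothetical stand-ins for the encoder, decoder, and metric tools; only the 80-byte HEVC-MSP penalty is taken directly from the description above.

    import os

    HEVC_CONTAINER_PENALTY = 80  # bytes; approximates missing container data

    # Hypothetical stand-ins for the real tools; each would shell out to
    # the relevant encoder, decoder, or metric binary.
    def encode_file(codec, yuv_path, quality): ...
    def decode_file(codec, encoded_path): ...
    def compare(yuv_a, yuv_b): ...

    def measure_one(yuv_path, codec, quality):
        encoded_path = encode_file(codec, yuv_path, quality)
        size = os.path.getsize(encoded_path)
        if codec == "hevc-msp":
            size += HEVC_CONTAINER_PENALTY  # HEVC-MSP is a bare bitstream
        decoded_path = decode_file(codec, encoded_path)  # back to Y'CbCr 4:2:0
        return size, compare(yuv_path, decoded_path)     # bytes, quality score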

This is done for the top 75% of the quality levels available (encoders typically have about 100 possible quality levels). We clip the bottom 25% of quality levels because they are rarely used and results can be erratic due to heavily distorted images. We also clip a few quality levels from the top of the HEVC and JPEG XR quality range because they exceed the highest quality the other encoders can achieve, making comparison impossible. People are also unlikely to use the settings clipped from the upper end of the quality spectrum, since selecting a lower setting yields a visually indistinguishable image at a much smaller file size. See the rd_collect.py script for the details of exactly what is clipped.

After collecting encoded image size and quality metrics for each image at each quality level, we average the file sizes and quality scores across all images. Quality scores are weighted by pixel count. See the rd_average.py script for more information.
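
As a sketch of that weighting step (the record layout here is illustrative, not rd_average.py's actual format):

    def average_quality(records):
        # records: list of (file_size_bytes, quality_score, pixel_count)
        # tuples, one per image at a given quality level.
        total_pixels = sum(p for _, _, p in records)
        mean_size = sum(s for s, _, _ in records) / len(records)
        # Weight each image's quality score by its pixel count.
        mean_quality = sum(q * p for _, q, p in records) / total_pixels
        return mean_size, mean_quality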

Results (Raw Data)

The following zip archive contains textual data (.out) files with the full results for this study. These can be opened with any text editor or imported into most spreadsheet programs.

Bits per Pixel at Equivalent Quality According to PSNR-HVS-M

The goal for this section is to visualize bits per pixel over the full range of quality options. In each graph, the Y axis represents quality and the X axis represents bits per pixel. There is one graph for each image set.

Graph 1: Average for Wikipedia image set, PSNR-HVS-M quality metric, left is better

Graph 2: Average for Kodak image set, PSNR-HVS-M quality metric, left is better

Graph 3: Average for Tecnick image set, PSNR-HVS-M quality metric, left is better

Bits per Pixel at Equivalent Quality According to Y-SSIM

The goal for this section is to visualize bits per pixel over the full range of quality options. In each graph, the Y axis represents quality and the X axis represents bits per pixel. There is one graph for each image set.

Graph 1: Average for Wikipedia image set, Y-SSIM quality metric, left is better

Graph 2: Average for Kodak image set, Y-SSIM quality metric, left is better

Graph 3: Average for Tecnick image set, Y-SSIM quality metric, left is better

Bits per Pixel at Equivalent Quality According to MS-SSIM

The goal for this section is to visualize bits per pixel over the full range of quality options. In each graph, the Y axis represents quality and the X axis represents bits per pixel. There is one graph for each image set.

Graph 1: Average for Wikipedia image set, MS-SSIM quality metric, left is better

Graph 2: Average for Kodak image set, MS-SSIM quality metric, left is better

Graph 3: Average for Tecnick image set, MS-SSIM quality metric, left is better

Bits per Pixel at Equivalent Quality According to RGB-SSIM

The goal for this section is to visualize bits per pixel over the full range of quality options. In each graph, the Y axis represents quality and the X axis represents bits per pixel. There is one graph for each image set.

Graph 1: Average for Wikipedia image set, RGB-SSIM quality metric, left is better

Graph 2: Average for Kodak image set, RGB-SSIM quality metric, left is better

Graph 3: Average for Tecnick image set, RGB-SSIM quality metric, left is better

Note on Tuning for Metrics

Different people and organizations trust different metrics for measuring image quality, and thus compression performance. We included four metrics in this study because existing research has not established which metric best matches human visual perception.

Encoder developers typically use a particular metric, or set of metrics, to guide development, and an encoder will typically perform best according to the metric(s) its developers target. In principle, most encoders could be tuned to perform better on any particular metric while still producing valid (compatible) image files. The mozjpeg encoder includes options to target different metrics, so we can use it to illustrate this.

The default tuning for mozjpeg targets PSNR-HVS-M, because mozjpeg performs reasonably well on most metrics with this setting. This default is used in all of the graphs shown earlier in this study.
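
Switching the tuning target is a matter of passing a different switch to mozjpeg's cjpeg. The sketch below assumes the switch is named -tune-ssim; verify the switch name against the usage output of the cjpeg build in use.

    import subprocess

    def encode_ssim_tuned(src_ppm, dst_jpg, quality):
        # Assumes mozjpeg's cjpeg with a -tune-ssim switch (an assumption;
        # check your build) instead of the default PSNR-HVS-M tuning.
        subprocess.run(
            ["cjpeg", "-quality", str(quality), "-tune-ssim",
             "-outfile", dst_jpg, src_ppm],
            check=True,
        )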

In the following four graphs, mozjpeg is configured to tune for SSIM instead of PSNR-HVS-M. Notice how mozjpeg performs much better in the following graphs according to Y-SSIM and RGB-SSIM, though at the expense of PSNR-HVS-M and MS-SSIM.

Graph 1: Average for Kodak image set, PSNR-HVS-M quality metric, mozjpeg tuned for SSIM metric, left is better

Graph 2: Average for Kodak image set, Y-SSIM quality metric, mozjpeg tuned for SSIM metric, left is better

Graph 3: Average for Kodak image set, MS-SSIM quality metric, mozjpeg tuned for SSIM metric, left is better

Graph 4: Average for Kodak image set, RGB-SSIM quality metric, mozjpeg tuned for SSIM metric, left is better

The raw data for mozjpeg targeting the SSIM metric can be downloaded here.

Note on Luma-only Metrics

Three of the metrics we use (Y-SSIM, MS-SSIM, and PSNR-HVS-M) measure image quality on a single channel only (luma in this study), disregarding the others. The fourth metric (RGB-SSIM) combines results for the luma and chroma planes, but only after color conversion. An issue with codec tuning based on rate-distortion curves for a single-plane metric is that it can produce sub-optimal results: a hypothetical codec that spends bits to improve luma quality and codes no chroma at all would score well, because rate is typically computed from the entire coded image size (luma and chroma). Metrics that take both luma and chroma into account do so by averaging the results for each channel, which is problematic because we are not aware of any evidence that such averaging corresponds well with human perception. How to weight the different channels when averaging is something of a guess, one which could well exaggerate the overall impact of chroma. This is an area we would like to see researched further.
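
To make the weighting guess concrete: a combined score is just a weighted sum of per-channel scores, and the weights below are arbitrary placeholders rather than perceptually validated values.

    def combined_score(y, cb, cr, weights=(0.8, 0.1, 0.1)):
        # The weights are arbitrary placeholders; as noted above, we know
        # of no evidence that any particular choice matches human
        # perception.
        wy, wcb, wcr = weights
        return wy * y + wcb * cb + wcr * cr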

Despite the above, there are several reasons to continue using luma-only metrics to measure codec performance. First, the human visual system is more sensitive to variations in brightness than in color; this is the motivation behind chroma sub-sampling. Second, in a 4:2:0 image the luma plane accounts for 2/3 of all pixel data, and due to coding techniques it often accounts for more than 2/3 of all bits per image, so a luma-only approach can only be so wrong. Third, despite being globally decorrelated across an image, the chroma channels often correlate well with luma *locally*. Codecs typically exploit this fact, and a side effect is that improvements in luma coding alone can translate into improvements in chroma quality. Finally, it may be tempting to partition an image's size into bits that code luma and bits that code chroma. However, modern video codecs use multi-symbol probability models and adapt the context everywhere; it is impossible to causally separate out a bit that codes only a luma value, because its value influences the cost of coding chroma values and vice versa. So we are stuck with total bits per pixel as the most accurate measure of overall rate.
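
The 2/3 share of samples and the total-rate measure both follow from simple arithmetic, sketched below.

    def bits_per_pixel(file_size_bytes, width, height):
        # Overall rate: total coded bits divided by the image's pixel
        # count, since luma and chroma bits cannot be cleanly separated.
        return 8.0 * file_size_bytes / (width * height)

    def luma_sample_share(width, height):
        # In 4:2:0, Cb and Cr are each subsampled 2x2, so luma samples
        # make up w*h of the w*h*3/2 total: exactly 2/3 for even dimensions.
        luma = width * height
        chroma = 2 * (width // 2) * (height // 2)
        return luma / (luma + chroma)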

Until research shows how to combine quality scores from the three (or more if you include alpha) image channels into a single useful value, we will continue to use luma-only metrics for any non-subjective testing we do.

Bibliography and Relevant Reading

  1. WebP Compression Study. Google.
  2. Structural similarity. Wikipedia.
  3. The SSIM Index for Image Quality Assessment.
  4. Multi-scale Structural Similarity for Image Quality Assessment.
  5. Nikolay Ponomarenko homepage - PSNR-HVS-M download page.
  6. JPEGXR updates. Matt Uyttendaele (Microsoft).

Contributors