This study compares the compression performance of four different image formats: JPEG, WebP, JPEG XR, and HEVC-MSP. The latter two formats were chosen because they are frequently discussed as possible JPEG successors. Two different JPEG encoders are tested, libjpeg-turbo and mozjpeg.
It is our intent to only address compression performance in this study. Other technical, legal, and market factors that might be considered when evaluating codecs are outside the scope of this study.
We chose to test with four algorithms: Y-SSIM, RGB-SSIM, MS-SSIM, and PSNR-HVS-M.
All of these algorithms compare two images and return a number indicating the degree to which the second image is similar to the first. In all cases, no matter what the scale, higher numbers indicate a higher degree of similarity.
It's unclear which algorithm is best in terms of human visual perception, so we tested with four of the most respected algorithms.
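To make the shared contract of these metrics concrete, here is a minimal sketch using plain PSNR. PSNR is *not* one of the four metrics in this study (it is a simpler stand-in chosen only for brevity), but it behaves the same way: it compares two same-sized images and returns a single score where higher means more similar.

```python
import numpy as np

def psnr(reference, distorted, max_value=255.0):
    """Peak signal-to-noise ratio between two same-sized images.
    Like the metrics used in this study, it returns one number
    where higher means the distorted image is closer to the
    reference (identical images give infinity)."""
    ref = np.asarray(reference, dtype=np.float64)
    dis = np.asarray(distorted, dtype=np.float64)
    mse = np.mean((ref - dis) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

# Toy 8-bit "images": a flat gray patch and a copy with one pixel off.
ref = np.full((8, 8), 128, dtype=np.uint8)
noisy = ref.copy()
noisy[0, 0] = 130
print(psnr(ref, ref))    # identical -> inf
print(psnr(ref, noisy))  # small distortion -> high score (~60 dB)
```

The four metrics actually used differ in which channels they examine and how they model perception, but all reduce an image pair to one such score.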
All results should be easily reproducible using publicly available tools. The following software is used to generate results for this study:
- libjpeg-turbo to encode and decode JPEG images. This study uses version 1.3.1.
- mozjpeg to encode JPEG images for mozjpeg results. This study uses version 2.0.
- libwebp to encode and decode WebP images. This study uses version 0.4.0.
- jxrlib to encode and decode JPEG XR images. This study uses git revision ccf11047dbec.
- The TAppEncoderStatic and TAppDecoderStatic programs to encode and decode HEVC-MSP images. Both programs are part of the jctvc-hm software package. No wrapper is needed, as these programs accept and output CCIR 601 full-range Y'CbCr 4:2:0. This study uses r4029 of the SVN-based source code.
- ImageMagick, whose identify tool is used to extract width and height information from images and whose convert tool is used to convert between PNG and PPM formats. This study uses version 6.8.9-5.

PNG test images are converted to CCIR 601 full-range Y'CbCr 4:2:0, which is fed directly into the encoders. Each encoder produces an image in its respective format, and we record the size of the resulting encoded file. HEVC-MSP files are penalized 80 bytes per image because HEVC-MSP is a raw bitstream with no container; this penalty approximates the size of container data. The encoded image is then decoded back to CCIR 601 full-range Y'CbCr 4:2:0, the same format the encoder was given. Quality scores are calculated from the Y'CbCr image fed to the encoder and the decoded Y'CbCr image.
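The per-image measurement loop can be sketched as follows. The encode/decode callables here are stand-ins for the real command-line tools; only the control flow and the 80-byte HEVC-MSP container penalty follow the study's methodology, and all names are illustrative rather than taken from the study's scripts.

```python
def measure(yuv420_frame, encode, decode, metric, codec, quality):
    """Return (penalized size in bytes, quality score) for one image
    at one quality level. `yuv420_frame` is the CCIR 601 full-range
    Y'CbCr 4:2:0 data handed to the encoder."""
    bitstream = encode(yuv420_frame, quality)
    # HEVC-MSP is a raw bitstream with no container, so 80 bytes are
    # added to approximate container overhead.
    size = len(bitstream) + (80 if codec == "hevc-msp" else 0)
    decoded = decode(bitstream)
    return size, metric(yuv420_frame, decoded)

# Identity stand-ins for illustration: a 1000-byte "bitstream" for
# HEVC-MSP is recorded as 1080 bytes.
size, score = measure(b"\x00" * 1000, lambda f, q: f, lambda b: b,
                      lambda a, b: float(a == b), "hevc-msp", 50)
print(size, score)  # 1080 1.0
```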
This is done for the top 75% of quality levels available (encoders typically have about 100
possible quality levels). We clip the bottom 25% of quality levels because these are rarely
used and results can be erratic due to heavily distorted images. We also clip a few quality
levels from the top of the HEVC and JPEG XR quality spectrum because they exceed the highest
quality levels that the other encoders are capable of achieving, thus making comparison
impossible. People are not likely to use these quality settings clipped from the upper end of
the quality spectrum either, since selecting a lower setting will result in a visually
indistinguishable image with a much smaller file size. See the rd_collect.py
script for more information on exactly what is clipped.
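The clipping described above can be sketched as a simple selection over a codec's quality scale. This is a simplified illustration, not the actual rd_collect.py code: the `keep_fraction` and `drop_top` parameters are hypothetical names, and the exact levels dropped per codec live in that script.

```python
def clipped_quality_levels(levels, keep_fraction=0.75, drop_top=0):
    """Return the quality levels retained for comparison: the bottom
    (1 - keep_fraction) of the range is dropped, plus optionally a
    few levels from the top (as done for HEVC and JPEG XR)."""
    levels = sorted(levels)
    start = int(len(levels) * (1.0 - keep_fraction))
    end = len(levels) - drop_top
    return levels[start:end]

# A JPEG-style 1..100 scale keeps levels 26 through 100:
print(clipped_quality_levels(range(1, 101))[:3])  # [26, 27, 28]
```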
After collecting encoded image size and quality metrics for each image at each quality
level, we average the file sizes and quality scores across all images. Quality scores are
weighted by pixel count. See the rd_average.py
script for more information.
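The averaging step amounts to the following. This is a simplified sketch of what rd_average.py computes, not its actual code: plain means for file sizes, pixel-count-weighted means for quality scores.

```python
def average_results(results):
    """results: list of (file_size_bytes, quality_score, pixel_count)
    tuples, one per image at a fixed quality level. File sizes are
    averaged plainly; quality scores are weighted by pixel count."""
    total_pixels = sum(p for _, _, p in results)
    mean_size = sum(s for s, _, _ in results) / len(results)
    mean_quality = sum(q * p for _, q, p in results) / total_pixels
    return mean_size, mean_quality

# The larger (300-pixel) image dominates the quality average:
size, quality = average_results([(1000, 40.0, 100), (3000, 50.0, 300)])
print(size, quality)  # 2000.0 47.5
```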
The following zip archive contains textual data (.out) files with the full results for this study. These can be opened with any text editor or imported into most spreadsheet programs.
The goal for this section is to visualize bits per pixel over the full range of quality options. In each graph, the Y axis represents quality and the X axis represents bits per pixel. There is one graph for each image set.
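The X-axis value is computed in the standard way: total coded bits divided by the number of pixels in the image.

```python
def bits_per_pixel(file_size_bytes, width, height):
    """X-axis value for the graphs: coded bits per image pixel."""
    return file_size_bytes * 8 / (width * height)

# A 24,000-byte file for a 512x512 image:
print(bits_per_pixel(24000, 512, 512))  # ~0.73 bpp
```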
Different people and organizations trust different metrics for measuring image quality, and thus compression performance. We included four metrics in this study because existing research has not established which metric best matches human visual perception.
Encoder developers typically use a particular metric, or set of metrics, to guide development. Encoders will typically perform best according to the metric(s) targeted by its developers. Theoretically, most encoders could be tuned to perform better on any particular metric while producing valid (compatible) image files. The mozjpeg encoder actually includes the option to target different metrics, so we can use it to illustrate this.
The default tuning configuration for mozjpeg targets PSNR-HVS-M; this is the default because it performs reasonably well across most metrics. This setting is used in all of the graphs shown earlier in this study.
In the following four graphs, mozjpeg is configured to tune for SSIM instead of PSNR-HVS-M. Notice how mozjpeg performs much better in the following graphs according to Y-SSIM and RGB-SSIM, though at the expense of PSNR-HVS-M and MS-SSIM.
The raw data for mozjpeg targeting the SSIM metric can be downloaded here.
Three of the metrics we use (Y-SSIM, MS-SSIM, and PSNR-HVS-M) measure image quality by considering only a single channel, e.g., luma, chroma-blue, or chroma-red, disregarding the others. The fourth metric we use (RGB-SSIM) combines the results for the luma and chroma planes, but only after color conversion. An issue with codec tuning based on rate-distortion curves for a single-plane metric is that it can produce sub-optimal results. A hypothetical codec that spends bits to improve luma quality and codes no chroma at all would score well on such a metric, because rate is typically computed from the entire coded image size (luma and chroma). Metrics that take both luma and chroma into account do so by averaging the results for each channel, which is problematic because we are not aware of any evidence that such averaging corresponds well with human perception. How to weight the different channels when averaging is something of a guess, and one which could well exaggerate the overall impact of chroma. This is an area we would like to see researched further.
Despite the above, there are several reasons to continue using luma-only metrics to measure codec performance. First, the human visual system is more sensitive to variations in brightness than in color; this is the motivation behind chroma subsampling. Second, in a 4:2:0 image the luma plane accounts for 2/3 of all pixel data, and due to coding techniques it often accounts for more than 2/3 of all bits in the coded image, so a luma-only approach can only be so wrong. Third, despite being globally decorrelated in an image, the chroma channels often correlate well with luma *locally*. Codecs typically exploit this fact, and a side effect is that improvements in luma coding alone can translate into improvements in chroma quality. Finally, it may be tempting to try to partition an image's size into bits that code luma and bits that code chroma. However, modern video codecs use multi-symbol probability models and adapt the context everywhere, so it is impossible to cleanly separate out a bit used just to code a luma value: its value influences the cost of coding chroma values and vice versa. So we are stuck with total bits per pixel as the most accurate measure of overall rate.
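The 2/3 figure for luma is simple arithmetic: in 4:2:0, each chroma plane is subsampled by 2 in both dimensions, so the two chroma planes together hold half as many samples as the luma plane.

```python
# Sample counts for a 4:2:0 image: chroma is subsampled 2x in each
# dimension, so luma contributes two thirds of all samples.
w, h = 512, 512
luma = w * h                      # 262144 luma samples
chroma = 2 * (w // 2) * (h // 2)  # Cb + Cr = 131072 samples
print(luma / (luma + chroma))     # 0.666... = 2/3
```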
Until research shows how to combine quality scores from the three (or more if you include alpha) image channels into a single useful value, we will continue to use luma-only metrics for any non-subjective testing we do.