Fast JPEG resize on the video card
In image-processing applications, resizing JPEG-compressed images is a very common task. A JPEG cannot be resized directly: the original data must be decoded first. There is nothing complicated or new in this, but when it has to be done many millions of times a day, the performance of the solution becomes especially important.
This task often arises when building remote image storage, since most cameras and phones shoot in JPEG. The photo archives of leading web services (social networks, forums, photo hosting sites and many others) grow every day by a significant number of such images, so the question of how to store them is extremely important. To reduce outgoing traffic and improve response time, many web services store dozens of files for each image in different resolutions. The response is fast, but these copies take up a lot of space. This is the main problem, although the approach has other disadvantages as well.
The idea is not to store many versions of the original image on the server in different resolutions, but to create the picture with the requested dimensions dynamically from a previously prepared original, and to do it as quickly as possible. This way an image of the desired resolution can be created in real time and sent to the user immediately. It is important that the image is produced at exactly the required resolution, so that the user's device does not have to rescale it for the screen.
Using a format other than JPEG as the basis for such an image storage does not seem justified. Of course, there are standard, widely used formats that give better compression at similar quality (JPEG2000, WebP), but encoding and decoding such images is much slower than JPEG, so it makes sense to choose JPEG as the base format for storing the original photos, which are then scaled in real time when a user request arrives.
Of course, besides JPEGs, sites often contain images in PNG and GIF formats, but their share is usually small, and photos are very rarely stored in these formats. In most cases, therefore, these formats do not significantly affect the problem under consideration.
Description of the on-the-fly resize algorithm
So, the input data are JPEG files. To achieve fast decoding (this is true for both CPU and GPU), the compressed images should contain restart markers. These markers are described in the JPEG standard; some codecs make use of them, and the rest simply ignore them. If a JPEG has no such markers, they can be added in advance with the jpegtran utility. Adding markers does not change the image, but the file becomes slightly larger. As a result, we get the following scheme of work:
1. Get the image data from host (CPU) memory
2. If there is a color profile, read it from the EXIF section and save it
3. Copy the picture to the video card
4. Decode the JPEG
5. Resize (downscale) with the Lanczos algorithm
6. Apply sharpening
7. Encode the image to JPEG
8. Copy the image to the host
9. Add the original color profile to the resulting file
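The Lanczos downscale step in the scheme above can be illustrated with a small CPU-side reference. This is only a sketch in pure Python, not the GPU implementation the article describes: the kernel radius a=3, the way the kernel is stretched when shrinking, and all function names are assumptions chosen for illustration.

```python
import math

def lanczos(x, a=3):
    """Lanczos kernel: sinc(x) * sinc(x / a) for |x| < a, otherwise 0."""
    if x == 0.0:
        return 1.0
    if abs(x) >= a:
        return 0.0
    px = math.pi * x
    return a * math.sin(px) * math.sin(px / a) / (px * px)

def resize_row(row, new_len, a=3):
    """Resample a 1-D list of samples to new_len values."""
    scale = len(row) / new_len            # > 1 when shrinking
    stretch = max(scale, 1.0)             # widen the kernel when shrinking (assumption)
    support = a * stretch
    out = []
    for i in range(new_len):
        center = (i + 0.5) * scale        # output pixel centre in source coordinates
        lo = max(0, math.floor(center - support))
        hi = min(len(row), math.ceil(center + support))
        weights = [lanczos((j + 0.5 - center) / stretch, a) for j in range(lo, hi)]
        total = sum(weights)              # normalise so flat areas stay flat
        out.append(sum(w * row[j] for w, j in zip(weights, range(lo, hi))) / total)
    return out

def resize_image(img, new_w, new_h):
    """Separable resize of a 2-D list (one channel): rows first, then columns."""
    rows = [resize_row(r, new_w) for r in img]
    cols = [resize_row(list(c), new_h) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```

Running the filter separately along rows and columns is what keeps the cost manageable; a GPU version would typically apply the same two passes per colour channel.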
A more accurate variant is also possible: before the resize, inverse gamma is applied to each pixel component so that the resize is done in linear space, and then gamma is applied again, but after the sharpening. The visible difference for the user is small, but it exists, and the computational cost of this modification is minimal. It is only necessary to insert the inverse and forward gamma steps into the overall processing scheme.
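To show why the gamma steps matter, here is a minimal sketch of averaging two pixel values in linear space. The article only says "gamma"; the use of the standard sRGB transfer function below is an assumption, and the function names are hypothetical. Values are normalised to the range 0..1.

```python
def srgb_to_linear(v):
    """Inverse gamma: sRGB-encoded value (0..1) to linear light."""
    if v <= 0.04045:
        return v / 12.92
    return ((v + 0.055) / 1.055) ** 2.4

def linear_to_srgb(v):
    """Forward gamma: linear light (0..1) back to an sRGB-encoded value."""
    if v <= 0.0031308:
        return v * 12.92
    return 1.055 * v ** (1 / 2.4) - 0.055

def average_in_linear_space(a, b):
    """Average two sRGB-encoded values the way a linear-space resize would."""
    lin = (srgb_to_linear(a) + srgb_to_linear(b)) / 2
    return linear_to_srgb(lin)
```

Averaging a black and a white pixel directly gives 0.5, while `average_in_linear_space(0.0, 1.0)` gives roughly 0.735, which is the kind of small but visible difference the paragraph above refers to.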
Another option is to decode the JPEGs on a multi-core CPU using the libjpeg-turbo library. In this case each picture is decoded in a separate CPU thread, and all other operations are performed on the video card. With a large number of CPU cores the throughput can be even higher, but latency suffers considerably. If the latency of decoding a JPEG on one CPU core is acceptable, this option can be very fast, especially when the source JPEGs have a small resolution. As the resolution of the original image grows, so does the decoding time in a single CPU thread, so this option is suitable only for small resolutions.
Basic requirements for the web resize task
- It is desirable not to store dozens of copies of each image in different resolutions on the server, but to create the picture with the correct resolution quickly, as soon as the request is received. This reduces the size of the storage.
- The task should be solved as quickly as possible. This determines the quality of the service in terms of response time to the user's request.
- The quality of the delivered image must be high.
- The file size of the delivered image should be as small as possible, and its resolution should exactly match the size of the window in which it is displayed. The following points are important here:
a). If the image size does not match the window size, the user's device (phone, tablet, laptop) performs a hardware resize after decoding, before drawing the picture on the screen. In OpenGL this hardware resize is done only with a bilinear algorithm, which often causes moiré and other artifacts on images with fine details.
b). The on-screen resize consumes additional energy on the device.
c). If a series of pre-scaled images is used, it is not always possible to hit the required size exactly, which means a picture with a larger resolution has to be sent. The larger image means more traffic, which is also best avoided.
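The traffic cost of a fixed set of pre-scaled copies, mentioned in the last point above, can be estimated with a few lines of arithmetic. The set of stored widths and the function names are hypothetical, and the sketch assumes the aspect ratio is the same for every copy.

```python
def pick_prescaled(requested_width, ladder):
    """Smallest stored width that is >= the request, or the largest one available."""
    candidates = [w for w in sorted(ladder) if w >= requested_width]
    return candidates[0] if candidates else max(ladder)

def pixel_overhead(requested_width, ladder):
    """Ratio of pixels sent to pixels actually needed (constant aspect ratio assumed)."""
    sent = pick_prescaled(requested_width, ladder)
    return (sent / requested_width) ** 2
```

For example, with stored widths of 320, 640, 960 and 1280 pixels, a 700-pixel window is served the 960-pixel copy, which is about 1.88 times more pixels than needed.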
Description of the general scheme of work
1. We receive images from users in any format and at any resolution. The originals are stored in a separate database (if necessary).
2. Offline, using ImageMagick, OpenCV or similar software, we save the color profile, convert the originals to a standard format (BMP or PPM), resize them to 1K or 2K resolution and compress them to JPEG; then we use the jpegtran utility to add restart markers with a fixed interval.
3. We create a database of these 1K or 2K images.
4. When a request arrives from the user, we get information about the picture and the size of the window in which the image is to be displayed.
5. We find the image in the database and send it to the resizer.
6. The resizer receives the image file, decodes it, resizes, sharpens, encodes and inserts the original color profile into the resulting JPEG. It then hands the picture over to the external program.
7. Several threads can be run on each video card, and several video cards can be installed in one computer, so performance scales.
8. All this can be built on NVIDIA Tesla graphics cards (for example, P40 or V100), since NVIDIA GeForce cards are not designed for continuous long-term operation and NVIDIA Quadro cards have many video outputs that are not needed here. The GPU memory requirements for this task are minimal.
9. In addition, a cache for frequently used files can be allocated dynamically on top of the database of prepared images. It makes sense to keep there the images that were requested most often according to the statistics of the previous period.
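The offline preparation in step 2 ends with a jpegtran call that adds restart markers. A minimal sketch of how that call might be scripted is shown below. The switches `-restart`, `-copy all` and `-outfile` are standard jpegtran options, but the helper names and the interval value are assumptions: the article does not state which interval it uses.

```python
import subprocess

def jpegtran_restart_command(src, dst, interval_blocks):
    """Build a jpegtran command that losslessly rewrites src with restart markers.

    interval_blocks is the restart interval in MCU blocks (the 'B' suffix);
    '-copy all' keeps extra markers such as metadata in the output file.
    """
    return ["jpegtran", "-copy", "all",
            "-restart", f"{interval_blocks}B",
            "-outfile", dst, src]

def add_restart_markers(src, dst, interval_blocks=8):
    """Run jpegtran; requires the utility to be installed and on PATH."""
    subprocess.run(jpegtran_restart_command(src, dst, interval_blocks), check=True)
```

Because jpegtran works on the already quantized coefficients, this step does not recompress the picture; only the file size grows slightly.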
Parameters of the program
1. The width and height of the new image. They can be arbitrary and are best set explicitly.
2. JPEG subsampling (decimation) mode. There are three options: 4:2:0, 4:2:2 and 4:4:4, but 4:4:4 or 4:2:0 are the ones normally used. 4:4:4 gives the maximum quality, 4:2:0 the minimum frame size. Subsampling is applied to the color-difference components, which human vision perceives less precisely than luminance. For each subsampling mode there is an optimal restart-marker interval that gives the maximum encoding or decoding speed.
3. JPEG compression quality and subsampling mode used when creating the image database.
4. Sharpening is done in a 3x3 window; sigma (radius) can be adjusted.
5. JPEG compression quality and subsampling mode used when encoding the final picture. A quality of at least 90% is usually considered "visually lossless", i.e. an untrained user should not see JPEG artifacts under standard viewing conditions. It is believed that a trained user needs 93-95%. The larger this value, the larger the frame sent to the user and the longer the decoding and encoding times.
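The sharpening parameter in item 4 can be made concrete with a small sketch. The article does not say which filter is used; the code below assumes an unsharp mask built on a Gaussian blur in a 3x3 window, with sigma controlling the blur radius. The `amount` parameter and the function names are hypothetical.

```python
import math

def gaussian_kernel_3x3(sigma):
    """Normalised 3x3 Gaussian kernel for the given sigma."""
    raw = [[math.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma))
            for dx in (-1, 0, 1)] for dy in (-1, 0, 1)]
    total = sum(map(sum, raw))
    return [[v / total for v in row] for row in raw]

def unsharp_pixel(window, sigma, amount=1.0):
    """Sharpen the centre of a 3x3 window: centre + amount * (centre - blurred)."""
    k = gaussian_kernel_3x3(sigma)
    blurred = sum(k[y][x] * window[y][x] for y in range(3) for x in range(3))
    centre = window[1][1]
    return centre + amount * (centre - blurred)
```

A flat area is left unchanged, while a pixel that is brighter than its neighbours is pushed further up, which is what makes edges look crisper after downscaling.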
Restart markers. A JPEG can be decoded quickly on the video card only if it contains restart markers. These markers are described in the official JPEG standard and are a standard parameter. Without restart markers, decoding on the video card cannot be parallelized, and the decoder becomes very slow. This is why a database of prepared images containing these markers is needed.
Fixed image codec. Decoding and encoding images with the JPEG algorithm is by far the fastest option.
The resolution of the images in the prepared database can be arbitrary; we consider 1K and 2K (4K is also possible). Images can be enlarged as well as reduced when resizing.
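Whether a file already satisfies the restart-marker requirement above can be checked by looking for the DRI (Define Restart Interval, 0xFFDD) segment in its header. The function below is a simplified sketch, not a full JPEG parser: it only walks the segments that precede the first scan, and its name is hypothetical.

```python
import struct

def restart_interval(data):
    """Return the restart interval in MCUs from the DRI segment, or 0 if none is found."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xFF:                 # fill byte before a marker, skip it
            pos += 1
            continue
        if marker == 0xDA:                 # SOS: compressed data starts, stop scanning
            break
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker == 0xDD:                 # DRI: 2-byte length, then 2-byte interval
            (interval,) = struct.unpack(">H", data[pos + 4:pos + 6])
            return interval
        pos += 2 + length                  # jump to the next segment
    return 0
```

Typical usage would be `restart_interval(open(path, "rb").read())`; a result of 0 means the file should go through the marker-adding step before it is placed in the prepared database.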
Performance of the fast resize
We tested the fast resize application from the Fastvideo SDK on an NVIDIA Tesla V100 graphics card (OS Windows Server 201? 64-bit, driver ???.9826) using the 24-bit images 1k_wild.ppm and 2k_wild.ppm with resolutions of 1K and 2K (1280x720 and 1920x1080). The tests were run with different numbers of threads on the same graphics card. Each thread requires no more than 110 MB of video memory, so 4 threads need no more than 440 MB.
First we compress the original images to JPEG with a quality of 90% and 4:2:0 or 4:4:4 subsampling. Then we decode them, downscale by a factor of 2 in width and height, sharpen, and encode again with a quality of 90% in 4:2:0 or 4:4:4. The source data is in RAM, and the final image is placed there as well.
The time is measured from the start of loading the original picture from RAM until the processed image is stored in RAM. Program initialization and memory allocation on the video card are not included in the measurements.
Example command line for a 24-bit 1K image
PhotoHostingSample.exe -i 1k_wild.???.jpg -o 1k_wild.640.jpg -outputWidth 640 -q 90 -s 444 -sharp_after ??? -repeat 200
Benchmark for processing one 1K image in one thread
Decoding (including sending data to the video card): ??? msec
Resize by a factor of 2 (in width and height): ??? msec
Sharpening: ??? msec
JPEG encoding (including data transfer from the video card): ??? msec
Total time per frame: 1.2 msec
Performance for 1K
Frame rate (Hz), 4:4:4 / 4:2:0
993 / 831
Performance for 2K
Frame rate (Hz), 4:4:4 / 4:2:0
With 4:2:0 subsampling of the original image the processing speed is lower, but the source and destination files are smaller. When switching to 4:2:0, the degree of parallelism drops by a factor of 4, because a 16x16 block is now treated as a single unit, so this mode is slower than 4:4:4.
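The factor of 4 follows from the size of the minimum coded unit (MCU) in each mode. The short calculation below uses the usual MCU sizes for baseline JPEG (8x8 pixels for 4:4:4, 16x8 for 4:2:2, 16x16 for 4:2:0); the function name is hypothetical, and the actual number of independently decodable chunks also depends on the restart interval.

```python
import math

# MCU size in pixels (width, height) for the common subsampling modes
MCU_SIZE = {"4:4:4": (8, 8), "4:2:2": (16, 8), "4:2:0": (16, 16)}

def mcu_count(width, height, subsampling):
    """Number of MCUs needed to cover an image of the given size."""
    mcu_w, mcu_h = MCU_SIZE[subsampling]
    return math.ceil(width / mcu_w) * math.ceil(height / mcu_h)
```

For a 1280x720 image this gives 14400 MCUs in 4:4:4 and 3600 in 4:2:0, exactly four times fewer units to distribute across GPU threads. For 1920x1080 the figures are 32400 and 8160.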
Performance is determined mainly by the JPEG decoding stage, because at this stage the picture has its maximum resolution and the computational complexity is higher than at any other stage.
The tests showed that on the NVIDIA Tesla V100 the processing speed for 1K and 2K images peaks when 2-4 threads run simultaneously, and reaches 800 to 1000 frames per second on one video card. Processing 1K pictures is faster than 2K, and working with 4:2:0 images is always slower than with 4:4:4. To obtain final performance figures, all program parameters have to be fixed precisely and the program optimized for the specific video card model.
A latency of about one millisecond is a good result. As far as we know, such latency cannot be achieved for a similar resize task on the CPU (even without JPEG encoding and decoding), which is another important argument for using video cards in high-performance image processing.
Processing one billion JPEGs per day at 1K or 2K resolution may take up to 16 NVIDIA Tesla V100 graphics cards. Some of our customers already use this solution; others are testing it on their own tasks.
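The card count can be cross-checked against the measured 800 to 1000 frames per second per card. The sketch below assumes a perfectly even load over 24 hours, which is a simplification; the function name is hypothetical.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60

def cards_needed(images_per_day, fps_per_card):
    """Number of cards for a given daily volume at a sustained per-card frame rate."""
    required_fps = images_per_day / SECONDS_PER_DAY
    return math.ceil(required_fps / fps_per_card)
```

One billion images per day is about 11,574 images per second, so `cards_needed(1e9, 800)` gives 15 and `cards_needed(1e9, 1000)` gives 12. A figure of up to 16 cards therefore leaves a small reserve for uneven load.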
Resizing JPEGs on a video card can be useful not only for web services. There are a great many high-performance imaging applications where such functionality is in demand. For example, a fast resize is needed in virtually any pipeline that processes images from cameras before displaying the picture on a monitor. The solution can run under Windows or Linux on any NVIDIA GPU: Tegra K1/X1/X2/Xavier, GeForce GT/GTX/RTX, Quadro, Tesla.
Advantages of the solution with fast resize on the video card
- Significant reduction of the storage size for source images
- Lower initial infrastructure costs (hardware and software)
- Improvement of service quality due to a short response time
- Reducing the outgoing traffic
- Less energy consumption on users' devices
- Reliability and speed of the presented solution, which has already been tested on huge data sets
- Shorter time to market for such applications on Linux and Windows
- Scalability of the solution, which can work both on a single video card and as part of a
- Fast return on investment for such projects
Who may be interested
A library for fast JPEG resizing can be used in high-load web services, large online stores, social networks, online photo management systems, e-commerce, and virtually any large enterprise management software.
Software developers can use this library, which provides latency of the order of several milliseconds when resizing 1K, 2K and 4K JPEGs on the video card.
This approach may also turn out to be faster than the NVIDIA DALI solution for fast JPEG decoding, resizing and image preparation when training neural networks for Deep Learning.
What else can be done
In addition to resizing and sharpening, the existing algorithm can be extended with crop, rotation by 90/180/270 degrees, watermarking, and brightness and contrast control.
Optimization of the solution for NVIDIA Tesla P40 and V100 graphics cards.
Additional optimization of the performance of the JPEG decoder.
Batch mode for decoding JPEGs on the video card.