General overview of image and mobile video processing

The explosion of social media is obvious in today’s society: from entertainment to politics, from the private to the public sector, images and videos can be found everywhere. Media-oriented social networks like Snapchat and Instagram are on the rise, and so is users’ preference for enhancing their images and videos with various visual effects. Because of this, mobile video processing has become a required feature for almost any social media app. An image is worth 1000 words, and a video is worth, on average, 30 images per second.

HyperSense has been developing mobile video processing apps for iOS and Android since 2012. Our first application of this type was iFlipBook. Since then, we have expanded our area of expertise, and in 2015 we launched MyBlender, a mobile app for both iOS and Android with the processing power of a professional tool (MyBlender started out with the goal of building a mobile video processing app capable of applying the same effects as Adobe Premiere). If you are curious to see MyBlender’s capabilities, please check out the YouTube playlist below or download the iOS app from the App Store:

MyBlender – Mobile video processing app – Video demos 

In this article we present a general overview of developing a mobile video processing application. If you have any questions, please contact us and we will be happy to respond.

Image processing

Digital image processing involves applying sets of algorithms that change the pixels of an image. Video processing is very similar, as it applies the same algorithms to each frame of the video. There are several processing phases that can be applied to images or videos: the purpose of some is optimisation, while others add the visual effects. The media resources can then be encoded and compressed to become accessible on mobile devices, desktops, wearables and so on.

The general term used for a set of algorithms that modifies an image is Filter. A filter changes the information in an image based on a given algorithm. So, basically, you have an image and a modification rule. But how does this work?

To simplify, an image is a matrix of pixels, each pixel having a specific colour. Due to the limitations of human sight, when the pixel density is high enough the pixels blend into a continuous image. If our eyes could zoom in, the individual pixels would become evident.

To better understand filters, imagine that a filter receives a set of pixels and, after processing, outputs a new pixel.

Filters can be applied at both the CPU and GPU level, but the fastest ones are applied by the GPU. These GPU filters are written in a dedicated language (GLSL, in the examples below) and are called shaders. There are limits to the number of operations permitted, and most shaders are intended to be fast: if a shader can’t be used in real time, its value drops.

There are two aspects of a pixel that can be modified when processing an image:

  • pixel’s colour
  • pixel’s position

Modifying the pixel’s position often results in rotations (2D or 3D), scale changes and translations; you can also create image puzzles. This type of change is handled by the vertex shader and will be referred to as Vertex. Modifying the pixel’s colour results in grey images, colour overlays, contrast changes, highlights, etc. These modifications are handled by the fragment shader and will be referred to as Fragments.

A simple vertex is:

attribute vec4 position;               // incoming vertex position
attribute vec2 inputTextureCoordinate; // incoming texture coordinate
varying vec2 textureCoordinate;        // interpolated and passed on to the fragment shader
void main()
{
    gl_Position = position;                     // keep the original position
    textureCoordinate = inputTextureCoordinate; // forward the texture coordinate unchanged
}

This doesn’t change the coordinates of the pixels, leaving each pixel in its original position.

The simple pass-through vertex, combined with the following fragment:

precision highp float;
varying lowp vec2 textureCoordinate;         // texture coordinate received from the vertex shader
uniform sampler2D srcTexture1;               // the input image
const vec3 W = vec3(0.2125, 0.7154, 0.0721); // luminance weights for the red, green and blue channels
void main()
{
    vec4 textureColor = texture2D(srcTexture1, textureCoordinate);
    float luminance = dot(textureColor.rgb, W); // weighted sum of the colour channels

    gl_FragColor = vec4(vec3(luminance), textureColor.a);
}

Will result in a black and white version of the original image.

More complex shaders can use several image inputs. For example, you can overlay two images or stitch images one next to the other. The principle remains the same: the vertex dictates the position of each pixel, while the fragment specifies its colour.

For example, if we want to add a sticker on top of an image we must:

  • make sure the images have the same size or set the position of the sticker on the larger image;
  • leave each pixel in its original position using a variation of the vertex above that has 2 inputs (one for each image);
  • set the colour merge of the pixels using the sticker’s alpha as the blending weight: original_rgb * (1-sticker_alpha) + sticker_rgb * sticker_alpha (see the sketch after this list).
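Below is a minimal sketch of the fragment that performs this blend, assuming the two images have the same size and are bound as srcTexture1 (the original) and srcTexture2 (the sticker); the uniform names are illustrative, not taken from a particular library.

precision highp float;
varying highp vec2 textureCoordinate;
uniform sampler2D srcTexture1; // the original image
uniform sampler2D srcTexture2; // the sticker, assumed to contain an alpha channel

void main()
{
    vec4 original = texture2D(srcTexture1, textureCoordinate);
    vec4 sticker = texture2D(srcTexture2, textureCoordinate);

    // original_rgb * (1 - sticker_alpha) + sticker_rgb * sticker_alpha
    vec3 blended = original.rgb * (1.0 - sticker.a) + sticker.rgb * sticker.a;

    gl_FragColor = vec4(blended, original.a);
}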

Once you understand image processing, you can move on to processing videos. In the end video processing is just processing more images (if you ignore video compression, memory management, multithreading, working with camera framebuffers, etc.).

Video processing

A video is a sequence of images shown in quick succession. Because each image persists on the retina for a short while, changing the images at a sufficient speed creates the impression of motion. On average, if a human is shown 30 images per second, the brain will interpret them as continuous motion.

Videos could simply be stored as individual images, shown in order, but this approach would occupy a lot of space. Video encoding reduces the space required for storage, which also means that a video needs to be decoded before it can be processed.

Similar to image processing, video processing can have one or more inputs. The diagram above doesn’t cover all the possibilities of video processing: it can, for example, produce a black and white video, but that video would be mute.

Adding sound will change the diagram to:

The most common sound filters are volume adjustments.

Several additional sound sources can be added. For example, you can add background music.

There is one other important aspect when processing video: timing. For example, two videos can be combined in several different ways:

  • video1-video2 (played one after the other);
  • video1-overlap-video2 (the two videos overlap during the transition);
  • video1+video2 (both played at the same time).

Timing also applies to the videos’ sound. For example, during the overlap, both video soundtracks can be played at 0.5 volume, so that both can still be distinguished.

In most encodings, videos have no alpha channel: all pixels are opaque. This is done in order to optimise the size of the video file.

For some video effects, the missing alpha channel is not an issue:

  • displaying a video on top of another;
  • displaying a video next to another;
  • playing videos one after the other, with various transitions between them;
  • using fade effects, where each image’s alpha is set to 0.5 when merging (see the crossfade sketch after this list).
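As a sketch of the last item, the fragment below crossfades between the current frames of two videos, assuming they are bound as srcTexture1 and srcTexture2 and that the host application provides a progress uniform (a name assumed here for illustration); setting progress to 0.5 gives the 50/50 merge described above.

precision highp float;
varying highp vec2 textureCoordinate;
uniform sampler2D srcTexture1; // current frame of video 1
uniform sampler2D srcTexture2; // current frame of video 2
uniform float progress;        // 0.0 = only video 1, 1.0 = only video 2, 0.5 = equal merge

void main()
{
    vec4 frame1 = texture2D(srcTexture1, textureCoordinate);
    vec4 frame2 = texture2D(srcTexture2, textureCoordinate);

    // Linear blend of the two frames; no alpha channel is required in the sources.
    gl_FragColor = mix(frame1, frame2, progress);
}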

In order to cover some of the effects that require alpha, there are various workarounds; the most commonly used are chroma key and trackmatte (sketched below). Chroma key uses a specific colour (usually green) to dictate the alpha value: the closer a pixel is to this colour, the more transparent it becomes during processing. Trackmatte uses a black and white matte (a third, greyscale image) supplied alongside the two videos. For example, if the matte pixel is white, the pixel from video 1 is returned; if it is black, the pixel from video 2 is returned; and any shade in between results in a merge.
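Hedged sketches of both workarounds are shown below. The uniform names (keyColor, threshold, smoothing, matteTexture) are assumptions made for illustration, not part of a specific library.

// Chroma key (sketch): pixels close to the key colour become transparent
precision highp float;
varying highp vec2 textureCoordinate;
uniform sampler2D srcTexture1; // keyed video, e.g. filmed on a green screen
uniform sampler2D srcTexture2; // background video
uniform vec3 keyColor;         // the key colour, usually green
uniform float threshold;       // distance below which the pixel is fully transparent
uniform float smoothing;       // width of the soft edge around the threshold

void main()
{
    vec4 foreground = texture2D(srcTexture1, textureCoordinate);
    vec4 background = texture2D(srcTexture2, textureCoordinate);

    // The closer the pixel is to the key colour, the more transparent it becomes.
    float distanceToKey = distance(foreground.rgb, keyColor);
    float alpha = smoothstep(threshold, threshold + smoothing, distanceToKey);

    gl_FragColor = vec4(mix(background.rgb, foreground.rgb, alpha), 1.0);
}

A trackmatte fragment follows the same pattern, with a third texture deciding the merge:

// Trackmatte (sketch): a black and white matte selects between the two videos
precision highp float;
varying highp vec2 textureCoordinate;
uniform sampler2D srcTexture1;  // video 1
uniform sampler2D srcTexture2;  // video 2
uniform sampler2D matteTexture; // black and white matte

void main()
{
    vec4 frame1 = texture2D(srcTexture1, textureCoordinate);
    vec4 frame2 = texture2D(srcTexture2, textureCoordinate);
    float matte = texture2D(matteTexture, textureCoordinate).r;

    // White matte -> video 1, black matte -> video 2, grey -> a merge of the two.
    gl_FragColor = mix(frame2, frame1, matte);
}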

If you would like to read more about our most advanced mobile video processing application, please follow this article.