Histogram Equalization Using Video Frame Buffer
Video processing applications often store a full frame of video data in order to process that frame and modify the next one. In such designs, video frames are stored in external memory while FPGA resources are used to process the same data. This example shows how to design a video application with HDMI input and output that performs histogram equalization, using external memory for video frame buffering.
Contents
Supported Hardware Platform
- Xilinx® Zynq® ZC706 evaluation kit + FMC-HDMI-CAM mezzanine card
Design Task and System Requirements
Consider an application involving continuous streaming of video data through the FPGA. The FPGA calculates the histogram of the incoming video stream, in the 'FPGA' subsystem, while streaming the same video stream to external memory for storage. Once the histogram has been calculated and accumulated across the entire video frame, a synchronization signal is toggled to trigger the read back of the stored frame from external memory. The accumulated histogram vector is then applied to the video stream read back from external memory to perform the equalization algorithm. The external memory frame buffer is modeled using the 'Memory Channel' block in AXI4-Stream Video Frame Buffer mode.
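Although the 'FPGA' subsystem computes the histogram in a streaming, pixel-by-pixel fashion, the underlying algorithm is easier to see frame-at-a-time. The following is a minimal MATLAB sketch of that behavior, assuming the Image Processing Toolbox is available and using a built-in test image in place of an HDMI frame; it uses a simplified full-range mapping rather than the limited-range YCbCr encoding of the real video stream.

% Frame-based sketch: accumulate a 256-bin histogram of the luma (Y)
% component, then remap the buffered frame through its cumulative
% distribution - the same two passes the hardware performs via the
% external memory frame buffer.
rgb   = imread('peppers.png');        % stand-in for one video frame
ycbcr = rgb2ycbcr(rgb);
Y     = ycbcr(:,:,1);

counts = imhist(Y, 256);              % histogram accumulated over the frame
cdf    = cumsum(counts) / numel(Y);   % normalized cumulative distribution
lut    = uint8(255 * cdf);            % equalization lookup table (full-range)
ycbcr(:,:,1) = intlut(Y, lut);        % apply mapping to the frame read back from memory

imshow(ycbcr2rgb(ycbcr));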
The 'HDMI Input' block reads a video file and provides video data and control signals to downstream FPGA processing blocks. Video data is in YCbCr 4:2:2 format, and the control signals are in the pixel control bus format. The 'HDMI Output' block reads video data and control signals, in the same format as output by the 'HDMI Input' block, and provides a visual output using the Video Display block.
The Push Button block enables bypassing of the histogram equalization algorithm, routing the unprocessed output from the external memory frame buffer to the output.

There are a number of requirements to consider when designing an application that interfaces with external memory:
- Throughput: What is the rate that you need to transfer data to/from memory to satisfy the requirements of your algorithm? Specifically for vision applications, what is the frame-size and frame-rate that you must be able to maintain?
- Latency: What is the maximum amount of time that your algorithm can tolerate between requesting and receiving data? For vision applications, do you need a continuous stream of data, without gaps? Are you able to buffer samples internal to your algorithm in order to prevent data loss when access to the memory is blocked?
For this histogram equalization example, we have defined the following requirements:
- Throughput must be sufficient to maintain a 1920x1080p video stream at 60 frames-per-second.
- Latency must be sufficiently low so as not to drop frames.
With the above throughput requirement, we can calculate the bandwidth that is required for the frame buffer.
As the video format is YCbCr 4:2:2, we require 2 bytes-per-pixel (BPP). This equates to a throughput requirement of
1920 x 1080 x 60 x 2 = 248,832,000 bytes per second, or approximately 248.8 MB/s.
Because the algorithm must both write and read the video data to/from the external memory, this throughput requirement must be doubled, for a total throughput requirement of
2 x 248.8 MB/s = 497.7 MB/s.
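These figures are easy to verify at the MATLAB command line (the variable names below are used only for illustration):

cols = 1920; rows = 1080; fps = 60; bpp = 2;   % YCbCr 4:2:2 -> 2 bytes per pixel
writeBW = cols * rows * fps * bpp              % 248,832,000 bytes/s, ~248.8 MB/s
totalBW = 2 * writeBW                          % write + read, ~497.7 MB/s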
Design Using SoC Blockset
In general, your algorithm will be a part of a larger SoC application. In such applications, it is likely that other algorithms will also require access to external memory. In this scenario, you must consider the impact of the other algorithms' memory accesses on the performance and requirements of your algorithm. Assuming that your algorithm shares the memory channel with other components, you should consider the following:
- What is the total available memory bandwidth in the SoC system?
- How will your algorithm adapt to shared memory bandwidth?
- Can your algorithm tolerate an increased read/write latency?
By appropriate modeling of additional memory consumers in the overall application, you can systematically design your algorithm to meet your requirements in situations where access to the memory is not exclusive to your algorithm.
To avoid modeling every memory reader and writer in the overall system, you can use 'Memory Traffic Generator' blocks to consume read/write bandwidth in your system by creating access requests. In this way, you can simulate additional memory accesses within your system without modeling them explicitly.
Modeling Additional Memory Consumers
When implemented on hardware, the HDMI output requires an additional frame buffer for synchronization of the video stream data between clock-domains, and introduces an additional memory consumer in the overall system. You can model this using Memory Traffic Generator blocks to simulate the additional memory consumption. As we are modeling both read and write transactions, we will use two Memory Traffic Generator blocks - one each for read and write.
Based on the throughput calculation for our 1080p video stream, we know that the additional frame buffer will require approximately 497.7 MB/s of bandwidth for simultaneous read and write access (248.8 MB/s in each direction).
The write transactions are modeled by the HDMI Buffer Write block and the read transactions by the HDMI Buffer Read block. The block mask settings for both are shown below.
The total burst requests are configured as inf, as we want to simulate a continuous stream of data to/from the memory. This will ensure that the traffic generator block will continue to issue transaction requests for the entirety of the simulation.
The burst size is specified as 192 bytes, which is one tenth of the 1920 pixels per line. As the burst size is specified in bytes, this is equivalent to one tenth of a line of a single component of the output video stream, i.e. one tenth of a line of the Y component of the YCbCr 4:2:2 video stream.
The time between bursts is specified as 1/1296000 seconds. This can be expanded as
1/1296000 = 192 / (1920 x 2 x 1080 x 60)
where,
192 is the number of bytes per burst,
1080 is the number of lines in the video stream,
1920 is the number of pixels per line in the video stream,
60 is the number of frames-per-second and,
2 is the number of components in our video stream.
Putting the above parameters together, we can calculate our requested throughput as follows:
192 bytes per burst x 1,296,000 bursts per second = 248,832,000 bytes per second, or approximately 248.8 MB/s.
And, as we have two traffic generators to simulate both read and write transactions, the total bandwidth consumption will be 2 x 248.8 MB/s = 497.7 MB/s.
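The mask settings and the resulting request rate can be cross-checked with a few lines of MATLAB (variable names are illustrative only):

burstSize    = 1920/10;                    % 192 bytes: one tenth of a line of the Y component
streamBytes  = 1920 * 2 * 1080 * 60;       % 248,832,000 bytes/s for the full 4:2:2 stream
burstsPerSec = streamBytes / burstSize     % 1,296,000 bursts per second
timeBetween  = 1 / burstsPerSec            % 1/1296000 s, as entered in the block mask
requestedBW  = burstSize * burstsPerSec    % ~248.8 MB/s per traffic generator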
Simulating the system with the above settings results in the following Memory Bandwidth Usage plot.
Here, the memory masters are as follows:
- Master 1: Frame Buffer write
- Master 2: Frame Buffer read
- Master 3: HDMI Buffer Write (Memory Traffic Generator)
- Master 4: HDMI Buffer Read (Memory Traffic Generator)
- Master 5: Contention (Memory Traffic Generator) (commented out)
You can see that each of the 4 active masters is consuming approximately 248.8 MB/s of memory bandwidth.
More Memory Consumers: Consider that your algorithm is part of a larger system, and a secondary algorithm is being developed by a colleague or third party. In this scenario, the secondary algorithm is developed separately, in the interest of time and division of work. Rather than combine the two algorithms into a single simulation, you can model the memory accesses of the secondary algorithm using a Memory Traffic Generator, and simulate the impact, if any, that it will have on your algorithm.
For example, assume that you are provided with the following memory requirements for the secondary algorithm:
- Throughput: 650 MB/s
Given that at any one time the primary algorithm, plus the HDMI output frame buffer, consumes ~995 MB/s of memory bandwidth, and that our total available memory bandwidth is 1600 MB/s, the total bandwidth requirement for our system exceeds the total available bandwidth by ~50 MB/s.
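A back-of-the-envelope budget using these approximate figures shows the shortfall:

perMaster = 248.8e6;                        % approximate bandwidth per active master
primary   = 4 * perMaster;                  % frame buffer R/W + HDMI buffer R/W, ~995 MB/s
secondary = 650e6;                          % stated requirement of the secondary algorithm
available = 1600e6;                         % total available memory bandwidth
deficit   = primary + secondary - available % roughly 45-50 MB/s over budget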
To enable the modeling of the secondary algorithm memory access, uncomment the Contention Memory Traffic Generator block. The block mask settings are shown below.
Simulating the system with the secondary algorithm's memory accesses, results in the following Memory Bandwidth Usage plot.
As you can see, the combined required memory bandwidth exceeds the available bandwidth at around 0.03s - when the secondary algorithm begins memory access requests, resulting in the other masters not achieving their required throughput. Looking at the logic analyzer waveform, we can see this manifested as dropped buffers for the Frame Buffer write master, meaning that the input video frame will not be written to memory.
Implement and Run on Hardware
The following products are required for this section:
To implement the model on a supported SoC board, use the SoC Builder application. Open the mask of the 'FPGA' subsystem and set the model variant to 'Pixel based processing'.
Comment out the 'HDMI Buffer Write', 'HDMI Buffer Read' and 'Contention' blocks.
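If you prefer to comment the blocks out programmatically rather than through the block context menu, you can use the standard 'Commented' block parameter. The model name and block paths below are assumptions; adjust them to match the names in your copy of the model.

mdl = 'soc_histogram_equalization';                       % assumed model name
open_system(mdl);
set_param([mdl '/HDMI Buffer Write'], 'Commented', 'on'); % comment out traffic generators
set_param([mdl '/HDMI Buffer Read'],  'Commented', 'on');
set_param([mdl '/Contention'],        'Commented', 'on');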
Open SoC Builder from the Tools menu and follow these steps:
- Select 'Build Model' on the 'Setup' screen. Click 'Next'.
- Click 'View/Edit Memory Map' to view the memory map on the 'Review Memory Map' screen. Click 'Next'.
- Specify the project folder on the 'Select Project Folder' screen. Click 'Next'.
- Select 'Build, load and run' on the 'Select Build Action' screen. Click 'Next'.
- Click 'Validate' to check the compatibility of the model for implementation on the 'Validate Model' screen. Click 'Next'.
- Click 'Build' to begin building the model on the 'Build Model' screen. An external shell opens when FPGA synthesis begins. Click 'Next'.
- Click 'Next' to go to the 'Load Bitstream' screen.
The FPGA synthesis may take more than 30 minutes to complete. To save time, you may want to use the provided pre-generated bitstream by following these steps:
- Close the external shell to terminate synthesis.
- Copy the pre-generated bitstream to your project folder by running the command below, and then
- Click the 'Load and Run' button to load the pre-generated bitstream and run the model on the SoC board.
>>copyfile(fullfile(matlabroot,'toolbox','soc','socexamples','bitstreams','soc_histogram_equalization_top-zc706.bit'), './soc_prj');
To run the model, execute the AXI master test bench soc_histogram_equalization_top_aximaster.
The following figure shows the Memory Bandwidth usage when the application is deployed on hardware.
Summary
You designed a video application with real time HDMI I/O and frame buffering in external memory. You explored effects of other consumers of memory on overall bandwidth. You used SoC Builder to implement the model on hardware and verify the design.