Clock Rate Pipelining

This example shows how to apply clock-rate pipelining to optimize slow paths in your design and thereby reduce latency, increase clock frequency, and decrease area usage. For more information on how to use clock-rate pipelining, see Clock-Rate Pipelining.

Introduction

Algorithmic design with Simulink may introduce many slow-rate datapaths in the generated HDL design. These slow paths arise either from operations that run at a slower Simulink sample time or from an algorithmic data-rate that is slower than the HDL clock rate.

Clock-rate pipelining identifies the maximal subregions in the model that operate at the same data rate and are delimited by either rate-change blocks or Delay blocks. These subregions are called clock-rate regions because they make good candidates for clock-rate pipelining. If the output of a clock-rate region is a Delay block at the data rate, then HDL Coder absorbs that Delay block. This allows a budget of several clock-rate pipelines, corresponding to the ratio of the data-rate sample time to the clock period.

Consider the field-oriented control example Field-Oriented Control of a Permanent Magnet Synchronous Machine. It describes a motor-control design to be mapped to an FPGA. The input samples in this design arrive every 20 $\mu s$, that is, at 50 kHz. In a closed control loop, it is essential that the controller's latency is within the desired response time. In this model, there is a delay on the output port, resulting in a latency of 20 $\mu s$.

To meet design constraints like timing and area, we may want to apply several optimizations like input/output pipelining, distributed pipelining, streaming and/or sharing. Further, non-trivial math functions like sqrt or divide may have to be implemented as multi-cycle pipelined operations. Pipelines introduced by any of the above features and optimizations are applied at the same rate at which the signal path operates, which is 20 $\mu s$. Thus, introducing any additional pipelining introduces undesirable latency overhead and may violate the closed loop latency budget.

However, the FPGA can implement this controller at a clock rate on the order of MHz, which means that the introduced pipelines can operate at the MHz rate, thereby minimizing the impact on latency. Clock-rate pipelining is a technique to leverage this rate differential, pipeline the controller, and thereby improve its area and timing characteristics on the FPGA. This example walks through the steps for taking this design and incrementally applying timing and area optimizations using clock-rate pipelining.
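To make the latency argument concrete, the following minimal sketch compares the delay added by the same number of pipeline registers at the data rate and at the clock rate. The stage count of 10 is a hypothetical value chosen for illustration, and the 25 ns clock period assumes the 40 MHz target used later in this example.

```matlab
% Latency cost of pipeline stages at the data rate versus the clock rate.
dataPeriod  = 20e-6;   % data-rate sample time from the text (20 us)
clockPeriod = 25e-9;   % ASSUMPTION: 40 MHz FPGA clock, as targeted later in this example
numStages   = 10;      % HYPOTHETICAL number of inserted pipeline registers

latencyDataRate  = numStages * dataPeriod;    % each register costs one data-rate step
latencyClockRate = numStages * clockPeriod;   % each register costs one clock cycle

fprintf('Added latency at the data rate:  %g us\n', latencyDataRate  * 1e6);
fprintf('Added latency at the clock rate: %g us\n', latencyClockRate * 1e6);
```

With these assumed numbers, the clock-rate registers fit well inside a single 20 $\mu s$ sample step, whereas the data-rate registers would add ten full sample steps.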

Preparing the model

An important first step in applying clock-rate pipelining is to prepare the model so that it is amenable to clock-rate pipelining. Below are some of the main steps:

  • Defining the rate differential: Signal paths in Simulink end up on slow paths in HDL for two primary reasons. First, the signal path may operate at a sample time that is slower than the base sample time of the model. Second, the Simulink base sample time may correspond to the data-rate instead of the clock-rate. For example, the base sample time in the hdlcoderFocCurrentFixptHdl model is 20 $\mu s$. The final FPGA implementation of the controller may target 40 MHz (a clock period of 25 ns).

open_system('hdlcoderFocCurrentFixptHdl')

The trouble with setting the model's sample time to 25 ns is that it drastically slows down Simulink simulation performance. To get around this, HDL Coder provides a setting called Oversampling factor, which specifies how much faster the FPGA clock rate runs with respect to the Simulink base sample time. Thus, in this case, we require an 800x oversampling.

  • Set optimizations on subsystems: For fixed-point designs, clock-rate pipelining is applied on a Subsystem only when the coder needs to insert pipelining. HDL Coder options that result in the introduction of pipelining are distributed pipelining, sharing, streaming, input/output pipelining, constrained output pipeline, adaptive pipelining, and any block implementations that introduce multi-cycle implementations, including floating point implementations (for more details, refer to the documentation on individual blocks to understand the impact of their HDL properties on latency). Optimizations can be applied either locally, by maintaining the subsystem hierarchies, or globally, if the underlying subsystems are all flattened. In the local case, apply pipelining and optimization settings on individual subsystems; in the global case, apply these settings on the top-level subsystem. See Hierarchy Flattening for more information on prerequisites for Hierarchy Flattening.
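The Oversampling factor from the first step above follows directly from the base sample time and the target clock frequency. A minimal sketch of that arithmetic, using only the values quoted in the text:

```matlab
% Derive the Oversampling factor from the data rate and the target clock.
baseSampleTime = 20e-6;                 % Simulink base sample time (20 us)
clockFrequency = 40e6;                  % target FPGA clock frequency (40 MHz)
clockPeriod    = 1 / clockFrequency;    % 25 ns

% round() guards against floating-point error in the division.
oversampling = round(baseSampleTime / clockPeriod);
fprintf('Oversampling factor: %d\n', oversampling);
```

The result is the value passed to the 'Oversampling' parameter in the next section.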

Applying Clock-rate Pipelining

Now, we are ready to apply clock-rate pipelining. The feature option is on by default and will automatically find clock-rate regions. See Clock-Rate Pipelining to understand how the pipeline budget is determined and how clock-rate regions are formed.

srcHdlModel = 'hdlcoderFocCurrentFixptHdl';
dstHdlModel = 'hdlcoderFocClockRatePipelining';
dstHdlDut   = [dstHdlModel '/FOC_Current_Control'];
gmHdlModel  = ['gm_' dstHdlModel];
gmHdlDut    = ['gm_' dstHdlDut];

open_system(srcHdlModel);
save_system(srcHdlModel,dstHdlModel);

The subsystem FOC_Current_Control contains the algorithm from which we will generate HDL code.

open_system(dstHdlDut);

We can now configure the model to use clock-rate pipelining.

hdlset_param(dstHdlModel, 'ClockRatePipelining', 'on');
hdlset_param(dstHdlModel, 'Oversampling', 800);
hdlset_param(dstHdlDut, 'DistributedPipelining', 'on');
set_param([dstHdlDut '/DQ_Current_Control/D_Current_Control'], 'TreatAsAtomicUnit', 'off');
set_param([dstHdlDut '/DQ_Current_Control/Q_Current_Control'], 'TreatAsAtomicUnit', 'off');

hdlset_param([dstHdlDut '/DQ_Current_Control/D_Current_Control'], 'DistributedPipelining', 'on');
hdlset_param([dstHdlDut '/DQ_Current_Control/Q_Current_Control'], 'DistributedPipelining', 'on');

hdlset_param([dstHdlDut '/Clarke_Transform'], 'DistributedPipelining', 'on');
hdlset_param([dstHdlDut '/Park_Transform'], 'DistributedPipelining', 'on');
hdlset_param([dstHdlDut '/Sine_Cosine'], 'DistributedPipelining', 'on');
hdlset_param([dstHdlDut '/Inverse_Park_Transform'], 'DistributedPipelining', 'on');
hdlset_param([dstHdlDut '/Inverse_Clarke_Transform'], 'DistributedPipelining', 'on');
hdlset_param([dstHdlDut '/Space_Vector_Modulation'], 'DistributedPipelining', 'on');

save_system(dstHdlModel);
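
Before generating code, it can help to read back the settings that were just applied. The sketch below assumes the variables dstHdlModel and dstHdlDut defined above and uses the HDL Coder functions hdlget_param and hdldispmdlparams; it requires an HDL Coder installation and is not part of the original example.

```matlab
% Read back the model-level settings applied above.
crp = hdlget_param(dstHdlModel, 'ClockRatePipelining');   % expected to be 'on'
osf = hdlget_param(dstHdlModel, 'Oversampling');          % expected to be 800
fprintf('ClockRatePipelining = %s, Oversampling = %d\n', crp, osf);

% Read back one subsystem-level setting as a spot check.
dp = hdlget_param([dstHdlDut '/Park_Transform'], 'DistributedPipelining');
fprintf('Park_Transform DistributedPipelining = %s\n', dp);

% Display the non-default model-level HDL parameters.
hdldispmdlparams(dstHdlModel);
```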

To see the impact of clock-rate pipelining, generate HDL code and look inside the top-level subsystem of the generated model.

makehdl(dstHdlDut);
### Generating HDL for 'hdlcoderFocClockRatePipelining/FOC_Current_Control'.
### Using the config set for model hdlcoderFocClockRatePipelining for HDL code generation parameters.
### Starting HDL check.
### To highlight blocks that obstruct distributed pipelining, click the following MATLAB script: hdlsrc/hdlcoderFocClockRatePipelining/highlightDistributedPipeliningBarriers.m
### To clear highlighting, click the following MATLAB script: hdlsrc/hdlcoderFocClockRatePipelining/clearhighlighting.m
### Generating new validation model: gm_hdlcoderFocClockRatePipelining_vnl.
### Validation model generation complete.
### Begin VHDL Code Generation for 'hdlcoderFocClockRatePipelining'.
### MESSAGE: The design requires 800 times faster clock with respect to the base rate = 2e-05.
### Working on hdlcoderFocClockRatePipelining/FOC_Current_Control/DQ_Current_Control/D_Current_Control/Saturate_Output as hdlsrc/hdlcoderFocClockRatePipelining/Saturate_Output.vhd.
### Working on hdlcoderFocClockRatePipelining/FOC_Current_Control/DQ_Current_Control/D_Current_Control as hdlsrc/hdlcoderFocClockRatePipelining/D_Current_Control.vhd.
### Working on hdlcoderFocClockRatePipelining/FOC_Current_Control/DQ_Current_Control/Q_Current_Control/Saturate_Output as hdlsrc/hdlcoderFocClockRatePipelining/Saturate_Output_block.vhd.
### Working on hdlcoderFocClockRatePipelining/FOC_Current_Control/DQ_Current_Control/Q_Current_Control as hdlsrc/hdlcoderFocClockRatePipelining/Q_Current_Control.vhd.
### Working on hdlcoderFocClockRatePipelining/FOC_Current_Control/DQ_Current_Control as hdlsrc/hdlcoderFocClockRatePipelining/DQ_Current_Control.vhd.
### Working on hdlcoderFocClockRatePipelining/FOC_Current_Control/Sine_Cosine/Sine_Cosine_LUT as hdlsrc/hdlcoderFocClockRatePipelining/Sine_Cosine_LUT.vhd.
### Working on hdlcoderFocClockRatePipelining/FOC_Current_Control/Sine_Cosine as hdlsrc/hdlcoderFocClockRatePipelining/Sine_Cosine.vhd.
### Working on hdlcoderFocClockRatePipelining/FOC_Current_Control/Clarke_Transform as hdlsrc/hdlcoderFocClockRatePipelining/Clarke_Transform.vhd.
### Working on hdlcoderFocClockRatePipelining/FOC_Current_Control/Park_Transform as hdlsrc/hdlcoderFocClockRatePipelining/Park_Transform.vhd.
### Working on hdlcoderFocClockRatePipelining/FOC_Current_Control/Inverse_Park_Transform as hdlsrc/hdlcoderFocClockRatePipelining/Inverse_Park_Transform.vhd.
### Working on hdlcoderFocClockRatePipelining/FOC_Current_Control/Inverse_Clarke_Transform as hdlsrc/hdlcoderFocClockRatePipelining/Inverse_Clarke_Transform.vhd.
### Working on hdlcoderFocClockRatePipelining/FOC_Current_Control/Space_Vector_Modulation as hdlsrc/hdlcoderFocClockRatePipelining/Space_Vector_Modulation.vhd.
### Working on FOC_Current_Control_tc as hdlsrc/hdlcoderFocClockRatePipelining/FOC_Current_Control_tc.vhd.
### Working on hdlcoderFocClockRatePipelining/FOC_Current_Control as hdlsrc/hdlcoderFocClockRatePipelining/FOC_Current_Control.vhd.
### Generating package file hdlsrc/hdlcoderFocClockRatePipelining/FOC_Current_Control_pkg.vhd.
### Generating HTML files for code generation report at hdlcoderFocClockRatePipelining_codegen_rpt.html
### Creating HDL Code Generation Check Report file:///tmp/BR2020bd_1444674_32127/publish_examples0/tpb1b885ef/hdlsrc/hdlcoderFocClockRatePipelining/FOC_Current_Control_report.html
### HDL check for 'hdlcoderFocClockRatePipelining' complete with 0 errors, 0 warnings, and 2 messages.
### HDL code generation complete.

We can review the generated model and observe that the entire design has been clock-rate pipelined and is running at the fast rate. If there are subsystems in the generated model that are not clock-rate pipelined, then check (as mentioned above) whether optimizations were set on the subsystem in the original model.

open_system(gmHdlDut);
set_param(gmHdlModel, 'SimulationCommand', 'update');
set_param(gmHdlDut, 'ZoomFactor', 'FitSystem');

Further, rate-transitions are introduced on the design inputs to bring them to the clock-rate, which is determined as the original base sample time divided by the Oversampling factor: 2e-5/800 = 2.5e-8, or 25 ns. All pipelines are introduced at this rate and thus operate at the clock-rate. Finally, observe that the output-side delay has been replaced by a down-sampling rate transition that brings the signal back to the data-rate. The clock frequency of the design was improved by inserting pipelines at the clock-rate, without incurring any additional sample time delays.

As with all optimizations, it is recommended that you generate the validation model and co-simulation model and verify that the functional behavior of the design is unchanged. See Verification for more on these concepts.

Local optimization with subsystem options

The rate differential on a slow path implies that computation along this path can take several clock cycles. Specifically, the allowed latency is defined by the clock-rate budget (see Clock-Rate Pipelining). Apart from adding pipelines to improve clock frequency, we could reuse hardware resources by leveraging the latency budget. Setting resource sharing options like StreamingFactor and SharingFactor in a slow-path region does exactly that. This section demonstrates how resource sharing is applied within clock-rate regions.

When resource sharing is applied to a clock-rate path, HDL Coder oversamples the shared resource architecture for time-multiplexing, as illustrated in Resource Sharing For Area Optimization. However, if sharing or streaming is requested in a slow datapath, then HDL Coder implements resource sharing without oversampling. To trigger such sharing, set either sharing or streaming on the subsystem on which you want to apply resource sharing or streaming.

srcHdlModel = 'hdlcoderFocClockRatePipelining';
dstHdlModel = 'hdlcoderFocSharing';
dstHdlDut   = [dstHdlModel '/FOC_Current_Control'];
gmHdlModel  = ['gm_' dstHdlModel];
gmHdlDut    = ['gm_' dstHdlDut];

open_system(srcHdlModel);
save_system(srcHdlModel,dstHdlModel);

open_system(dstHdlDut);
hilite_system([dstHdlDut '/Park_Transform']);
hilite_system([dstHdlDut '/Inverse_Park_Transform']);
hilite_system([dstHdlDut '/Clarke_Transform']);

The Park_Transform subsystem and the Inverse_Park_Transform subsystem each use 4 multipliers within them that can be potentially shared. Additionally, the Clarke_Transform subsystem and the Inverse_Clarke_Transform subsystem each use 2 gains, which may be potentially shared, unless they are simply power-of-2 gains, which result in shifts instead of multiplications. Hence, the gain in Inverse_Clarke_Transform cannot be shared. Now, we can set the appropriate sharing factors on each of the subsystems on which we want to apply resource sharing.

hdlset_param([dstHdlDut '/Park_Transform'], 'SharingFactor', 4);
hdlset_param([dstHdlDut '/Inverse_Park_Transform'], 'SharingFactor', 4);
hdlset_param([dstHdlDut '/Clarke_Transform'], 'SharingFactor', 2);

save_system(dstHdlModel);

makehdl(dstHdlDut);
### Generating HDL for 'hdlcoderFocSharing/FOC_Current_Control'.
### Using the config set for model hdlcoderFocSharing for HDL code generation parameters.
### Starting HDL check.
### To highlight blocks that obstruct distributed pipelining, click the following MATLAB script: hdlsrc/hdlcoderFocSharing/highlightDistributedPipeliningBarriers.m
### To clear highlighting, click the following MATLAB script: hdlsrc/hdlcoderFocSharing/clearhighlighting.m
### Generating new validation model: gm_hdlcoderFocSharing_vnl.
### Validation model generation complete.
### Begin VHDL Code Generation for 'hdlcoderFocSharing'.
### MESSAGE: The design requires 800 times faster clock with respect to the base rate = 2e-05.
### Working on hdlcoderFocSharing/FOC_Current_Control/DQ_Current_Control/D_Current_Control/Saturate_Output as hdlsrc/hdlcoderFocSharing/Saturate_Output.vhd.
### Working on hdlcoderFocSharing/FOC_Current_Control/DQ_Current_Control/D_Current_Control as hdlsrc/hdlcoderFocSharing/D_Current_Control.vhd.
### Working on hdlcoderFocSharing/FOC_Current_Control/DQ_Current_Control/Q_Current_Control/Saturate_Output as hdlsrc/hdlcoderFocSharing/Saturate_Output_block.vhd.
### Working on hdlcoderFocSharing/FOC_Current_Control/DQ_Current_Control/Q_Current_Control as hdlsrc/hdlcoderFocSharing/Q_Current_Control.vhd.
### Working on hdlcoderFocSharing/FOC_Current_Control/DQ_Current_Control as hdlsrc/hdlcoderFocSharing/DQ_Current_Control.vhd.
### Working on Clarke_Transform_shared as hdlsrc/hdlcoderFocSharing/Clarke_Transform_shared.vhd.
### Working on hdlcoderFocSharing/FOC_Current_Control/Clarke_Transform as hdlsrc/hdlcoderFocSharing/Clarke_Transform.vhd.
### Working on hdlcoderFocSharing/FOC_Current_Control/Sine_Cosine/Sine_Cosine_LUT as hdlsrc/hdlcoderFocSharing/Sine_Cosine_LUT.vhd.
### Working on hdlcoderFocSharing/FOC_Current_Control/Sine_Cosine as hdlsrc/hdlcoderFocSharing/Sine_Cosine.vhd.
### Working on Park_Transform_shared as hdlsrc/hdlcoderFocSharing/Park_Transform_shared.vhd.
### Working on hdlcoderFocSharing/FOC_Current_Control/Park_Transform as hdlsrc/hdlcoderFocSharing/Park_Transform.vhd.
### Working on Inverse_Park_Transform_shared as hdlsrc/hdlcoderFocSharing/Inverse_Park_Transform_shared.vhd.
### Working on hdlcoderFocSharing/FOC_Current_Control/Inverse_Park_Transform as hdlsrc/hdlcoderFocSharing/Inverse_Park_Transform.vhd.
### Working on hdlcoderFocSharing/FOC_Current_Control/Inverse_Clarke_Transform as hdlsrc/hdlcoderFocSharing/Inverse_Clarke_Transform.vhd.
### Working on hdlcoderFocSharing/FOC_Current_Control/Space_Vector_Modulation as hdlsrc/hdlcoderFocSharing/Space_Vector_Modulation.vhd.
### Working on FOC_Current_Control_tc as hdlsrc/hdlcoderFocSharing/FOC_Current_Control_tc.vhd.
### Working on hdlcoderFocSharing/FOC_Current_Control as hdlsrc/hdlcoderFocSharing/FOC_Current_Control.vhd.
### Generating package file hdlsrc/hdlcoderFocSharing/FOC_Current_Control_pkg.vhd.
### Generating HTML files for code generation report at hdlcoderFocSharing_codegen_rpt.html
### Creating HDL Code Generation Check Report file:///tmp/BR2020bd_1444674_32127/publish_examples0/tpb1b885ef/hdlsrc/hdlcoderFocSharing/FOC_Current_Control_report.html
### HDL check for 'hdlcoderFocSharing' complete with 0 errors, 0 warnings, and 2 messages.
### HDL code generation complete.

We can review the generated model and observe that HDL Coder implements time-multiplexing in the clock-rate using knowledge of the available latency budget due to the slow datapath.

open_system(gmHdlDut);
set_param(gmHdlModel, 'SimulationCommand', 'update');
set_param(gmHdlDut, 'ZoomFactor', 'FitSystem');
hilite_system([gmHdlDut '/ctr_799']);
hilite_system([gmHdlDut '/ctr_7991']);
hilite_system([gmHdlDut '/Clarke_Transform/Clarke_Transform_shared']);
hilite_system([gmHdlDut '/Park_Transform/Park_Transform_shared']);
hilite_system([gmHdlDut '/Inverse_Park_Transform/Inverse_Park_Transform_shared']);

The time-multiplexing architecture, also known as the single-rate sharing architecture, is described in Single-rate Resource Sharing Architecture. A global scheduler is created to enable and disable different regions of the design using enabled subsystems. The enable/disable control is implemented using the limited counters ctr_799 and ctr_7991, which count up to the latency budget (0 to 799). The shared regions are implemented as enabled subsystems that are enabled according to an automatically determined schedule order. In this design, we found 2 groups of multipliers that were shared 4 ways and 1 group of multipliers that was shared 2 ways. The multiplier count for the design has been reduced from 20 to 13 without any latency penalties.
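
The reduction from 20 to 13 multipliers can be checked with a short calculation: each group of N shared multipliers collapses to a single time-multiplexed multiplier and therefore saves N-1. A minimal sketch using the group sizes stated above:

```matlab
% Multiplier savings from the sharing groups reported for this design.
totalMultipliers = 20;        % multiplier count before sharing, from the text
groupSizes       = [4 4 2];   % two 4-way groups and one 2-way group, from the text

savedMultipliers = sum(groupSizes - 1);                 % each group keeps one multiplier
remaining        = totalMultipliers - savedMultipliers;

fprintf('Multipliers saved: %d, remaining: %d\n', savedMultipliers, remaining);
```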

Global optimization with flattening

Global cross-subsystem optimizations can be applied by leveraging the subsystem-flattening feature. With flattening, more resources can be shared at the same level of hierarchy. To trigger such sharing, set either sharing or streaming on the top-level subsystem. The sharing factor value chosen must be an upper bound. To determine a good value, analyze the resource usage of the design.
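
One way to analyze resource usage is to generate a resource utilization report for the design before any sharing is applied. The sketch below is an assumption about workflow, not part of the original example: it runs makehdl on the unmodified source model with the 'ResourceReport' property enabled, and the variable names mdl and dut are introduced here for illustration. It requires HDL Coder and writes generated files to the current folder.

```matlab
% ASSUMPTION: count multipliers on the unshared design to pick a SharingFactor.
mdl = 'hdlcoderFocCurrentFixptHdl';      % source model used throughout this example
dut = [mdl '/FOC_Current_Control'];      % DUT subsystem

load_system(mdl);

% Generate HDL with the resource utilization report enabled; the report
% lists multipliers, adders, and registers found in the generated design.
makehdl(dut, 'ResourceReport', 'on');
```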

srcHdlModel = 'hdlcoderFocCurrentFixptHdl';
dstHdlModel = 'hdlcoderFocSharingWithFlattening';
dstHdlDut   = [dstHdlModel '/FOC_Current_Control'];
gmHdlModel  = ['gm_' dstHdlModel];
gmHdlDut    = ['gm_' dstHdlDut];

open_system(srcHdlModel);
save_system(srcHdlModel,dstHdlModel);

open_system(dstHdlDut);

hdlset_param(dstHdlModel, 'ClockRatePipelining', 'on');
hdlset_param(dstHdlModel, 'Oversampling', 800);
hdlset_param(dstHdlDut, 'FlattenHierarchy', 'on');
hdlset_param(dstHdlDut, 'DistributedPipelining', 'on');

hilite_system([dstHdlDut '/Park_Transform']);
hilite_system([dstHdlDut '/Inverse_Park_Transform']);
hilite_system([dstHdlDut '/Clarke_Transform']);
hilite_system([dstHdlDut '/Inverse_Clarke_Transform']);

The Park_Transform subsystem and the Inverse_Park_Transform subsystem each use 4 multipliers within them that can be potentially shared. Additionally, the Clarke_Transform subsystem and the Inverse_Clarke_Transform subsystem each use 2 gains, which may be potentially shared, unless they are simply power-of-2 gains, which results in shifts instead of multiplications. Therefore, we can choose the upper-bound value of 4 for SharingFactor and generate code.

hdlset_param(dstHdlDut, 'SharingFactor', 4);
save_system(dstHdlModel);

makehdl(dstHdlDut);
### Generating HDL for 'hdlcoderFocSharingWithFlattening/FOC_Current_Control'.
### Using the config set for model hdlcoderFocSharingWithFlattening for HDL code generation parameters.
### Starting HDL check.
### To highlight blocks that obstruct distributed pipelining, click the following MATLAB script: hdlsrc/hdlcoderFocSharingWithFlattening/highlightDistributedPipeliningBarriers.m
### To clear highlighting, click the following MATLAB script: hdlsrc/hdlcoderFocSharingWithFlattening/clearhighlighting.m
### Generating new validation model: gm_hdlcoderFocSharingWithFlattening_vnl.
### Validation model generation complete.
### Begin VHDL Code Generation for 'hdlcoderFocSharingWithFlattening'.
### MESSAGE: The design requires 800 times faster clock with respect to the base rate = 2e-05.
### Working on crp_temp_shared as hdlsrc/hdlcoderFocSharingWithFlattening/crp_temp_shared.vhd.
### Working on crp_temp_shared_block as hdlsrc/hdlcoderFocSharingWithFlattening/crp_temp_shared_block.vhd.
### Working on crp_temp_shared_block1 as hdlsrc/hdlcoderFocSharingWithFlattening/crp_temp_shared_block1.vhd.
### Working on crp_temp_shared_block2 as hdlsrc/hdlcoderFocSharingWithFlattening/crp_temp_shared_block2.vhd.
### Working on crp_temp_shared_block3 as hdlsrc/hdlcoderFocSharingWithFlattening/crp_temp_shared_block3.vhd.
### Working on FOC_Current_Control_tc as hdlsrc/hdlcoderFocSharingWithFlattening/FOC_Current_Control_tc.vhd.
### Working on hdlcoderFocSharingWithFlattening/FOC_Current_Control as hdlsrc/hdlcoderFocSharingWithFlattening/FOC_Current_Control.vhd.
### Generating package file hdlsrc/hdlcoderFocSharingWithFlattening/FOC_Current_Control_pkg.vhd.
### Generating HTML files for code generation report at hdlcoderFocSharingWithFlattening_codegen_rpt.html
### Creating HDL Code Generation Check Report file:///tmp/BR2020bd_1444674_32127/publish_examples0/tpb1b885ef/hdlsrc/hdlcoderFocSharingWithFlattening/FOC_Current_Control_report.html
### HDL check for 'hdlcoderFocSharingWithFlattening' complete with 0 errors, 0 warnings, and 2 messages.
### HDL code generation complete.

We can review the generated model and observe that HDL Coder implements time-multiplexing in the clock-rate using knowledge of the available latency budget due to the slow datapath.

open_system(gmHdlDut);
set_param(gmHdlModel, 'SimulationCommand', 'update');
set_param(gmHdlDut, 'ZoomFactor', 'FitSystem');
hilite_system([gmHdlDut '/ctr_799']);
hilite_system([gmHdlDut '/crp_temp_shared']);
hilite_system([gmHdlDut '/crp_temp_shared1']);
hilite_system([gmHdlDut '/crp_temp_shared2']);
hilite_system([gmHdlDut '/crp_temp_shared3']);
hilite_system([gmHdlDut '/crp_temp_shared4']);

In this design, we found 5 groups of multipliers that were shared 4 ways or fewer. These 5 subsystems have crp_temp_shared as part of their names.

In summary, the multiplier count for the design has been reduced from 20 to 7 without any latency penalties, compared with 13 when the design was not flattened.

Minimizing latency

As an advanced maneuver, it is possible to reduce the output latency by removing the output Delay_Register and instead using the option to allow clock-rate pipelining of DUT output ports.

srcHdlModel = 'hdlcoderFocSharing';
dstHdlModel = 'hdlcoderFocMinLatency';
dstHdlDut   = [dstHdlModel '/FOC_Current_Control'];
gmHdlModel  = ['gm_' dstHdlModel];
gmHdlDut    = ['gm_' dstHdlDut];

open_system(srcHdlModel);
save_system(srcHdlModel,dstHdlModel);

delete_line(dstHdlDut,'Space_Vector_Modulation/1','Delay_Register/1');
delete_line(dstHdlDut,'Delay_Register/1','Phase_Voltage/1');
delete_block([dstHdlDut,'/Delay_Register'])
add_line(dstHdlDut,'Space_Vector_Modulation/1','Phase_Voltage/1');

open_system(dstHdlDut);

The clock-rate pipelining for output ports option is available in the configuration parameters dialog under the 'HDL Code Generation' -> 'Optimization' -> 'Pipelining' tab: check the 'Allow clock-rate pipelining of DUT output ports' option. The command-line property name for this option is 'ClockRatePipelineOutputPorts'. When the 'ClockRatePipelineOutputPorts' option is turned on and the output register is removed, the generated HDL code does not wait for the full sample step to generate the output. Instead, it generates the output at the clock-rate, within a few clock cycles of the data being ready.

hdlset_param(dstHdlModel, 'ClockRatePipelineOutputPorts', 'on');
save_system(dstHdlModel);

makehdl(dstHdlDut);
### Generating HDL for 'hdlcoderFocMinLatency/FOC_Current_Control'.
### Using the config set for model hdlcoderFocMinLatency for HDL code generation parameters.
### Starting HDL check.
### Clock-rate pipelining was applied on signals connected to the DUT's output ports. The DUT output port values are therefore updated at the clock-rate. The following ports are phase-offset by the stated number of clock cycles.
### Phase of output port 0: 11 clock cycles.
### To highlight blocks that obstruct distributed pipelining, click the following MATLAB script: hdlsrc/hdlcoderFocMinLatency/highlightDistributedPipeliningBarriers.m
### To clear highlighting, click the following MATLAB script: hdlsrc/hdlcoderFocMinLatency/clearhighlighting.m
### Generating new validation model: gm_hdlcoderFocMinLatency_vnl.
### Validation model generation complete.
### Begin VHDL Code Generation for 'hdlcoderFocMinLatency'.
### MESSAGE: The design requires 800 times faster clock with respect to the base rate = 2e-05.
### Working on Clarke_Transform_shared as hdlsrc/hdlcoderFocMinLatency/Clarke_Transform_shared.vhd.
### Working on hdlcoderFocMinLatency/FOC_Current_Control/Clarke_Transform as hdlsrc/hdlcoderFocMinLatency/Clarke_Transform.vhd.
### Working on hdlcoderFocMinLatency/FOC_Current_Control/DQ_Current_Control/D_Current_Control/Saturate_Output as hdlsrc/hdlcoderFocMinLatency/Saturate_Output.vhd.
### Working on hdlcoderFocMinLatency/FOC_Current_Control/DQ_Current_Control/D_Current_Control as hdlsrc/hdlcoderFocMinLatency/D_Current_Control.vhd.
### Working on hdlcoderFocMinLatency/FOC_Current_Control/DQ_Current_Control/Q_Current_Control/Saturate_Output as hdlsrc/hdlcoderFocMinLatency/Saturate_Output_block.vhd.
### Working on hdlcoderFocMinLatency/FOC_Current_Control/DQ_Current_Control/Q_Current_Control as hdlsrc/hdlcoderFocMinLatency/Q_Current_Control.vhd.
### Working on hdlcoderFocMinLatency/FOC_Current_Control/DQ_Current_Control as hdlsrc/hdlcoderFocMinLatency/DQ_Current_Control.vhd.
### Working on hdlcoderFocMinLatency/FOC_Current_Control/Inverse_Clarke_Transform as hdlsrc/hdlcoderFocMinLatency/Inverse_Clarke_Transform.vhd.
### Working on Inverse_Park_Transform_shared as hdlsrc/hdlcoderFocMinLatency/Inverse_Park_Transform_shared.vhd.
### Working on hdlcoderFocMinLatency/FOC_Current_Control/Inverse_Park_Transform as hdlsrc/hdlcoderFocMinLatency/Inverse_Park_Transform.vhd.
### Working on Park_Transform_shared as hdlsrc/hdlcoderFocMinLatency/Park_Transform_shared.vhd.
### Working on hdlcoderFocMinLatency/FOC_Current_Control/Park_Transform as hdlsrc/hdlcoderFocMinLatency/Park_Transform.vhd.
### Working on hdlcoderFocMinLatency/FOC_Current_Control/Sine_Cosine/Sine_Cosine_LUT as hdlsrc/hdlcoderFocMinLatency/Sine_Cosine_LUT.vhd.
### Working on hdlcoderFocMinLatency/FOC_Current_Control/Sine_Cosine as hdlsrc/hdlcoderFocMinLatency/Sine_Cosine.vhd.
### Working on hdlcoderFocMinLatency/FOC_Current_Control/Space_Vector_Modulation as hdlsrc/hdlcoderFocMinLatency/Space_Vector_Modulation.vhd.
### Working on FOC_Current_Control_tc as hdlsrc/hdlcoderFocMinLatency/FOC_Current_Control_tc.vhd.
### Working on FOC_Current_Control as hdlsrc/hdlcoderFocMinLatency/FOC_Current_Control.vhd.
### Generating package file hdlsrc/hdlcoderFocMinLatency/FOC_Current_Control_pkg.vhd.
### Generating HTML files for code generation report at hdlcoderFocMinLatency_codegen_rpt.html
### Creating HDL Code Generation Check Report file:///tmp/BR2020bd_1444674_32127/publish_examples0/tpb1b885ef/hdlsrc/hdlcoderFocMinLatency/FOC_Current_Control_report.html
### HDL check for 'hdlcoderFocMinLatency' complete with 0 errors, 0 warnings, and 2 messages.
### HDL code generation complete.

Notice that the 'makehdl' command has generated a message, '### Phase of output port 0:'. This message tells the user when to sample the DUT's outputs. The number of clock cycles specified here corresponds to how quickly the DUT's outputs can be sampled and, in essence, this is the latency of the design. Thus, the total latency of the design is down from a data-rate sample step of 20 $\mu s$ to 11 clock cycles, which is a few hundred nanoseconds at the 25 ns clock period.
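
The latency implied by the phase message can be worked out from values in the makehdl log above: the base rate of 2e-05, the 800x oversampling, and the reported phase of 11 clock cycles. A minimal sketch:

```matlab
% Convert the reported output phase into an absolute latency.
baseRate     = 2e-5;    % base sample time reported by makehdl (20 us)
oversampling = 800;     % oversampling factor reported by makehdl
phaseCycles  = 11;      % 'Phase of output port 0' from the log

clockPeriod   = baseRate / oversampling;      % 25 ns
outputLatency = phaseCycles * clockPeriod;    % latency with output pipelining

fprintf('Output latency: %g ns (one data-rate step is %g ns)\n', ...
        outputLatency * 1e9, baseRate * 1e9);
```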

We can review the generated model to observe that a new DUT subsystem is created whose output operates at the clock-rate, which is 25 ns.

open_system(gmHdlDut);
set_param(gmHdlModel, 'SimulationCommand', 'update');
set_param(gmHdlDut, 'ZoomFactor', 'FitSystem');

We must be careful when using this option since additional latency is introduced into the generated HDL code that was not in the original simulation model. In doing this, the sample-time of the output port has changed to the clock-rate. This introduces a possible discrepancy in results during the validation and verification flow since the test-harness expects the design to generate outputs at the data-rate. The validation model addresses this problem by inserting a down-sampling rate-transition to bring the output back to the data-rate. Thus, the validation model still compares outputs at the data-rate. The HDL testbench will, however, compare the new DUT's outputs at the clock-rate since the generated HDL outputs are emitted at the clock-rate.

Fine-tuning for performance

While this example illustrates the basic workflow to use clock-rate pipelining to minimize design latency, there are many other options available for fine-tuning HDL performance. The following are tips to leverage the feature's full potential. Note that these guidelines may not correspond to good modeling practices, but rather they are good practices for preparing your implementation model for HDL code generation and optimization.

  • Multi-rate designs: In this example, the source model is operating at a single rate, which is the data-rate. The Oversampling factor option specifies its relationship to the clock-rate. This setup works best for minimizing design latency. Clock-rate pipelining also works well in multi-rate designs by optimizing the slow-paths, but may introduce sample delays at the rate-transition boundaries. Thus, for minimizing latency, use a single-rate (the data-rate) for the whole design.

  • Clock frequency: You will notice in this design that distributed pipelining did not pipeline the whole datapath. This is because the optimization is cognizant of the consequences of retiming across certain blocks that may cause a numerical mismatch; see the distributed pipelining documentation for more details. Often, these numerical integrity issues occur at boundary conditions. If your design does not hit these boundary conditions, you can enable the Distributed pipelining priority parameter. In this case, you must go through validation to confirm that the design is working properly and is robust to all operating conditions.

  • Flatten hierarchy: You can turn on hierarchy flattening to maximize global cross-subsystem optimizations, to improve the effectiveness of clock-rate pipelining in the presence of feedback loops, or both. In particular, when there are feedback loops that cross subsystem boundaries, it is recommended to turn on hierarchy flattening in the highest-level subsystem that contains the feedback loop. However, to be effective, check that all the requirements for Hierarchy Flattening are satisfied for the lower-level subsystems.

  • Provide sufficient budget: When the total number of clock-rate pipelines applied is equal to or more than the available oversampling budget, understanding the timing impact can be hard. Therefore, provide sufficient budget, that is, a sufficiently large value of Oversampling factor, for clock-rate pipelining. The only drawback of too large an oversampling value is that the counters used by the timing controller and scheduler may be larger. The area overhead is, therefore, quite small.
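
To illustrate why the area overhead of a larger budget is small, the sketch below estimates the width of a counter that counts from 0 to N-1 for several oversampling values. It assumes a plain binary counter, which may differ from the structure HDL Coder actually generates; only the value 800 comes from this example, and the other values are hypothetical.

```matlab
% Counter width needed to count 0..N-1 for different oversampling budgets.
oversampling = [800 1600 8000];            % 800 is from this example; the rest are HYPOTHETICAL
counterBits  = ceil(log2(oversampling));   % ASSUMPTION: plain binary counter

for k = 1:numel(oversampling)
    fprintf('Oversampling %5d -> %2d-bit counter\n', oversampling(k), counterBits(k));
end
```

Under this assumption, a tenfold increase in the budget adds only a few counter bits.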

Summary

Clock-rate pipelining is a technique to optimize and pipeline slow paths in your design. Clock-rate pipelining ensures that pipelines are introduced at the clock-rate for the following HDL Coder constructs and features:

  • Pipelined math operations: Several math blocks implement a multi-cycle, pipelined HDL implementation, e.g., the Newton-Raphson method for sqrt or recip, or the CORDIC algorithm for trigonometric functions. These pipelines are introduced at the clock-rate if the block operates on a slow path.

  • Floating point mapping: As described above, floating point library mapping utilizes clock-rate pipelines when implementing floating point math.

  • Pipelining optimizations: All pipelining optimizations including input/output pipelining, adaptive pipelining and distributed pipelining use clock-rate registers on slow paths.

  • Resource sharing and streaming: Time-multiplexing of resource-shared architectures is implemented at the clock-rate.

Slow paths are identified as paths using a slower Simulink sample time or when Oversampling factor is set in the HDL Coder settings. Using clock-rate pipelining, the design's speed and area properties can be improved without compromising the design's total latency.