1 Introduction
2 Designing HDL and simulation
3 Constraints and optimization methodology
4 Application of DFT technique
5 DFT enabled circuit analysis and fault coverage
6 Timing analysis and SDC creation
7 Formal verification: gate level netlist vs. testable netlist
8 Power analysis
9 Tool automation
10 Conclusion
1 Introduction
Parallel processing algorithms often employ systolic array architectures because of their fast computation, and matrix multiplication is commonly implemented on such an architecture. In this design, 3x3 matrices are fed to the systolic processor. Let A and B be the matrices to be multiplied and C be the resultant matrix. Row elements of A and column elements of B are fed to 9 systolic processors, which calculate the resultant matrix. Row elements pass through the processors from left to right, while column elements are passed down the array. Once all matrix elements have been fed, dummy values are passed until the processor pipeline is filled with the required elements. On every clock cycle all systolic processors are active and perform the multiplication and addition operations on the matrix elements. Partial results are accumulated in a register array. Once the pipeline is completely filled, the results are available and are fed to the output registers in a row-wise manner.
The methodology followed in this assignment for the design of the Systolic Array Matrix Multiplier (SAMM) is shown in Figure (1). The algorithm used for the design, the HDL code description and the analysis of its simulation waveforms are explained in section 2. ModelSim 6.0 is used for the functional verification of the design. The RTL description is synthesized with suitable constraints using Design Compiler (DC) from Synopsys. DFT related circuits are inserted using the built-in DFT Compiler of DC, and the related TCL commands are discussed in section 4. Timing analysis of the DFT enabled netlist is carried out in section 5. Section 6 covers the static timing analysis of the DFT enabled netlist using the tool PrimeTime from Synopsys. Formal verification of the gate level netlist vs. the DFT enabled netlist, and of the RTL description vs. the DFT netlist, is carried out using the tool Formality from Synopsys in section 7. Section 8 discusses the power reports generated at different stages of this front end design flow. The last section discusses tool automation using a script file. The assignment concludes with a note on further possible work on the SAMM.
Figure (1) ASIC flow employed in SAMM
2 Designing HDL and simulation
The HDL code for the 3x3 matrix multiplication is written in Verilog and functionally simulated using ModelSim SE 6.0. The logic implemented in the code is explained in the next section.
2.1 Systolic Array Matrix Multiplication (SAMM)
The logic behind the systolic array matrix multiplier, as implemented in the HDL code, is shown in Figure (2.a-h). Each square box represents a single processing element (PE). The processing elements are arranged in a 2-D grid, and matrices A and B are fed to these PEs. The way in which the matrix elements are fed on every clock pulse is depicted in Figure (2), from (a) to (h). Individual rows of matrix A are fed to the leftmost PEs.
Figure (2.a-h) systolic array matrix multiplication [1]
Similarly, columns of matrix B are fed to the topmost PEs. In the first clock cycle only the first PE receives elements; all the remaining PEs produce zero output. In the second clock cycle the rows of matrix A are shifted from left to right to the next set of PEs. Similarly, the columns of matrix B are moved from the topmost PEs to the next set of PEs below them. Each PE multiplies its two inputs and accumulates the result. The shifting of row and column elements and the multiply-accumulate (MAC) process continue until the 7th clock pulse. On the 7th clock the resultant matrix values are available in each PE.
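The data movement described above can be illustrated with a small cycle-level model. This is only a sketch in Python, not the author's Verilog: the function name `systolic_matmul` and the register names `a_reg` and `b_reg` are hypothetical, and it assumes that row i of A and column j of B are delayed by i and j clocks respectively, with zeros fed as dummy values, which is one common way to obtain the behaviour described in the text.

```python
def systolic_matmul(a, b):
    """Cycle-by-cycle model of an n x n systolic array (assumed input skewing)."""
    n = len(a)
    acc = [[0] * n for _ in range(n)]    # accumulator inside each PE
    a_reg = [[0] * n for _ in range(n)]  # A element currently held by PE(i, j)
    b_reg = [[0] * n for _ in range(n)]  # B element currently held by PE(i, j)
    cycles = 3 * n - 2                   # 7 clocks for a 3x3 array
    for t in range(cycles):
        # Row elements move one PE to the right (update from the far end first).
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        # Column elements move one PE down.
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Feed the edges: lane m is delayed by m clocks, zero is the dummy value.
        for m in range(n):
            k = t - m
            a_reg[m][0] = a[m][k] if 0 <= k < n else 0
            b_reg[0][m] = b[k][m] if 0 <= k < n else 0
        # Every PE multiplies its two inputs and accumulates the product.
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]
C, clocks = systolic_matmul(A, B)
print(C, clocks)
```

In this model PE(i, j) sees the pair a[i][k], b[k][j] on clock i + j + k (counting from zero), so the last product of a 3x3 multiplication is accumulated on the 7th clock, in line with the description above.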
2.2 HDL implementation
Each PE is made up of a multiplier and an adder unit. Two basic modules, the multiplier (mult.v) and the adder (adder.v), are instantiated in the MAC module (mac.v). The block diagram of the MAC module is shown in Figure (3). The multiplier output is one input of the adder, and the previous output of the adder itself is the other input, as shown in Figure (3). The adder thus acts as an accumulator.
Figure (3) MAC block diagram
Memories are used to store and shift the row elements (a_row0, a_row1, a_row2) and the column elements (b_col0, b_col1, b_col2). The block diagram representation of the HDL code, with the variables and naming conventions used, is shown in Figure (4). On the first clock pulse, 3 row elements and 3 column elements are stored in the 0th location of the memories (aa_row0[0], aa_row1[0] and aa_row2[0] for the rows). On the next clock pulse the data in the 0th location is shifted to the first location (aa_row0[1], aa_row1[1] and aa_row2[1]) and the data in the first location to the second (aa_row0[2], aa_row1[2] and aa_row2[2]), while the 0th location receives the new matrix element. The MAC of each PE takes its matrix elements from these memories. The leftmost column of PEs takes its row elements from the 0th location of the corresponding memories, the middle set of PEs from the 1st location and the rightmost set from the 2nd location.
Figure (4) HDL implementation block diagram
Similarly, the topmost PEs take the 0th location elements (bb_col0[0], bb_1[0], bb_col2[0]) from the column storage memories; the middle set receives elements from the 1st location (bb_col0[1], bb_1[1], bb_col2[1]) and the bottom set of PEs takes the 2nd location elements (bb_col0[2], bb_1[2], bb_col2[2]). The shifting of these matrix elements can be observed in the highlighted area of the simulation waveforms shown in Figure (5). On every clock pulse the output registers (out_reg00, out_reg01, out_reg02, out_reg10, out_reg11, out_reg12, out_reg20, out_reg21, out_reg22) are updated with the new values of the MAC outputs (see Figure (5)). Since the MAC and the output registers add a total delay of 3 clock cycles to the 7 clocks of the array, the output is available after the 10th cycle (i.e. on the 11th +ve edge of the clock).
Figure (5) SAMM simulation waveforms
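The shifting of one storage memory can be sketched as follows. This is a hedged Python illustration rather than the Verilog used in the design: the class `ShiftMemory` is hypothetical, and only the three-location behaviour of a memory such as aa_row0 and the 7 + 3 clock latency are taken from the text.

```python
from collections import deque


class ShiftMemory:
    """Three-location shift memory, modelled on aa_row0[0..2] in the text."""

    def __init__(self, depth=3):
        # Location 0 is the left end of the deque; all locations start at zero.
        self.loc = deque([0] * depth, maxlen=depth)

    def clock(self, new_element):
        # A new element enters location 0, older elements move up by one,
        # and the element in the last location is dropped (maxlen is reached).
        self.loc.appendleft(new_element)
        return list(self.loc)


aa_row0 = ShiftMemory()
# Feed three row elements followed by one dummy value (zero).
history = [aa_row0.clock(x) for x in (5, 6, 7, 0)]

ARRAY_CLOCKS = 7     # clocks until all products are accumulated
REGISTER_DELAY = 3   # MAC and output register delay stated in the text
output_ready_after = ARRAY_CLOCKS + REGISTER_DELAY
print(history, output_ready_after)
```

Location 0 feeds the first set of PEs, location 1 the middle set and location 2 the last set, so each element reaches the next set of PEs one clock later.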
3 Constraints and optimization methodology
3.1 Constraints
Three types of constraints can be set for a design in Design Compiler (DC): design rule (DRC) constraints, optimization constraints and environmental constraints. DRC constraints are defined in the technology library and cannot be relaxed; a design can only select values that the library permits. The DRC constraints are set_max_fanout, set_max_transition and set_max_capacitance. If they are not specified, the default values from the library are used. No DRC constraints are specified for SAMM.
Three kinds of optimization are possible: area, power and timing, and there are optimization constraints for each of them. set_max_area is the area constraint; since area is not a concern here, it is not specified for SAMM. DC carries out only a basic level of power optimization, because its primary target is to meet the timing constraints. set_max_leakage_power and set_max_dynamic_power are the two power constraints that can be provided to DC. Both DRC and optimization constraints are evaluated under the environmental constraints, which cover the operating conditions and the wire load model. The corresponding commands are set_operating_conditions, set_wire_load_model and set_wire_load_mode. None of these are specified for SAMM, so DC uses the enclosed wire load mode by default.
Synthesis is a timing-driven process, but no timing specification is given for SAMM. To extract a feasible clock value, the derive_timing_constraints command is used, which gives a clock period of 2 ns. A clock period of 4.75 ns is then chosen experimentally, since this value satisfies the slack requirement of the DFT enabled circuit.
The timing constraints provided for SAMM are listed below:
- create_clock -period 4.75 clock: sets the clock period constraint to 4.75 ns (about 210 MHz).
- set_clock_uncertainty -setup 0.475 clock: -ve clock skew can lead to setup violations. A possible value of -ve skew is given to DC so that it can account for it. Setup uncertainty is generally taken as 10% of the clock period.
- set_clock_uncertainty -hold 0.27 clock: +ve clock skew can lead to hold violations. A possible value of +ve skew is given to DC so that it can account for it. Hold uncertainty is generally taken as about 5% of the clock period.
- set_clock_latency 0.45 clock: gives DC the expected network latency of the clock.
- set_clock_latency -source 0.45 clock: a source latency of 0.45 ns is selected.
- set_clock_transition 0.04 clock: a clock transition time of 0.04 ns is modeled.
- set_input_delay 0.40 [all_inputs]: an input delay of 0.4 ns is set on all inputs.
- remove_input_delay [get_ports clock]: constraining the clock port with an input delay leads to wrong timing analysis. This command excludes the clock port from the input delay.
- set_output_delay 0.40 [all_outputs]: an output delay of 0.4 ns is set on all outputs. Since all outputs are registered, this delay does not affect the timing analysis of the current SAMM design.
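The numbers in the list above can be cross-checked with a few lines of arithmetic. The following Python sketch uses only the values quoted in the constraints; the helper names `frequency_mhz` and `percent_of_period` are hypothetical, and nanoseconds are assumed as the time unit.

```python
PERIOD_NS = 4.75             # clock period constraint
SETUP_UNCERTAINTY_NS = 0.475
HOLD_UNCERTAINTY_NS = 0.27


def frequency_mhz(period_ns):
    """Clock frequency in MHz for a period given in nanoseconds."""
    return 1000.0 / period_ns


def percent_of_period(value_ns, period_ns):
    """Express a timing value as a percentage of the clock period."""
    return 100.0 * value_ns / period_ns


clock_mhz = frequency_mhz(PERIOD_NS)
setup_pct = percent_of_period(SETUP_UNCERTAINTY_NS, PERIOD_NS)
hold_pct = percent_of_period(HOLD_UNCERTAINTY_NS, PERIOD_NS)
print(f"{clock_mhz:.1f} MHz, setup {setup_pct:.1f}%, hold {hold_pct:.1f}%")
```

The setup uncertainty is exactly 10% of the period, while the chosen hold uncertainty of 0.27 ns is slightly above the 5% rule of thumb.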
The I/O ports of the design become pads of the IC. The tool has to be informed of this so that it analyzes delay, area and power appropriately. This is done using the command set_port_is_pad, which marks the I/O ports as pads, and insert_pads, which inserts the pad cells.
3.2 Optimization methodology
DC uses cost functions to optimize the design. It calculates the cost functions from the design constraints and the DRCs [2]. Optimization also depends upon the compilation strategy adopted. Four compilation strategies are recommended for DC: the top-down hierarchical compile method, the time budget compile method, the compile-characterize-write script-recompile (CCWSR) method and the design budgeting method [3]. In this assignment the top-down hierarchical compile strategy is adopted.
In this method the source is compiled by reading the entire design. Constraints and attributes are applied based on the design specification, and only top level constraints are needed. Even though the design has several sub-modules, only one set of constraints is applied. Because the entire design is optimized at once, this yields better results. The method works well here because the SAMM design does not have multiple clocks or generated clocks. To get better results, compile -map_effort high is used. This option makes DC maximize its effort to meet the specified constraints. Once the compilation is completed, several design related reports can be obtained. The most important of them are report_timing, report_area and report_power.
The area report is shown below. Since pads are inserted, it reports a larger area requirement for the design. The combinational area shown includes the pad area.
****************************************
Report : area
Design : sam3
Version: V-2004.06-SP1
Date : Mon Apr 23 16:25:56 2007
****************************************
Library(s) Used:
cb13io320_tsmc_max (File: /home/Master_Files/Libraries/cb13io320_tsmc_max.db)
cb13fs120_tsmc_max (File: /home/Master_Files/Libraries/cb13fs120_tsmc_max.db)
Number of ports: 58
Number of nets: 632
Number of cells: 523
Number of references: 26
Combinational area: 200895.812500
Noncombinational area: 2516.500000
Net Interconnect area: 955.047913
Total cell area: 203412.218750
Total area: 204367.359375
The total number of ports reported is 58. This covers the 6 inputs for matrices A and B, each 4 bits wide; the 3 outputs of the resultant matrix, each 10 bits wide; and the control signals mult_over, reset, clock and en.
The timing report for the gate level netlist of the design is shown below. The critical path has start point reset and endpoint out_reg00. (In the PrimeTime analysis a total of 7 critical paths are found; all start at reset and end at output registers.) The design has only one clock group.
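The port count, and the choice of a 10 bit output for 4 bit inputs, can be checked with a short calculation. This Python sketch assumes unsigned 4 bit matrix elements, which the text does not state explicitly; the variable names are illustrative.

```python
INPUT_BUSES, INPUT_WIDTH = 6, 4      # a_row0..2 and b_col0..2
OUTPUT_BUSES, OUTPUT_WIDTH = 3, 10   # c_row0..2
CONTROL_PORTS = ("mult_over", "reset", "clock", "en")

total_ports = (INPUT_BUSES * INPUT_WIDTH
               + OUTPUT_BUSES * OUTPUT_WIDTH
               + len(CONTROL_PORTS))

# Assumption: elements are unsigned, so the largest 4 bit value is 15.
max_element = 2 ** INPUT_WIDTH - 1
# Each result element is the sum of 3 products in a 3x3 multiplication.
max_result = 3 * max_element * max_element
bits_needed = max_result.bit_length()

print(total_ports, max_result, bits_needed)
```

Under that assumption the largest possible result element fits in the 10 bit outputs without overflow.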
****************************************
Report : timing
-path full
-delay max
-max_paths 1
Design : sam3
Version: V-2004.06-SP1
Date : Mon Apr 23 16:25:57 2007
****************************************
Operating Conditions: cb13fs120_tsmc_max Library: cb13fs120_tsmc_max
Wire Load Model Mode: enclosed
Startpoint: reset (input port)
Endpoint: out_reg00_reg[2]
(rising edge-triggered flip-flop clocked by clock)
Path Group: clock
Path Type: max
Des/Clust/Port Wire Load Model Library
------------------------------------------------
sam3 280000 cb13fs120_tsmc_max
Point Incr Path
-----------------------------------------------------------
clock (input port clock) (rise edge) 0.00 0.00
input external delay 0.40 0.40 r
reset (in) 0.00 0.40 r
U638/CIN (pc3d11) 0.95 1.35 f
U327/ZN (inv0d1) 0.25 1.59 r
U362/ZN (nd12d0) 0.16 1.76 f
U361/ZN (inv0d1) 0.07 1.82 r
U360/ZN (inv0d1) 0.10 1.92 f
U352/ZN (inv0d1) 0.06 1.98 r
U351/ZN (inv0d1) 0.10 2.08 f
U342/ZN (inv0d1) 0.23 2.31 r
U375/ZN (nr02d0) 0.53 2.84 f
U370/ZN (inv0d1) 0.31 3.15 r
U346/ZN (inv0d1) 0.14 3.29 f
U345/ZN (inv0d1) 0.10 3.39 r
U326/Z (buffd1) 0.12 3.51 r
U328/ZN (inv0d1) 0.30 3.80 f
U431/Z (aor22d1) 0.28 4.08 f
out_reg00_reg[2]/D (dfnrq1) 0.00 4.08 f
data arrival time 4.08
clock clock (rise edge) 4.75 4.75
clock network delay (ideal) 0.90 5.65
clock uncertainty -0.47 5.18
out_reg00_reg[2]/CP (dfnrq1) 0.00 5.18 r
library setup time -0.07 5.10
data required time 5.10
-----------------------------------------------------------
data required time 5.10
data arrival time -4.08
-----------------------------------------------------------
slack (MET) 1.02
Since the default wire load mode is "enclosed", DC uses, for each net, the wire load model of the smallest block that fully encloses that net in its delay calculations. Changing the wire load mode to "segmented" may give a more accurate timing analysis. A positive slack of 1.02 ns is obtained, which is more than 15% of the clock period.
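The slack in the report above follows directly from the constraints. The sketch below, in Python, recomputes it from the period, the total clock latency (0.45 ns network plus 0.45 ns source), the setup uncertainty and the library setup time; the function name `setup_slack` is hypothetical and the arrival time is copied from the report.

```python
def setup_slack(period, clock_latency, setup_uncertainty, library_setup, arrival):
    """Setup slack = data required time - data arrival time (all in ns)."""
    required = period + clock_latency - setup_uncertainty - library_setup
    return required - arrival


gate_level_slack = setup_slack(
    period=4.75,
    clock_latency=0.90,       # ideal clock network delay in the report
    setup_uncertainty=0.475,
    library_setup=0.07,
    arrival=4.08,
)
slack_percent = 100.0 * gate_level_slack / 4.75
print(round(gate_level_slack, 3), round(slack_percent, 1))
```

The result agrees with the reported slack to within the rounding of the two-decimal values printed by DC, and it corresponds to a little over 20% of the clock period.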
4 Application of DFT technique
DFT techniques provide controllability and observability. Automatic Test Pattern Generation (ATPG) is used for combinational designs and DFT is used for sequential circuits. Two methodologies are mainly followed in the industry: scan DFT and Built In Self Test (BIST). Scan DFT has two sub-categories, known as full scan and partial scan. Generally tools support only full scan DFT. The built-in DFT Compiler of DC provides multiplexed full scan DFT. Scan elements can be inserted into the design in two ways. The first is to compile the design and then insert the scan elements. The second is to insert the scan elements into the design and then compile. For the SAMM the first method is followed.
The dft_drc command is used to check whether the design has any DFT rule violations. The dft_drc check can be enabled with the commands set hdlin_enable_rtldrc_info true and set test_enable_dft_drc true. A test clock for DFT is generated using the command create_test_clock clock -waveform {25 50}. The period of the test clock is always greater than the original clock period. The total number of scan paths and the methodology used are specified with the command create_test_protocol. The longest chain length in a scan path is not specified, since, as per the power report, this constraint increases power consumption. The insert_scan or insert_dft command converts all flip-flops to scan registers. The log data for these commands is shown below.
Information: Using default scan style 'multiplexed_flip_flop'. (TESTDB-279)
Loading db file '/home/Master_Files/Libraries/gtech.db'
Target library 'gtech'
Loading design ...
Starting rtldrc ...
Initializing rtldrc ...
Starting rule checks ...
Information: Scan style is 'multiplexed_flip_flop'. (TEST-1212)
Information: Starting test protocol creation. (TEST-219)
...reading user specified clock signals...
Information: Identified system/test clock port clock (25.0,50.0). (TEST-265)
...reading user specified asynchronous signals...
Loading test protocol
Loading target library 'cb13fs120_tsmc_max'
Loading target library 'cb13io320_tsmc_max'
Warning: IO pad 'pc3d10' is unusable: unknown logic function. (OPT-1022)
Warning: IO pad 'pc3d00' is unusable: unknown logic function. (OPT-1022)
Loading design 'sam3'
Pre-DFT DRC enabled
Information: Starting test design rule checking. (TEST-222)
...basic checks...
...basic sequential cell checks...
...checking for scan equivalents...
...checking vector rules...
...checking pre-dft rules...
-----------------------------------------------------------------
DRC Report
Total violations: 0
-----------------------------------------------------------------
Test Design rule checking did not find violations
-----------------------------------------------------------------
Sequential Cell Report
0 out of 467 sequential cells have violations
-----------------------------------------------------------------
SEQUENTIAL CELLS WITHOUT VIOLATIONS
* 467 cells are valid scan cells
From the above log data it can be observed that there are no DFT violations in the design. By default the tool uses the multiplexed flip-flop scan style, as reported by the TESTDB-279 and TEST-1212 messages in the log.
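The multiplexed flip-flop scan style named in the log can be pictured with a small behavioural model. This is a generic Python sketch of a mux-D scan chain, not the netlist produced by DFT Compiler: the class `ScanChain` is hypothetical, and the signal names test_se and test_si are borrowed from the ports that appear later in the document.

```python
class ScanChain:
    """Behavioural model of a chain of multiplexed scan flip-flops."""

    def __init__(self, length):
        self.q = [0] * length  # current state of every scan cell

    @property
    def scan_out(self):
        # The scan output is the Q pin of the last cell in the chain.
        return self.q[-1]

    def clock(self, d=None, test_se=0, test_si=0):
        if test_se:
            # Shift mode: each cell loads its predecessor, cell 0 loads test_si.
            self.q = [test_si] + self.q[:-1]
        else:
            # Functional (capture) mode: each cell loads its normal D input.
            self.q = list(d)


chain = ScanChain(3)
for bit in (1, 0, 1):              # shift a test pattern in
    chain.clock(test_se=1, test_si=bit)
loaded = list(chain.q)

chain.clock(d=[0, 1, 1], test_se=0)  # one capture clock in functional mode

observed = []
for _ in range(3):                 # shift the captured response out
    observed.append(chain.scan_out)
    chain.clock(test_se=1, test_si=0)
print(loaded, observed)
```

With test_se held at 0 the cells behave like ordinary flip-flops, which is why the scan path can be excluded from the analysis of normal operation.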
5 DFT enabled circuit analysis and fault coverage
In the normal mode of operation the DFT circuits are not used. A direct timing analysis of the DFT enabled netlist nevertheless includes the DFT related circuits and therefore gives misleading results. To overcome this, the tool has to be instructed to exclude the DFT circuitry from timing analysis. In the SAMM design the scan enable signal causes a very large -ve slack (-105) to be predicted. This problem is tackled using the command set_case_analysis 0 test_se, which tells the tool to treat the scan enable signal test_se as constant 0 during normal timing analysis. The corrected timing report obtained after executing this command is shown below.
Startpoint: reset (input port)
Endpoint: out_reg00_reg[2]
(rising edge-triggered flip-flop clocked by clock)
Path Group: clock
Path Type: max
Des/Clust/Port Wire Load Model Library
------------------------------------------------
sam3 280000 cb13fs120_tsmc_max
Point Incr Path
-----------------------------------------------------------
clock (input port clock) (rise edge) 0.00 0.00
input external delay 0.40 0.40 r
reset (in) 0.00 0.40 r
U638/CIN (pc3d11) 0.95 1.35 f
U327/ZN (inv0d1) 0.25 1.59 r
………………………………… … ……………
out_reg00_reg[2]/D (sdnrq1) 0.00 4.19 f
data arrival time 4.19
clock clock (rise edge) 4.75 4.75
clock network delay (ideal) 0.90 5.65
clock uncertainty -0.47 5.18
out_reg00_reg[2]/CP (sdnrq1) 0.00 5.18 r
library setup time -0.19 4.98
data required time 4.98
-----------------------------------------------------------
data required time 4.98
data arrival time -4.19
-----------------------------------------------------------
slack (MET) 0.80
From the timing report it can be observed that the slack of the DFT enabled circuit has reduced. The power report, which is discussed in section 8, shows higher power consumption for the DFT enabled circuit, and DFT naturally causes an area overhead as well. From the previous section we have seen that DFT improves testability. The trade-off between testability on the one hand and area and power on the other is thus a challenging issue for present ASIC designers.
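The source of the slack reduction can be read off by comparing the two DC timing reports. The following Python sketch uses only the values printed in the gate level and DFT enabled reports; the dictionary names are illustrative.

```python
# Values (in ns) copied from the two DC timing reports for out_reg00_reg[2].
gate = {"arrival": 4.08, "library_setup": 0.07, "slack": 1.02}
scan = {"arrival": 4.19, "library_setup": 0.19, "slack": 0.80}

arrival_increase = round(scan["arrival"] - gate["arrival"], 2)
setup_increase = round(scan["library_setup"] - gate["library_setup"], 2)
slack_loss = round(gate["slack"] - scan["slack"], 2)

print(arrival_increase, setup_increase, slack_loss)
```

The later data arrival and the larger library setup time of the scan cell (sdnrq1 instead of dfnrq1) together account for the slack reduction, up to the rounding of the reported values.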
Fault coverage of the DFT enabled design can be estimated using command estimate_test_coverage. The corresponding log data is shown below.
Starting test coverage estimation ...
12190 faults were added to fault list.
Uncollapsed Stuck Fault Summary Report
-----------------------------------------------
fault class code #faults
------------------------------ ---- ---------
Detected DT 12154
Possibly detected PT 0
Undetectable UD 36
ATPG untestable AU 0
Not detected ND 0
-----------------------------------------------
total faults 12190
test coverage 100.00%
-----------------------------------------------
Information: The test coverage above may be inferior
than the real test coverage with customized
protocol and test simulation library.
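From the above data it can be observed that a total of 12190 faults are possible. Of these, 12154 are detected and the remaining 36 are undetectable, which gives a test coverage of 100% over the detectable faults. The tool notes, however, that this figure is only an estimate and may differ from the real test coverage obtained with a customized protocol and test simulation library.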
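The coverage figure in the log can be reproduced from the fault counts. The Python sketch below assumes the usual definition in which undetectable faults are removed from the denominator of test coverage but not of fault coverage; the function names and the 0.5 credit for possibly detected faults are assumptions, and the credit has no effect here because no such faults are reported.

```python
def estimate_coverage(detected, possibly_detected, total, undetectable,
                      pt_credit=0.5):
    """Test coverage in percent: undetectable faults are excluded (assumed)."""
    testable = total - undetectable
    return 100.0 * (detected + pt_credit * possibly_detected) / testable


def fault_coverage(detected, total):
    """Fault coverage in percent over every fault in the list."""
    return 100.0 * detected / total


coverage = estimate_coverage(detected=12154, possibly_detected=0,
                             total=12190, undetectable=36)
all_faults_coverage = fault_coverage(detected=12154, total=12190)
print(round(coverage, 2), round(all_faults_coverage, 2))
```

Measured against all 12190 faults, including the 36 undetectable ones, the detected fraction is slightly below 100%.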
6 Timing analysis and SDC creation
The DFT enabled netlist is taken to the static timing analysis tool PrimeTime (PT). The target library of DC for the SAMM design now becomes the link library for PT. Since the DFT enabled netlist is stored in .db format, all constraints are also available in the netlist. The detailed timing information generated through the PT GUI is shown below.
****************************************
Report : timing
-path full
-delay max_fall
-input_pins
-nets
-max_paths 1
-transition_time
-capacitance
-crosstalk_delta
-trace_latch_borrow
Design : sam3
Version: X-2005.12-SP2
Date : Mon Apr 23 17:04:08 2007
****************************************
Startpoint: reset (input port)
Endpoint: out_reg00_reg[2]
(rising edge-triggered flip-flop clocked by clock)
Path Group: clock
Path Type: max
No time is borrowed from the startpoint of this path.
Startpoint: reset (input port)
Endpoint: out_reg00_reg[2]
(rising edge-triggered flip-flop clocked by clock)
Point Fanout Cap DTrans Trans Delta Incr Path
-----------------------------------------------------------------------------------------
clock (input port clock) (rise edge) 0.00 0.00
input external delay 0.40 0.40 r
reset (in) 0.00 0.00 0.40 r
reset (net) 1 8.24
U638/PAD (pc3d11) <- 0.00 0.00 0.00 0.04 0.44 r
U638/CIN (pc3d11) <- 0.16 0.91 1.35 f
n318 (net) 1 0.01
U327/I (inv0d1) <- 0.00 0.16 0.00 0.00 1.35 f
U327/ZN (inv0d1) <- 0.50 0.25 1.59 r
n114 (net) 11 0.07
U362/A2 (nd12d0) <- 0.00 0.50 0.00 0.01 1.60 r
U362/ZN (nd12d0) <- 0.22 0.15 1.76 f
n183 (net) 1 0.01
U361/I (inv0d1) <- 0.00 0.22 0.00 0.00 1.76 f
U361/ZN (inv0d1) <- 0.10 0.07 1.82 r
n178 (net) 1 0.01
U360/I (inv0d1) <- 0.00 0.10 0.00 0.00 1.82 r
U360/ZN (inv0d1) <- 0.16 0.10 1.92 f
n250 (net) 3 0.02
U352/I (inv0d1) <- 0.00 0.16 0.00 0.00 1.92 f
U352/ZN (inv0d1) <- 0.09 0.06 1.98 r
n252 (net) 1 0.01
U351/I (inv0d1) <- 0.00 0.09 0.00 0.00 1.98 r
U351/ZN (inv0d1) <- 0.16 0.10 2.08 f
n221 (net) 3 0.02
U342/I (inv0d1) <- 0.00 0.16 0.00 0.00 2.08 f
U342/ZN (inv0d1) <- 0.45 0.23 2.31 r
n223 (net) 11 0.06
U375/A1 (nr02d0) <- 0.00 0.45 0.00 0.01 2.32 r
U375/ZN (nr02d0) <- 1.00 0.54 2.86 f
n218 (net) 10 0.07
U370/I (inv0d1) <- 0.00 1.00 0.00 0.01 2.86 f
U370/ZN (inv0d1) <- 0.43 0.31 3.17 r
n254 (net) 7 0.04
U346/I (inv0d1) <- 0.00 0.43 0.00 0.00 3.18 r
U346/ZN (inv0d1) <- 0.19 0.15 3.33 f
n179 (net) 2 0.01
U345/I (inv0d1) <- 0.00 0.19 0.00 0.00 3.33 f
U345/ZN (inv0d1) <- 0.15 0.10 3.43 r
n257 (net) 3 0.02
U326/I (buffd1) <- 0.00 0.15 0.00 0.00 3.43 r
U326/Z (buffd1) <- 0.10 0.12 3.55 r
n217 (net) 2 0.01
U328/I (inv0d1) <- 0.00 0.10 0.00 0.00 3.55 r
U328/ZN (inv0d1) <- 0.68 0.34 3.88 f
n237 (net) 11 0.08
U431/A2 (aor22d1) <- 0.00 0.68 0.00 0.01 3.89 f
U431/Z (aor22d1) <- 0.11 0.29 4.19 f
N180 (net) 1 0.01
out_reg00_reg[2]/D (sdnrq1) 0.00 0.11 0.00 0.00 4.19 f
data arrival time 4.19
clock clock (rise edge) 0.04 4.75 4.75
clock network delay (ideal) 0.90 5.65
clock uncertainty -0.47 5.18
out_reg00_reg[2]/CP (sdnrq1) 5.18 r
library setup time -0.19 4.98
data required time 4.98
-----------------------------------------------------------------------------------------
data required time 4.98
data arrival time -4.19
-----------------------------------------------------------------------------------------
slack (MET) 0.80
The discussion of timing analysis in the previous sections also applies here. From the above report it can be observed that the slack has not changed from the DC timing report. The critical path is the timing path between reset and the output register out_reg00. The timing report also gives the fan-out and capacitance values of the related cells and nets. The net capacitance histogram is shown in Figure (6), and the end point and path slack histograms are shown in Figure (7). A total of 26 nets have the worst capacitance value of 8.24133. Two nets have capacitance ranging from 5.22 to 7.32. A maximum of 1775 nets have the best (i.e. lowest) capacitance value of 0.0013356. The path slack histogram shows the slack distribution over different timing paths. A total of 18 paths have the best slack of 0.882873, and a total of 82 paths have slack ranging from 0.792 to 0.828. The endpoint slack is distributed over the whole range from 0.795198 to 3.35118.
Figure (6) net capacitance histogram
Figure (7) end point and path slack histograms
SDC File: The SDC file is generated from PrimeTime using the command write_sdc and is shown below. It contains all the constraints which are used in the design and which were discussed in the previous sections. Similarly, an .sdf file can be generated using the command write_sdf; it holds the delay and timing check information pertaining to cells and nets.
###############################################################################
# Created by PrimeTime write_sdc on Mon Apr 23 16:32:35 2007
###############################################################################
set sdc_version 1.5
###############################################################################
# Units
#capacitive_load_unit : 1 pF
# current_unit : 1e-06 A
# resistance_unit : 1 kOhm
# time_unit : 1 ns
# voltage_unit : 1 V
############################################################################
set_operating_conditions -library [get_libs \
{cb13fs120_tsmc_max.db:cb13fs120_tsmc_max}]
###############################################################################
# Clock Related Information
###############################################################################
create_clock -name clock -period 4.75 -waveform { 0 2.375 } [get_ports {clock}]
set_clock_latency -min 0.45 [get_clocks {clock}]
set_clock_latency -max 0.45 [get_clocks {clock}]
set_clock_latency -source -min 0.45 [get_clocks {clock}]
set_clock_latency -source -max 0.45 [get_clocks {clock}]
set_clock_uncertainty -setup 0.475 [get_clocks {clock}]
set_clock_uncertainty -hold 0.27 [get_clocks {clock}]
set_clock_transition -rise -max 0.04 [get_clocks {clock}]
set_clock_transition -fall -max 0.04 [get_clocks {clock}]
set_clock_transition -rise -min 0.04 [get_clocks {clock}]
set_clock_transition -fall -min 0.04 [get_clocks {clock}]
###############################################################################
# External Delay Information
###############################################################################
set_input_delay 0.4 [get_ports {{a_row0[3]}}]
set_input_delay 0.4 [get_ports {{a_row0[2]}}]
set_input_delay 0.4 [get_ports {{a_row0[1]}}]
set_input_delay 0.4 [get_ports {{a_row0[0]}}]
set_input_delay 0.4 [get_ports {{a_row1[3]}}]
set_input_delay 0.4 [get_ports {{a_row1[2]}}]
set_input_delay 0.4 [get_ports {{a_row1[1]}}]
set_input_delay 0.4 [get_ports {{a_row1[0]}}]
set_input_delay 0.4 [get_ports {{a_row2[3]}}]
set_input_delay 0.4 [get_ports {{a_row2[2]}}]
set_input_delay 0.4 [get_ports {{a_row2[1]}}]
set_input_delay 0.4 [get_ports {{a_row2[0]}}]
set_input_delay 0.4 [get_ports {{b_col0[3]}}]
set_input_delay 0.4 [get_ports {{b_col0[2]}}]
set_input_delay 0.4 [get_ports {{b_col0[1]}}]
set_input_delay 0.4 [get_ports {{b_col0[0]}}]
set_input_delay 0.4 [get_ports {{b_col1[3]}}]
set_input_delay 0.4 [get_ports {{b_col1[2]}}]
set_input_delay 0.4 [get_ports {{b_col1[1]}}]
set_input_delay 0.4 [get_ports {{b_col1[0]}}]
set_input_delay 0.4 [get_ports {{b_col2[3]}}]
set_input_delay 0.4 [get_ports {{b_col2[2]}}]
set_input_delay 0.4 [get_ports {{b_col2[1]}}]
set_input_delay 0.4 [get_ports {{b_col2[0]}}]
set_output_delay 0.4 [get_ports {{c_row0[9]}}]
set_output_delay 0.4 [get_ports {{c_row0[8]}}]
set_output_delay 0.4 [get_ports {{c_row0[7]}}]
set_output_delay 0.4 [get_ports {{c_row0[6]}}]
set_output_delay 0.4 [get_ports {{c_row0[5]}}]
set_output_delay 0.4 [get_ports {{c_row0[4]}}]
set_output_delay 0.4 [get_ports {{c_row0[3]}}]
set_output_delay 0.4 [get_ports {{c_row0[2]}}]
set_output_delay 0.4 [get_ports {{c_row0[1]}}]
set_output_delay 0.4 [get_ports {{c_row0[0]}}]
set_output_delay 0.4 [get_ports {{c_row1[9]}}]
set_output_delay 0.4 [get_ports {{c_row1[8]}}]
set_output_delay 0.4 [get_ports {{c_row1[7]}}]
set_output_delay 0.4 [get_ports {{c_row1[6]}}]
set_output_delay 0.4 [get_ports {{c_row1[5]}}]
set_output_delay 0.4 [get_ports {{c_row1[4]}}]
set_output_delay 0.4 [get_ports {{c_row1[3]}}]
set_output_delay 0.4 [get_ports {{c_row1[2]}}]
set_output_delay 0.4 [get_ports {{c_row1[1]}}]
set_output_delay 0.4 [get_ports {{c_row1[0]}}]
set_output_delay 0.4 [get_ports {{c_row2[9]}}]
set_output_delay 0.4 [get_ports {{c_row2[8]}}]
set_output_delay 0.4 [get_ports {{c_row2[7]}}]
set_output_delay 0.4 [get_ports {{c_row2[6]}}]
set_output_delay 0.4 [get_ports {{c_row2[5]}}]
set_output_delay 0.4 [get_ports {{c_row2[4]}}]
set_output_delay 0.4 [get_ports {{c_row2[3]}}]
set_output_delay 0.4 [get_ports {{c_row2[2]}}]
set_output_delay 0.4 [get_ports {{c_row2[1]}}]
set_output_delay 0.4 [get_ports {{c_row2[0]}}]
set_input_delay 0.4 [get_ports {en}]
set_input_delay 0.4 [get_ports {reset}]
set_output_delay 0.4 [get_ports {mult_over}]
set_case_analysis 0 [get_ports {test_se}]
set_wire_load_mode enclosed
One important point which can be observed from the above .sdc file is that it is more structured than the constraint file generated by DC. The .sdc file generated from PrimeTime is a de facto standard in the ASIC design industry.
7 Formal verification: gate level netlist vs. testable netlist
Formality is the tool used to formally verify the design. The SAMM design is verified in two ways. First, the gate level netlist and the testable netlist are formally verified against each other: the gate level netlist in .db format is taken as the reference and the testable netlist in .db format as the implementation. Matching these two netlists generates the log data below:
*********************************** Matching Results ***********************************
498 Compare points matched by name
0 Compare points matched by signature analysis
0 Compare points matched by topology
27 Matched primary inputs, black-box outputs
0(1) Unmatched reference(implementation) compare points
0(2) Unmatched reference(implementation) primary inputs, black-box outputs
----------------------------------------------------------------------------------------
Unmatched Objects REF IMPL
----------------------------------------------------------------------------------------
Input ports (Port) 0 2
Output ports (Port) 0 1
****************************************************************************************
There are two unmatched input ports in the implementation. These ports are test_se and test_si, which belong to the DFT circuit. Since the DFT circuit plays no part in normal operation, the tool has to be instructed not to match these ports. This is done through the setup option in the GUI by setting those two signals to constant 0, as shown below.
Formality (match)> setup
1
Formality (setup)> set_constant -type port i:/WORK/sam3/test_se 0
Set 'i:/WORK/sam3/test_se' to constant 0
1
Formality (setup)> set_constant -type port i:/WORK/sam3/test_si 0
Set 'i:/WORK/sam3/test_si' to constant 0
1
After these options are set, the designs match and the verification succeeds. The corresponding log data is shown below.
Status: Verifying...
********************************* Verification Results *********************************
Verification SUCCEEDED
----------------------
Reference design: r:/WORK/sam3
Implementation design: i:/WORK/sam3
480 Passing compare points
----------------------------------------------------------------------------------------
Matched Compare Points BBPin Loop BBNet Cut Port DFF LAT TOTAL
----------------------------------------------------------------------------------------
Passing (equivalent) 0 0 0 0 31 449 0 480
Failing (not equivalent) 0 0 0 0 0 0 0 0
Not Compared
Constant reg 18 0 18
****************************************************************************************
1
Similarly, the original Verilog code and the testable netlist are verified against each other.
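The counts in the matching and verification logs above can be cross-checked against each other and against the earlier DFT report. The following Python sketch only adds up numbers quoted in this document; the reading that the 31 port compare points are the 30 result bits plus mult_over is an assumption, not something the Formality log states.

```python
# Numbers copied from the Formality logs and the DFT sequential cell report.
matched_compare_points = 498
passing_ports, passing_dffs = 31, 449
passing_total = 480
constant_registers = 18          # listed as "Not Compared"
scan_cells_in_dft_report = 467

checks = {
    "ports + DFFs = passing points":
        passing_ports + passing_dffs == passing_total,
    "passing + constant regs = matched points":
        passing_total + constant_registers == matched_compare_points,
    "DFFs + constant regs = sequential cells":
        passing_dffs + constant_registers == scan_cells_in_dft_report,
    # Assumption: port compare points are 3 x 10 result bits plus mult_over.
    "output bits = port compare points":
        3 * 10 + 1 == passing_ports,
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```

All four relations hold, which suggests that the 18 registers excluded from comparison are the only difference between the 498 matched points and the 480 verified ones.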
8 Power analysis
Stand-alone power analysis tools are necessary to analyze the power requirement of a design accurately. Nevertheless, DC performs a moderate level of power analysis. The effort can be set to high using the command report_power -analysis_effort high. The power report of the gate level netlist with low effort is shown below. The operating conditions and wire load models set for the design affect the power analysis. In this case the default conditions are used.
****************************************
Report : power
-analysis_effort low
Design : sam3
Version: V-2004.06-SP1
Date : Mon Apr 23 16:25:57 2007
****************************************
Library(s) Used:
cb13io320_tsmc_max (File: /home/Master_Files/Libraries/cb13io320_tsmc_max.db)
cb13fs120_tsmc_max (File: /home/Master_Files/Libraries/cb13fs120_tsmc_max.db)
Operating Conditions: cb13fs120_tsmc_max Library: cb13fs120_tsmc_max
Wire Load Model Mode: enclosed
Design Wire Load Model Library
------------------------------------------------
sam3 280000 cb13fs120_tsmc_max
mac_8 8000 cb13fs120_tsmc_max
mult_8 ForQA cb13fs120_tsmc_max
mult_8_DW02_mult_4_4_0 ForQA cb13fs120_tsmc_max
………. ….. ………..
………. ….. ……….
adder_0 ForQA cb13fs120_tsmc_max
adder_0_DW01_add_10_0 ForQA cb13fs120_tsmc_max
Global Operating Voltage = 1.08
Power-specific unit information :
Voltage Units = 1V
Capacitance Units = 1.000000pf
Time Units = 1ns
Dynamic Power Units = 1mW (derived from V,C,T units)
Leakage Power Units = 1pW
Cell Internal Power = 108.5362 mW (98%)
Net Switching Power = 1.9321 mW (2%)
---------
Total Dynamic Power = 110.4683 mW (100%)
Cell Leakage Power = 33.3949 uW
Note that the I/O pads are already inserted, and with them the total power is estimated at about 110 mW. Power estimation without the I/O pads yields about 16 mW. The power reported after insertion of the DFT circuits is shown below. Total dynamic power rises from 110.47 mW to 286.45 mW, which is more than double, and cell leakage power rises from 33.39 uW to 38.32 uW. Even though the DFT circuits contribute nothing to the normal functionality of the design, they still consume power, including static (leakage) power. Hence, for larger designs, DFT has to be implemented carefully with this trade-off in mind. The total power consumption is nevertheless comparatively low, which can be attributed to the small number of inputs and outputs in the design: SAMM has a total of 6 inputs and 3 outputs, each 10 bits wide.
****************************************
Report : power
-analysis_effort low
Design : sam3
Version: V-2004.06-SP1
Date : Mon Apr 23 16:26:09 2007
****************************************
Global Operating Voltage = 1.08
Power-specific unit information :
Voltage Units = 1V
Capacitance Units = 1.000000pf
Time Units = 1ns
Dynamic Power Units = 1mW (derived from V,C,T units)
Leakage Power Units = 1pW
Cell Internal Power = 284.3075 mW (99%)
Net Switching Power = 2.1465 mW (1%)
---------
Total Dynamic Power = 286.4540 mW (100%)
Cell Leakage Power = 38.3158 uW
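As a quick cross-check of the two reports above, the figures can be compared with a few lines of Python. This is only an illustrative sketch: the numbers are copied from the report_power listings, while the dictionary keys and function names are made up for this example and are not part of any Synopsys tool.

```python
# Figures copied from the two report_power listings (before and after
# scan insertion). Key and function names are illustrative assumptions.
BEFORE_DFT = {"internal_mw": 108.5362, "switching_mw": 1.9321, "leakage_uw": 33.3949}
AFTER_DFT = {"internal_mw": 284.3075, "switching_mw": 2.1465, "leakage_uw": 38.3158}


def total_dynamic_mw(report):
    """Total dynamic power = cell internal power + net switching power."""
    return report["internal_mw"] + report["switching_mw"]


def growth(after, before):
    """Ratio of a quantity after DFT insertion to its value before."""
    return after / before


def dynamic_power_unit_watts(volt=1.0, cap_farad=1e-12, time_s=1e-9):
    """Power unit implied by the report's V, C, T units: V^2 * C / T."""
    return volt ** 2 * cap_farad / time_s


dyn_before = total_dynamic_mw(BEFORE_DFT)
dyn_after = total_dynamic_mw(AFTER_DFT)

print(f"dynamic power : {dyn_before:.4f} mW -> {dyn_after:.4f} mW "
      f"(x{growth(dyn_after, dyn_before):.2f})")
print(f"internal power: x{growth(AFTER_DFT['internal_mw'], BEFORE_DFT['internal_mw']):.2f}")
print(f"switching     : x{growth(AFTER_DFT['switching_mw'], BEFORE_DFT['switching_mw']):.2f}")
print(f"leakage       : x{growth(AFTER_DFT['leakage_uw'], BEFORE_DFT['leakage_uw']):.2f}")
print(f"1 V, 1 pF, 1 ns -> {dynamic_power_unit_watts() * 1e3:.3f} mW per unit")
```

Run as a script, it shows dynamic power growing by a factor of roughly 2.6 while leakage grows by only about 15%, so almost all of the reported increase sits in cell internal power. The last line confirms why the report states its dynamic power unit as 1 mW for units of 1 V, 1 pF and 1 ns.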
9 Tool automation
Automating the whole process of synthesis and report generation is important for minimizing user effort. The commands used in the previous sections for the different analyses can be collected in a script file with a .scr extension and run with the command source filename.scr, so that the tool executes them one after another without waiting for user intervention. The commands used for the SAMM design are stored in the script file sam3scr.scr, listed below.
read_file -format verilog {{/home/students/student/murali_ft06/sam3.v}}
#analyze -format verilog {{/home/students/student/murali_ft06/sam3.v}}
#elaborate -format verilog {{/home/students/student/murali_ft06/sam3.v}}
create_clock -period 4.75 clock
set_clock_uncertainty -setup 0.475 clock
set_clock_uncertainty -hold 0.27 clock
set_clock_latency 0.45 clock
set_clock_latency -source 0.45 clock
set_clock_transition 0.04 clock
set_input_delay 0.40 [all_inputs]
remove_input_delay [get_ports clock]
set_output_delay 0.40 [all_outputs]
set_port_is_pad
insert_pad
check_design
compile -map_effort high
report_area
report_power
report_timing
report_timing -delay min
report_constraint -all_violators
report_cell
set hdlin_enable_rtldrc_info true
set test_enable_dft_drc true
rtldrc
create_test_clock clock -waveform {25 50}
#set_scan_configuration -longest_chain_length 100
create_test_protocol
dft_drc
insert_scan
report_test -scan_path
report_timing
set_case_analysis 0 test_se
report_timing
report_power
estimate_test_coverage
report_cell
#report_hierarchy
ungroup -start_level 1 -all
write -hierarchy
write -hierarchy -format verilog -output /home/students/student/murali_ft06/sam3veri.v
It is assumed that the required link library and target library are already set and that the RTL description of the design is stored in the working directory. The script first reads the design from the working directory and then applies all the constraints. The compile command is executed next, followed by several report commands that produce the required reports of the design. The DFT-related commands come after that, and the reports are generated again after scan insertion so that the performance can be compared. Finally, the netlist is saved in both .db and .v formats. The netlist is also saved before DFT insertion for use in formal verification, although that write step does not appear in the listing above.
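To see how the clock constraints in the script relate to each other, the first-order setup budgets they imply can be worked out by hand or with a short Python sketch like the one below. The values are taken from the create_clock, set_clock_uncertainty, set_input_delay and set_output_delay lines of the script; the function names are invented for this example. It assumes ideal clocks, so the clock latency applies equally to launch and capture and cancels out, and it ignores the flip-flop setup time and pad delays that the tool would subtract from these budgets.

```python
# Constraint values copied from the script (time units assumed to be ns,
# matching the library's 1 ns time unit). Names are illustrative only.
CLOCK_PERIOD = 4.75
SETUP_UNCERTAINTY = 0.475
INPUT_DELAY = 0.40
OUTPUT_DELAY = 0.40


def reg_to_reg_budget(period, uncertainty):
    """Time left for logic between two registers on the same ideal clock."""
    return period - uncertainty


def input_path_budget(period, uncertainty, input_delay):
    """Time left for logic between an input port and the first register."""
    return period - uncertainty - input_delay


def output_path_budget(period, uncertainty, output_delay):
    """Time left for logic between the last register and an output port."""
    return period - uncertainty - output_delay


print(f"setup uncertainty: {100 * SETUP_UNCERTAINTY / CLOCK_PERIOD:.1f}% of the period")
print(f"reg-to-reg budget: {reg_to_reg_budget(CLOCK_PERIOD, SETUP_UNCERTAINTY):.3f} ns")
print(f"input budget     : {input_path_budget(CLOCK_PERIOD, SETUP_UNCERTAINTY, INPUT_DELAY):.3f} ns")
print(f"output budget    : {output_path_budget(CLOCK_PERIOD, SETUP_UNCERTAINTY, OUTPUT_DELAY):.3f} ns")
```

The setup uncertainty works out to 10% of the clock period, which leaves roughly 4.3 ns for register-to-register logic and roughly 3.9 ns for paths that start or end at a port.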
10 Conclusion
A 3x3 Systolic Array Matrix Multiplier is implemented. The RTL description of the design is simulated using Modelsim and then synthesized using the EDA tool Design Compiler (DC). Static timing analysis is carried out using PrimeTime, and a positive setup slack of 0.8 ns is achieved for a clock period of 4.75 ns (~210 MHz). The slack is thus about 16.84% of the clock period. The gate-level netlist and the DFT-enabled netlist are formally verified against each other using Formality. Tighter constraints and further optimization effort could yield a higher clock frequency. As a continuation of this work, the design can be synthesized and analyzed with 90 nm libraries and the results compared with those of the present design. The generated netlist can also be taken back to Modelsim and simulated with the supporting libraries to verify its functionality. Implementing the same design in another EDA tool, RTL Compiler, and comparing the results with the present design would give a further appreciation of the ASIC design methodology.
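The frequency and slack figures quoted above follow from simple arithmetic, sketched here in Python. The function names are made up for this example. The last function is only a rough upper bound: it assumes the whole setup slack could be traded one-for-one for a shorter clock period, which a re-synthesis with tighter constraints may or may not achieve.

```python
def clock_frequency_mhz(period_ns):
    """Clock frequency in MHz for a period given in ns."""
    return 1000.0 / period_ns


def slack_percent(slack_ns, period_ns):
    """Setup slack expressed as a percentage of the clock period."""
    return 100.0 * slack_ns / period_ns


def optimistic_frequency_mhz(period_ns, slack_ns):
    """Frequency if all positive slack were removed from the period
    (an optimistic assumption, not a synthesis result)."""
    return clock_frequency_mhz(period_ns - slack_ns)


PERIOD_NS = 4.75  # clock period used in the constraints
SLACK_NS = 0.8    # positive setup slack from static timing analysis

print(f"clock frequency : {clock_frequency_mhz(PERIOD_NS):.1f} MHz")
print(f"slack           : {slack_percent(SLACK_NS, PERIOD_NS):.2f}% of the period")
print(f"optimistic bound: {optimistic_frequency_mhz(PERIOD_NS, SLACK_NS):.1f} MHz")
```

With the values from this design, the sketch gives about 210.5 MHz and 16.84%, and suggests that the headroom left by the slack corresponds to a clock somewhere around 250 MHz at best.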