It's all about VLSI, ASIC, SoC: CMOS Design, Layout, Digital Design, Verilog HDL, Synthesis, Static Timing Analysis (STA), Design For Test (DFT), Physical Design, Floorplanning, Power Planning, Clock Tree Synthesis (CTS), Placement, Routing, Physical Verification, Formal Verification.
OpenSPARC is a free 64-bit processor design provided by Sun Microsystems. It is available in two flavours:
A 64-bit, 32 Thread Chip Multithreaded Microprocessor
A 64-bit, 64 Thread Chip Multithreaded Microprocessor
The RTL source files of these processors can be downloaded from the OpenSPARC website.
Quoting directly from OpenSPARC website:
"This download area is for hardware design and verification engineers, it includes
* Verilog RTL for OpenSPARC T1 design
* Verification environment for OpenSPARC T1
* Diagnostics tests for OpenSPARC T1
* Scripts and Sun internal tools needed to simulate the design and to do synthesis of the design
* Open source tools needed to simulate the design
* Scripts and documentation to help with FPGA implementation of parts of OpenSPARC T1 design including SPARC core, Floating point Unit, Cross-bar"
All OpenSPARC source code, documents and related tools can be downloaded from the OpenSPARC website (registration required).
Verification can be carried out with the help of Verilog models and similar resources. But what about synthesis and the other backend processes? Where can free timing and physical libraries be found?
These are queries we encounter often. I recently stumbled across the webpage of a company, Nangate, which provides a 45nm Open Cell Library.
Quoting from their website,
"The Nangate 45nm Open Cell Library is an open-source, standard-cell library provided for the purposes of testing and exploring EDA flows.
Nangate has developed and donated this library to Si2.org for open use......."
"The 45nm Open Cell Library contains the following views:
* Liberty (.lib) formatted libraries with CCS Timing, ECSM Timing and NLDM/NLPM data (fast, slow and typical corners)
* Geometric library in Library Exchange Format (LEF)
* Simulation libraries in Verilog and Spice (pre and post parasitic extracted netlists)
* Cell layouts in GDSII
* Schematics
* Library databook in HTML/XML format
* OpenAccess database containing layouts and netlists"
The Nangate 45nm Open Cell Library can be downloaded from their website (registration required).
Note: The SPARC processor has several cache memories (in general, nothing but SRAMs). Designers have to arrange the memory libraries themselves. Memory libraries can also be generated from memory compilers. I have not personally implemented the design flow of the OpenSPARC processor, nor have I tested the completeness of the Nangate 45nm Open Cell Library. If you have gone through the pain of designing with these open source projects, please share your experiences.
A timing path is the path between a start point and an end point, which are defined as follows:
Start Point:
Every input port and every clock pin of a sequential element is a valid start point.
End Point:
Every output port and every D pin of a sequential element is a valid end point.
For STA, the design is split into timing paths, and the delay of each timing path is calculated from its gate delays and net delays. In a timing path, data is launched, traverses the combinational elements and stops when it encounters a sequential element. In general (there are exceptions), the delay requirements of a timing path must be satisfied within one clock cycle.
When the start point and the end point of a timing path are both sequential elements and they are triggered by two different clocks, the least common multiple (LCM) of the two clock periods should be used as the common base period to find the launch edge and the capture edge for setup and hold timing analysis.
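The LCM rule can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not how any particular STA tool works: both clocks are assumed to have rising edges at time 0, periods are integers (for example picoseconds), and the function names are hypothetical. For every launch edge within the base period it finds the next capture edge and keeps the pair with the smallest separation, which is the tightest setup relation.

```python
from math import gcd


def base_period(period_a, period_b):
    """Least common multiple of two clock periods (integer time units)."""
    return period_a * period_b // gcd(period_a, period_b)


def tightest_setup_edges(launch_period, capture_period):
    """Return (launch_edge, capture_edge) with the smallest positive separation.

    Assumes both clocks have a rising edge at time 0.
    """
    base = base_period(launch_period, capture_period)
    best = None
    for launch in range(0, base, launch_period):
        # First capture rising edge strictly after this launch edge.
        capture = (launch // capture_period + 1) * capture_period
        if best is None or capture - launch < best[1] - best[0]:
            best = (launch, capture)
    return best


# Launch clock period 6000 ps, capture clock period 4000 ps.
print(base_period(6000, 4000))           # 12000
print(tightest_setup_edges(6000, 4000))  # (6000, 8000)
```

With a 6 ns launch clock and a 4 ns capture clock the base period is 12 ns, and the tightest setup check is from the launch edge at 6 ns to the capture edge at 8 ns.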
Different Timing Paths
Any synchronous design is split into various timing paths and each timing path is verified for its timing requirements. In general four types of timing paths can be identified in a synchronous design. They are:
Input to Register
Input to Output
Register to Register
Register to Output
Input to Output:
It starts at an input port and ends at an output port. This is a purely combinational path, which is rarely found in a synchronous design.
Input to Register:
Semi-synchronous: the register is controlled by the clock, but the input data can arrive at any time.
Register to Register:
Purely sequential; both starting and ending flops are controlled by the clock.
Register to Output:
Data can come at any point of time.
Clock path
The path which the clock traverses is known as the clock path. A clock path can have only clock inverters and clock buffers as its elements. A clock path may be passed through a “gated element” to achieve additional advantages. In this case, the characteristics and definitions of the clock change accordingly. We call this type of clock path a “gated clock path”. The main advantage of “clock gating” is dynamic power saving.
Data path
The path wherein data traverses is known as data path. Data path is a pure combinational path. It can have any basic combinational gates or group of gates.
Launch path
The launch path is the part of the clock path that is responsible for launching the data at the launch flip-flop.
The launch path and the data path together determine the arrival time of data at the input of the capture register.
Capture path
The capture path is the part of the clock path that is responsible for capturing the data at the capture flip-flop.
The capture clock period and the capture path delay together determine the required time of data at the input of the capture register.
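As a rough numeric sketch of these two definitions, the Python snippet below computes the arrival time, the required time and the resulting setup slack. The clock-to-Q delay and the setup time of the capture flop are not mentioned above and are added here as assumptions, and all function names and numbers (in picoseconds) are hypothetical.

```python
def arrival_time(launch_path_delay, clk_to_q, data_path_delay):
    """Arrival time of data at the capture register input."""
    return launch_path_delay + clk_to_q + data_path_delay


def required_time(clock_period, capture_path_delay, setup_time):
    """Latest time at which data may arrive at the capture register input."""
    return clock_period + capture_path_delay - setup_time


def setup_slack(arrival, required):
    """Positive slack means the setup requirement is met."""
    return required - arrival


arrival = arrival_time(launch_path_delay=500, clk_to_q=150, data_path_delay=3200)
required = required_time(clock_period=5000, capture_path_delay=600, setup_time=100)
print(arrival, required, setup_slack(arrival, required))  # 3850 5500 1650
```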
The design of peripheral and supporting circuits such as sense amplifiers, address decoders, precharge circuits and I/O control circuits is very important for the proper functioning of an SRAM. All these supporting circuits access the memory cell through the BL and BLbar lines. Address decoders select a particular cell for a read/write operation. The address decoding delay accounts for the largest part of the memory access time, in addition to the delay caused by the bit line capacitances of the memory cell itself. Read and write circuits provide an interface between the internal memory cells and the external hardware, facilitating proper data transfer between them. Before any layout is designed for these blocks, they have to be tested for functionality and worst case conditions to make the design error free.
2 Sense amplifiers
Since SRAM cells provide true differential outputs, any differential sense amplifier configuration can be applied directly to an SRAM design. One such configuration is shown in Figure (1). The sense amplifier is a latch formed by cross coupling two CMOS inverters. The sense enable (SE) signal is used to turn the sense amplifier ON or OFF. BL and BLbar become the I/O terminals of the amplifier. During a read operation, if the cell had stored 1, a small +ve voltage develops between BL and BLbar with VBL>VBLbar. The amplifier then raises VBL to VDD and pulls VBLbar to 0V. This output is then directed to the chip I/O pin by the column decoder.
Figure (1) sense amplifier
Sense amplifier performs the following functions:
* Amplification: small bit line swings are resolved by the sense amplifier. This reduces power dissipation.
* Reduction in delay: by accelerating the bit line transitions, the sense amplifier boosts the driving capability of the SRAM cell.
* Reduction in power dissipation: this is achieved by reducing the large signal swing on the bit line, eliminating the need to fully charge or discharge the bit line capacitance.
Simulation: SPICE simulation results of the sense amplifier for the schematic shown in Figure (1) are shown in Figure (2).
Figure (2) sense amplifier SPICE simulation waveform
Initially the sense enable (SE) signal is deactivated. The BL and BLbar inputs are precharged and equalized to the metastable point of the inverter. Starting a read operation causes one of the bit lines to drop. Once a sufficient differential voltage is established, the SE signal is activated. The cross coupled inverters of the amplifier then reach a stable operating point as a result of the positive feedback.
Sharing a single sense amplifier between multiple columns can save area as well as power. Pulsing the SE signal only for the short duration of evaluation also reduces the static power of the amplifier.
Normal W/L ratios are selected for the NMOS and PMOS transistors. PMOS transistors have a W/L ratio of 6.66, which for 0.18 µ technology means a gate width of 1.2 µ. For NMOS transistors this ratio is 3.33, that is to say a gate width of 0.6 µ.
Simulation results are shown in Figure (2). Here the sense amplifier is nothing but a differential amplifier. Node Y of the amplifier is forced with a pulse waveform. When SE is activated, due to the differential configuration, BLbar shows the complementary waveform of BL, as shown by the circled area in the simulation waveform. Further analysis is carried out along with the SRAM cell and precharge circuits.
3 Precharge and Equalization Circuit
The precharge and equalization circuit is shown in Figure (3)
Figure (3) Precharge circuit and simulation setup
When precharge enable (PE) goes high prior to a read operation, all three transistors conduct. M1 and M2 precharge BL and BLbar to VDD/2. M3 helps to speed up this process by equalizing the initial voltages on the two lines. This equalization is critical to the proper operation of the sense amplifier, because the sense amplifier can erroneously interpret any voltage difference present between BL and BLbar before the read operation starts.
Read operation sequence:
1. When the precharge enable (PE) signal is made high, both BL and BLbar precharge to VDD/2. Then PE is made low. This causes BL and BLbar to float for a small interval of time.
2. When the word line is activated, a voltage difference is established between BL and BLbar. If the cell had stored 1, then VBL>VBLbar. If the cell had stored 0, then VBL<VBLbar.
3. Now the sense enable (SE) signal is activated. This turns ON the sense amplifier. The positive feedback structure of the sense amplifier establishes a stable condition within a short time.
4 Half VDD generator
Half VDD sensing scheme has two advantages: it improves noise immunity and it has lower power consumption.
Figure (4) Half VDD generator
The basic circuit of the half VDD generator consists of a bias circuit and a driver circuit, as shown in Figure (4). The (W/L) ratio of the bias circuit transistors is set so that the voltage at node B is VDD/2. Therefore the voltage at node A is VDD/2+VTN (VTN: threshold voltage of the NMOS transistor) and the voltage at node C is VDD/2-|VTP| (VTP: threshold voltage of the PMOS transistor). The output voltage of the driver is stabilized at VDD/2. The static current of the driver circuit is very low because the driver transistors are only weakly ON. The driver stage is in push pull configuration. The (W/L) ratio of the driver transistors is made larger so that any unexpected change at the output node is suppressed quickly by turning ON one of the transistors strongly.
An address decoder is required to select one of the 2^M rows or columns in response to an M bit address input. A simple NOR based matrix structure fulfills this requirement. A 3x8 decoder used to decode 8 memory blocks is shown in Figure (6). A PMOS is attached to each line. When there are no read or write operations, the PEbar signal is kept high. Because of this arrangement the decoder circuit does not dissipate static power. NOR based decoders use fewer devices than a conventional decoder implementation, whose layout is also more time consuming and cumbersome than that of the NOR based implementation.
In the case of the row decoder, the PMOS is activated by the precharge control signal PEbar prior to the address decoding process. All word lines (WL) are pulled high to VDD during precharge. Column (or block) decoders have to provide the discharge path from the precharged bit line to the sense amplifier during a read operation. The same lines should be able to drive the bit line to write either 0 or 1 to the SRAM cell. The read and write access time of the memory is primarily limited by the propagation delay of the decoder. The floor plan of the decoder should be studied carefully before the layout of the row and column decoders is implemented. Decoder outputs are connected throughout the memory array, making long interconnections which are the main sources of delay and higher power consumption.
Generally NOR based decoders improve the speed of operation and achieve power efficiency. The larger the PMOS transistor, the faster the precharging and hence the faster the decoder. For 0.18 µ technology the gate width of all NMOS transistors in both row and column decoders is selected as 0.6 µ. For PMOS transistors the gate width is 1.2 µ.
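The logic of a NOR based decoder can be sketched in a few lines of Python. This is a purely behavioral illustration (no precharge, timing or transistor sizing), and the function names are hypothetical. Each output line is the NOR of M signals: for every address bit the line receives either the bit itself or its complement, chosen so that all NOR inputs are 0 only for the address that selects that line.

```python
def nor(*inputs):
    """Logical NOR of any number of 0/1 inputs."""
    return int(not any(inputs))


def nor_decoder(address_bits):
    """Behavioral model of an M-to-2^M NOR based decoder.

    address_bits is a tuple of 0/1 values, least significant bit first.
    Returns a list of 2^M outputs of which exactly one is 1.
    """
    outputs = []
    for line in range(2 ** len(address_bits)):
        nor_inputs = []
        for k, a in enumerate(address_bits):
            line_bit = (line >> k) & 1
            # Feed the complement when the line expects a 1, the true
            # signal when it expects a 0, so a match gives all zeros.
            nor_inputs.append(1 - a if line_bit else a)
        outputs.append(nor(*nor_inputs))
    return outputs


# A0=1, A1=0, A2=1 (address 5) selects output C5 of a 3x8 decoder.
print(nor_decoder((1, 0, 1)))  # [0, 0, 0, 0, 0, 1, 0, 0]
```

The same function models the 7x128 row decoder when it is called with seven address bits.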
5.1 Column decoder
In this SRAM design each block is connected as one column. Each block consists of 8 sub columns and 128 rows. BL and BLbar lines of the sub column have column enable transistors which are enabled or disabled by the output of 3x8 decoder.
Figure (6) 3x8 column decoder
At present buffer drivers for decoder outputs are not considered. But, due to the large capacitance offered by the column and row connections (more evident in row decoder) a buffer circuit may be necessary before the signal reaches column control transistors of each sub column.
Figure (7) 3x8 decoder SPICE simulation waveform
The SPICE simulation waveform is shown in Figure (7). Inputs A0 to A2 and their complements are applied appropriately as per the NOR logic. (In the waveform all signals are named in lower case.) The outputs of the decoder, C0 to C7, are highlighted by circles. False triggering of the decoder output occurs due to the rise time and fall time of the address line signals. This can be counteracted by proper control of the address inputs and the DEbar signal.
5.2 Row decoder
The 7x128 row decoder schematic is an extension of the 3x8 decoder. The discussion on capacitance and false triggering holds good here as well. The corresponding SPICE simulation waveform is shown in Figure (8).
Address inputs A3, A6 and A9 are shown in the waveform. Simulation waveforms of only six of the 128 outputs are shown. (In the simulation waveform signals are named in lower case.) They are R0, R1, R63, R64, R126 and R127, and they are highlighted by circles and arrows. For A3 to A9 = 0, R0 is selected, and for A3 to A9 = 127, R127 is selected.
6 I/O control circuits
I/O control circuits are an integral part of the memory circuit. They interface the internal memory cells with the external world. Generally the internal operation of the cell runs in a lower voltage range than the external power supply of the chip. In such cases I/O circuits become essential to resolve compatibility issues. This section presents the read and write circuits and the buffer design for the SRAM.
6.1 Read buffer
The gate level and transistor level schematic is shown in Figure (9). The corresponding truth table of the circuit is listed in Table (1). The read enable (RE) signal is a common input to the two NAND gates, while DL and DLbar are the other inputs of the two gates. A push pull configuration of transistors finally drives the DIO line, which is externally available on the chip. The basic NAND gate design strategy is used to size the transistors. All the transistors of the NAND gates have a common W/L ratio. The PMOS transistors M8 and M10 of the inverters have twice the width of M9 and M11.
Transistors M10 and M11 form the driver circuit which interfaces to the DIO line of the chip. The power supply of this driver is taken directly from the external power supply of the chip so that the logic levels are compatible with the external interface unit.
6.2 Write circuit
The write circuit should be able to force the BL and BLbar lines to change state as per the given input data by charging the large bit line capacitances instantaneously. Hence the write circuit is designed with NOR gates to provide higher current driving capability. The gate level and transistor level schematic is shown in Figure (10). The circuit resembles the read circuit with the NAND gates replaced by NOR gates. The write enable (WE) signal controls the write operation. The output of each NOR gate drives an NMOS transistor having a higher W/L ratio. These two transistors drive the DL and DLbar lines and hence the BL and BLbar lines. For the NMOS transistors of the NOR gate a W/L ratio of 3.33 (i.e. W=0.6 µ) is selected and for the PMOS transistors a W/L ratio of 12 (i.e. W=7.2 µ) is selected. The W/L ratio for the driving transistors M9 and M10 is selected to be 6.66, which gives a gate width of 1.2 µ.
Table (2) write circuit truth table
6.3 Write buffer
Figure (11) write buffer-gate and transistor level
Write buffer shown in the Figure (11) is essential to interface DIO line to the write circuitry. External DIO line is given to the first inverter stage of buffer. Buffer draws power from internal power supply line VDD. Second stage output of buffer hence becomes compatible to internal logic levels of the chip.
7 Complete SRAM chip schematic
As we have seen earlier, the complete SRAM has a total of 8 blocks, and in each block the cells are arranged in a 128x8 matrix structure. Consider Figure (12), in which one block of memory is shown. The row select lines R0 to R127 are driven by the row decoder output. Since the whole block is considered as one column for the parallel configuration of read and write operations, a single column line activates the individual sub column select transistors. Thus column decoder output C0 drives the first block, C1 drives the second block and so on till C7, which drives the 8th block. Row decoder outputs R0 to R127 are connected to all memory blocks.
A total of 8 read and write circuits is sufficient for read and write operation. As shown in Figure (12), the read and write circuit I/Os are connected to the BL and BLbar lines of the sub columns. The read and write circuit connected to the first sub column of the first memory block also connects to the first sub column of the second memory block, the third memory block and so on. Similarly the second read and write circuit is connected to the second sub column of all the blocks, and this arrangement continues for all other sub columns of the memory blocks. All 8 sets of read and write circuits are active at a given point of time to access the memory locations in one row of one memory block, as decided by the address decoders.
Figure (12) schematic of memory block
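The organization described above can be summarized with a short Python sketch. It assumes, only for illustration, that the ten address bits are packed into one integer with A0 as the least significant bit, so that A0 to A2 feed the 3x8 column (block) decoder and A3 to A9 feed the 7x128 row decoder. The constant and function names are hypothetical.

```python
BLOCKS = 8           # selected by column decoder outputs C0..C7
ROWS_PER_BLOCK = 128  # selected by row decoder outputs R0..R127
BITS_PER_WORD = 8    # one bit per sub column, read or written in parallel

TOTAL_BITS = BLOCKS * ROWS_PER_BLOCK * BITS_PER_WORD


def split_address(address):
    """Split a 10 bit address into (block, row).

    Assumed packing: bits 0..2 are A0..A2 (block), bits 3..9 are A3..A9 (row).
    """
    if not 0 <= address < BLOCKS * ROWS_PER_BLOCK:
        raise ValueError("address out of range")
    block = address & 0b111
    row = (address >> 3) & 0x7F
    return block, row


print(TOTAL_BITS)           # 8192
print(split_address(0))     # (0, 0)
print(split_address(1023))  # (7, 127)
```

Each of the 1024 addresses selects one 8 bit word, so the array stores 8192 bits in total.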
The difference in access time between this parallel architecture and an architecture in which individual memory bits are accessible has to be studied. Nonetheless, in both architectures the delay contributed by the address decoders plays a vital role. The overall switched capacitance can be reduced by dividing the word line into several sub word lines that are enabled while addressing. Similarly the bit line capacitance for every read or write operation can be reduced by partitioning the memory.
8 Conclusions
Different supporting circuits such as the sense amplifier, address decoders and I/O circuits are designed and analyzed with the help of SPICE simulation waveforms. Individual circuit performance is found to be satisfactory, and the performance with the SRAM memory cell has been reported in the previous chapter. Analysis of all these circuits confirmed their functionality. The range of differential voltage over which the sense amplifier can resolve the original logic levels, and the time required to sense this difference, have to be studied. Similarly the capacitance, and hence the delay, of the decoder circuits has to be analyzed. These studies will help in designing an accurate layout of the supporting circuits and will thereby ease the integration of these modules into the SRAM memory layout.
"CoreConnect" and "AMBA" are the two prominent bus architectures used in System on Chip designs. These architectures define technology independent standard bus protocol methodologies for easy integration of IPs within a System on Chip design. "CoreCOnnect" is mainly developed by IBM and integral part of PowerPC processor based System on Chip designs. CoreConnect bus architecture has three parts:
"AMBA" stands for "Advanced Microcontroller (Microprocessor) Bus Architecture". The AMBA specification is developed by ARM and is extensively used in ARM based System on Chip designs.
AMBA has different versions, as listed below starting from the lowest version:
Clock Definitions: Rising and falling edge of the clock
For a +ve edge triggered design +ve (or rising) edge is called ‘leading edge’ whereas –ve (or falling) edge is called ‘trailing edge’.
For a -ve edge triggered design –ve (or falling) edge is called ‘leading edge’ whereas +ve (or rising) edge is called ‘trailing edge’.
basic clock
Minimum pulse width of the clock can be checked in PrimeTime by using commands given below:
set_min_pulse_width -high 2.5 [all_clocks]
set_min_pulse_width -low 2.0 [all_clocks]
These checks are generally carried out for post layout timing analysis. Once these commands are set, PrimeTime checks for high and low pulse widths and reports any violations.
Capture Clock Edge
The edge of the clock at which data is detected is known as the capture edge.
Launch Clock Edge
This is the edge of the clock at which data is launched from the previous flip-flop, to be captured later at the current flip-flop.
launch clock and capture clock
Skew
Clock skew is the difference in the arrival time of the clock signal at the clock pins of different flops, i.e. the variation in clock arrival time at the destination points of the clock network.
Two types of skews are defined: Local skew and Global skew.
Local skew
Local skew is the difference in the arrival of clock signal at the clock pin of related flops.
Global skew
Global skew is the difference in the arrival of the clock signal at the clock pins of non related flops. It is also defined as the difference between the shortest clock path delay and the longest clock path delay reaching two sequential elements.
local and global skew
Skew can be positive or negative. When data and clock are routed in the same direction the skew is positive. When data and clock are routed in opposite directions the skew is negative.
Positive Skew
If the capture clock arrives later than the launch clock, it is called +ve skew.
Clock and data both travel in the same direction.
+ve skew can lead to hold violation.
+ve skew improves setup time.
positive skew negative skew
Negative Skew
If the capture clock arrives earlier than the launch clock, it is called -ve skew.
Clock and data travel in opposite directions.
-ve skew can lead to setup violation.
-ve skew improves hold time.
(Effects of skew on setup and hold will be discussed in detail in forthcoming articles)
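The sign convention and its effect on setup and hold can be sketched in Python. This is a simplified model under stated assumptions: skew is taken as capture clock arrival minus launch clock arrival, the zero-skew slacks are given, and skew shifts the setup slack up and the hold slack down by the same amount. All names and numbers (in picoseconds) are hypothetical.

```python
def skew(launch_clock_arrival, capture_clock_arrival):
    """Positive when the capture clock arrives later than the launch clock."""
    return capture_clock_arrival - launch_clock_arrival


def global_skew(clock_arrivals):
    """Difference between the longest and shortest clock path delay."""
    return max(clock_arrivals) - min(clock_arrivals)


def slacks_with_skew(setup_slack_no_skew, hold_slack_no_skew, skew_value):
    """Positive skew adds to setup slack and subtracts from hold slack."""
    return setup_slack_no_skew + skew_value, hold_slack_no_skew - skew_value


s = skew(launch_clock_arrival=500, capture_clock_arrival=650)
print(s)                              # 150 (positive skew)
print(slacks_with_skew(200, 100, s))  # (350, -50): setup improves, hold is violated
print(global_skew([480, 500, 650]))   # 170
```

Running the same numbers with a skew of -150 gives a setup slack of 50 and a hold slack of 250, which matches the statements above for negative skew.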
Uncertainty
Clock uncertainty is the time difference between the arrivals of clock signals at registers in one clock domain or between domains.
Pre-layout and Post-layout Uncertainty
Before CTS, uncertainty covers clock skew, clock jitter and margin. After CTS, skew is calculated from the actual propagated value of the clock, so the uncertainty can be reduced to jitter plus some margin for skew.
timing diagram depicting skew, latency, jitter
Clock latency
Clock latency is the sum of the clock source delay and the clock network delay.
Clock source delay is the time the clock takes to propagate from the ideal waveform origin point to the clock definition point. Clock network latency is the delay from the clock definition point to the register clock pin.
Pre CTS Latency and Post CTS Latency
Latency is the sum of the source latency and the network latency. An estimated (pre CTS) latency is used during synthesis; after CTS the propagated latency is used.
Source Delay or Source Latency
Source latency is defined as "the delay from the clock origin point to the clock definition point in the design". In other words, it is the time a clock signal takes to propagate from its ideal waveform origin point to the clock definition point, i.e. the beginning of the clock tree.
Network Delay (latency) or Insertion Delay
Network latency is defined as "the delay from the clock definition point to the clock pin of the register". It is the time a clock signal (rise or fall) takes to propagate from the clock definition point to a register clock pin.
Figure below shows example of latency for a design without PLL.
latency for a design without PLL
The latency definitions for designs with PLL are slightly different.
Figure below shows latency specifications of such kind of designs.
The latency from the PLL output to the clock input of the generated clock circuitry becomes the source latency. The delay from this point onwards, up to the flops driven by the generated clock, is the network latency. Note that part of the network latency is the clock to Q delay of the flip-flop (of the divide by 2 circuit in the given example), which is a known value.
latency for a design with PLL
Jitter
Jitter is the short-term variation of a signal with respect to its ideal position in time.
Jitter is the variation of the clock period from edge to edge. It can vary +/- jitter value.
From cycle to cycle the period and duty cycle can change slightly due to the clock generation circuitry. Jitter can also be generated from PLL known as PLL jitter. Possible jitter values should be considered for proper PLL design. Jitter can be modeled by adding uncertainty regions around the rising and falling edges of the clock waveform.
Sources of Jitter
Common sources of jitter include:
If more than one clock is used in a design, the clocks can be defined to have different waveforms and frequencies. These clocks are known as multiple clocks. The logic triggered by each individual clock is known as a “clock domain”.
If clocks have different frequencies there must be a base period over which all waveforms repeat.
The base period is the least common multiple (LCM) of all clock periods.
Asynchronous Clocks
In a design with multiple clock domains, clocks that do not have a common base period are called asynchronous clocks. Clocks generated from two different crystals or PLLs are asynchronous clocks. Clocks with different frequencies that are generated from a single crystal or PLL are not asynchronous; they are synchronous clocks.
Gated clocks
Clock signals that are passed through some gate other than buffers and inverters are called gated clocks. These clock signals are under the control of the gating logic. Clock gating is used to turn off the clock to some sections of the design to save power.
Generated clocks
Generated clocks are the clocks that are generated from other clocks by a circuit within the design such as divider/multiplier circuit.
Static timing analysis tools such as PrimeTime will automatically calculate the latency (delay) from the source clock to the generated clock if the source clock is propagated and you have not set source latency on the generated clock.
generated clock
‘Clock’ is the master clock and the new clock is generated from the F1/Q output. The master clock is defined with the constraint ‘create_clock’. Unless the new clock is explicitly defined as a generated clock, timing analysis tools will not treat it as one. To accomplish this, use the “create_generated_clock” command. The ‘CLK’ pin of F1 is now treated as the clock definition point for the new generated clock. Hence the clock path delay up to F1/CLK contributes to source latency, whereas the delay from F1/CLK onwards contributes to network latency.
Virtual Clocks
Virtual clock is the clock which is logically not connected to any port of the design and physically doesn’t exist. A virtual clock is used when a block does not contain a port for the clock that an I/O signal is coming from or going to. Virtual clocks are used during optimization; they do not really exist in the circuit.
Virtual clocks are clocks that exist in memory but are not part of a design. Virtual clocks are used as a reference for specifying input and output delays relative to a clock. This means there is no actual clock source in the design. Assume the block to be synthesized is “Block_A”. The clock signal, “VCLK”, would be a virtual clock. The input delay and output delay would be specified relative to the virtual clock.
Transition Delay or Slew
Transition delay or slew is the time it takes for a pin to change state. “Rise time” is defined as the time taken by a signal to rise from 10% (or 20%) to 90% (or 80%) of its maximum value.
Similarly “fall time” is defined as the time taken by a signal to fall from 90% (or 80%) to 10% (or 20%) of its maximum value.
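The rise time definition can be illustrated with a small Python sketch that measures it on a sampled waveform. It assumes the signal swings from 0 to its maximum value and uses linear interpolation between samples; the function names and the sample waveform are hypothetical.

```python
def crossing_time(times, values, level):
    """Time of the first upward crossing of level, by linear interpolation."""
    samples = list(zip(times, values))
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if v0 < level <= v1:
            return t0 + (level - v0) * (t1 - t0) / (v1 - v0)
    raise ValueError("level is never crossed")


def rise_time(times, values, low=0.1, high=0.9):
    """Time to rise from low*max to high*max (defaults: 10% to 90%)."""
    vmax = max(values)
    return crossing_time(times, values, high * vmax) - crossing_time(times, values, low * vmax)


# A linear ramp from 0 V to 1.8 V in 100 ps.
t = [0.0, 100.0]
v = [0.0, 1.8]
print(rise_time(t, v))            # about 80 ps for 10% to 90%
print(rise_time(t, v, 0.2, 0.8))  # about 60 ps for 20% to 80%
```

The two calls show why the chosen thresholds matter: the same edge has a different transition time depending on whether the 10%/90% or the 20%/80% points are used.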
Setting Transition Time Constraints
The above theoretical definitions have to be applied to practical designs. The transition time of a net is the time required for its driving pin to change logic value (from 10% (or 20%) to 90% (or 80%) of its maximum value). The transition times used for delay calculation are based on the timing library (.lib files).
Transition related constraints can be provided in Design Compiler (logic synthesis tool from Synopsys) by using below commands:
1. max_transition : This attribute is applied to each output of a cell. During optimization, Design Compiler tries to make the transition time of each net less than the value of the max_transition attribute.
2. set_max_transition: This command is used to change the maximum transition time restriction specified in a technology library.
“This command sets a maximum transition time for the nets attached to the identified ports or to all the nets in a design by setting the max_transition attribute on the named objects.
For example, to set a maximum transition time of 3.2 on all nets in the design adder, enter the following command:
set_max_transition 3.2 [get_designs adder]
To undo a set_max_transition command, use the remove_attribute command. For example, enter the following command:
(Directly quoted from Design Compiler user manual)
Setting Capacitance Constraints
The transition time constraints specified above do not provide a direct way to control the actual capacitance of nets. To control capacitance directly, below command has to be used:
set_max_capacitance: This command sets the maximum capacitance constraint on input ports or designs.
set_max_capacitance can be used in addition to set_max_transition, as the two commands work independently.
This command applies maximum capacitance limit to output pin or port of the design.
This command can also be used to apply capacitance limit on any net.
Eg:
set_max_capacitance 4 [get_designs decoder]
To remove the set_max_capacitance command, use the remove_attribute command.
Propagation delay is the time required for a signal to propagate through a gate or net.
Hence for a cell it is called “Gate or Cell Delay” and for a net it is called “Net Delay”.
The propagation delay of a gate or cell is the time it takes for a signal at the input pin to affect the signal at the output pin.
For any gate, propagation delay is measured from the 50% point of the input transition to the 50% point of the corresponding output transition.
There are 4 possibilities:
Propagation delay between 50 % of Input rising to 50 % of output rising.
Propagation delay between 50 % of Input rising to 50 % of output falling.
Propagation delay between 50 % of Input falling to 50 % of output rising.
Propagation delay between 50 % of Input falling to 50 % of output falling.
Each of these delays has a different value. The maximum and minimum values of this set are very important, because they are the propagation delay values considered for timing analysis.
For a net, propagation delay is the delay between the time a signal is first applied to the net and the time it reaches other devices connected to that net.
Propagation delay is often taken as the average of the high-to-low and low-to-high propagation delays, i.e. Tpd = (Tphl+Tplh)/2.
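As a small illustration of the points above, the sketch below picks the minimum and maximum over the four input/output transition combinations and computes the average Tpd. The delay numbers and the names arc_delays_ns, delay_bounds and average_propagation_delay are made up for this example and do not come from any library.

```python
# Hypothetical 50%-to-50% delays (in ns) for the four input/output
# transition combinations of one cell. The numbers are illustrative only.
arc_delays_ns = {
    ("rise", "rise"): 0.12,
    ("rise", "fall"): 0.10,
    ("fall", "rise"): 0.14,
    ("fall", "fall"): 0.11,
}


def delay_bounds(arcs):
    """Return (minimum, maximum) delay over all arcs."""
    values = list(arcs.values())
    return min(values), max(values)


def average_propagation_delay(t_phl, t_plh):
    """Tpd = (Tphl + Tplh) / 2."""
    return (t_phl + t_plh) / 2.0


t_min, t_max = delay_bounds(arc_delays_ns)
t_pd = average_propagation_delay(t_phl=0.10, t_plh=0.14)
print(t_min, t_max, t_pd)
```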
Propagation delay depends on the input transition time (slew rate) and the output load. Hence two-dimensional look-up tables are used to calculate these delays. How are the propagation delays of nets and gates calculated? Please refer to the articles below for a detailed explanation.
Net Delay or Interconnect Delay or Wire Delay or Extrinsic Delay
Net delay is the difference between the time a signal is first applied to the net and the time it reaches other devices connected to that net.
It is due to the finite resistance and capacitance of the net. It is also known as wire delay.
Wire delay = function of (Rnet, Cnet+Cpin)
It is measured from the output pin of a cell to the input pin of the next cell.
Net delay is calculated from the resistances (R) and capacitances (C) of the net.
There are several factors which affect net parasitics:
Net Length
Net cross-sectional area
Resistivity of material used for metal layers (Aluminum vs. copper)
Number of vias traversed by the net
Proximity to other nets (crosstalk)
Post-layout design is annotated with RCs extracted from layout for better accuracy. Annotated RCs override information from the wire load model (WLM).
Interconnect introduces capacitive, resistive and inductive parasitics. All three have multiple effects on the circuit behavior.
Interconnect parasitics cause an increase in propagation delay (i.e. they slow down working speed)
Interconnect parasitics increase energy dissipation and affect the power distribution.
Interconnect parasitics introduce extra noise sources, which affect reliability of the circuit. (Signal Integrity effects)
Dominant parameters determine the circuit behavior at a given circuit node. Non-dominant parameters can be neglected for interconnect analysis.
The inductive effect can be ignored if the resistance of the wire is substantial enough (this is the case for long aluminum wires with a small cross section) or if the rise and fall times of the applied signals are slow.
When the wires are short, the cross section of the wire is large or the interconnect material used has a low resistivity, a capacitive only model can be used.
When the separation between neighboring wires is large or when the wires only run together for short distance, inter-wire capacitance can be ignored, and all the parasitic capacitance can be modeled as capacitance to ground.
Capacitance can be modeled by the parallel plate capacitor model.
C = (ε / t) x W x L
Where
ε --> permittivity of dielectric material (SiO2)
t --> thickness of dielectric material (SiO2)
W --> width of wire
L --> length of wire
ε = εr x εo, where εr --> relative permittivity of SiO2
εo --> 8.854 x 10^-12 F/m; permittivity of free space
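A quick worked example of the parallel plate formula, as a sketch only: the relative permittivity of 3.9 is the usual textbook value for SiO2, while the wire dimensions are assumed values picked for easy arithmetic. The function name parallel_plate_capacitance is ours.

```python
EPS0 = 8.854e-12  # permittivity of free space, F/m


def parallel_plate_capacitance(eps_r, t, w, l):
    """C = (eps_r * eps0 / t) * W * L, all lengths in metres, result in farads."""
    return eps_r * EPS0 * w * l / t


# Assumed example: 1 um wide, 1 mm long wire over 1 um of SiO2 (eps_r = 3.9).
c_wire = parallel_plate_capacitance(eps_r=3.9, t=1e-6, w=1e-6, l=1e-3)
print(c_wire)  # about 3.45e-14 F, i.e. roughly 34.5 fF
```

Note that this ignores fringing and inter-wire capacitance, so it underestimates the capacitance of narrow wires.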
As the technology node shrinks (scaling), it is desirable to keep the cross section of the wire (WxH) as large as possible to minimize wire resistance. But this increases area. Small values of W lead to denser wiring and less area overhead. In advanced processes the W/H ratio has dropped below unity. Under such circumstances the parallel plate capacitance model becomes inaccurate. The capacitance between the sidewalls of the wire and the substrate, called fringing capacitance, can no longer be ignored and contributes to the overall capacitance.
Inter-wire capacitance becomes a dominant factor in multilayer interconnect structures. These floating capacitors (not connected to substrate or ground) form a source of noise (crosstalk). This effect is more pronounced for wires in the higher interconnect layers, as these are farther away from the substrate.
Generally, higher metal layers (i.e. interconnects) have higher thickness (i.e. height) and higher dielectric layers have higher permittivity. Hence these wires display the highest inter-wire capacitance. They are therefore best used for global signals that are not sensitive to interference (e.g. supply rails). Alternatively, it is advisable to separate wires by more than the minimum spacing.
The resistance of a wire is R = ρL/(H.W), where ρ is the resistivity of the material. Since H (height, thickness) is constant for a given technology, we can write: R = Rs.(L/W), where Rs = ρ/H ohm/square is called the “sheet resistance”.
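The sheet resistance relation can be tried out with a few lines of code. This is only a sketch: the resistivity used for aluminum (2.7e-8 ohm-m) is a typical handbook value, and the wire dimensions are assumed for the example.

```python
def sheet_resistance(rho, h):
    """Rs = rho / H, in ohm/square (rho in ohm-m, H in m)."""
    return rho / h


def wire_resistance(rs, l, w):
    """R = Rs * (L / W); L/W is the number of 'squares' along the wire."""
    return rs * l / w


# Assumed example: aluminum wire, 0.5 um thick, 1 um wide, 1 mm long.
rs_al = sheet_resistance(rho=2.7e-8, h=0.5e-6)
r_wire = wire_resistance(rs_al, l=1e-3, w=1e-6)
print(rs_al, r_wire)  # about 0.054 ohm/square and about 54 ohm
```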
At very high frequencies “skin effect” comes into play such that the resistance becomes frequency dependent. High frequency currents tend to flow primarily on the surface of a conductor, with the current density falling off exponentially with depth into the conductor.
Skin effect is only an issue for wider wires. Since clocks tend to carry the highest frequency signals on a chip and are also fairly wide to limit resistance, the skin effect is likely to have its first impact on these lines.
With the adoption of low resistance interconnect materials and the increase of switching frequencies to the GHz range, inductance starts to play an important role. Consequences of on-chip inductance include ringing and overshoot effects, reflection of signals due to impedance mismatch, inductive coupling between lines, and switching noise due to (Ldi/dt) voltage drops.
As long as the resistive component of the wire is small, and switching frequencies are in the low to medium range, it is meaningful to consider only the capacitive component of the wire, and to lump the distributed capacitance into a single capacitance.
The only impact on performance is introduced by the loading effect of the capacitor on the driving gate.
If the wire length is more than a few millimeters, the lumped capacitance model is inadequate and a resistive-capacitive model has to be adopted.
In the lumped RC model, the total resistance of each wire segment is lumped into one single R, and the global capacitance is combined into a single capacitor C.
Analysis of a network with a large number of Rs and Cs becomes complex because the network contains many time constants (zeroes and poles). The Elmore delay model overcomes this problem.
“Path resistance” is the resistance from source node to any other node.
“Shared path resistance” is the resistance shared among the paths from the source node to any other two nodes.
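Hence, each node capacitance contributes a term equal to that capacitance times the path resistance from the source to the node:
Term for node 1: τd1 = R1C1
Term for node 2: τd2 = (R1+R2)C2
Term for node 3: τd3 = (R1+R2+R3)C3
In general, the delay at the last node i of the chain is the sum of these terms:
τdi=R1C1+(R1+R2)C2+……..+(R1+R2+R3+…..+Ri)Ci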
If
R1=R2=R3=….=R
C1=C2=C3=…..C then
τdi=RC+2RC+……..+nRC
Thus Elmore delay is equivalent to the first order time constant of the network.
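The sum above is easy to compute with a running path resistance. Below is a minimal sketch for an unbranched RC chain only (a tree would need the shared path resistance between nodes); the function name elmore_delay_chain and the component values are our own.

```python
def elmore_delay_chain(resistances, capacitances):
    """Elmore delay from the source to the last node of an RC chain.

    Computes R1*C1 + (R1+R2)*C2 + ... + (R1+...+Rn)*Cn.
    """
    if len(resistances) != len(capacitances):
        raise ValueError("need one capacitance per resistance")
    delay = 0.0
    path_resistance = 0.0
    for r, c in zip(resistances, capacitances):
        path_resistance += r          # resistance from the source to this node
        delay += path_resistance * c  # this node's contribution
    return delay


# Assumed values: ohms and farads.
tau = elmore_delay_chain([100.0, 200.0, 300.0], [1e-15, 2e-15, 3e-15])
# Equal segments: RC + 2RC + 3RC + 4RC = 10 RC.
tau_equal = elmore_delay_chain([1.0] * 4, [1.0] * 4)
print(tau, tau_equal)
```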
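Assume an interconnect wire of length L is partitioned into N identical segments, each of length L/N, with R and C the resistance and capacitance per unit length. Each segment then has resistance (L/N)R and capacitance (L/N)C.
Then,
τd = (L/N)R.(L/N)C + 2(L/N)R.(L/N)C + …… + N(L/N)R.(L/N)C
= (L/N)^2 (RC + 2RC + …… + NRC)
= (L/N)^2 . RC . N(N+1)/2
For large N this approaches
τd = RC.L^2/2
=> The delay of a wire is a quadratic function of its length
=> Doubling the length of the wire quadruples its delay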
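The effect of the number of segments can be checked numerically. This sketch evaluates the N-segment sum in closed form and compares it with the large-N limit RC.L^2/2; unit values of R, C and L are assumed purely for illustration.

```python
def segmented_wire_delay(r_per_len, c_per_len, length, n):
    """Elmore delay of a wire split into n equal RC segments.

    (L/N)^2 * R*C * N*(N+1)/2, with R and C per unit length.
    """
    seg = length / n
    return seg ** 2 * r_per_len * c_per_len * n * (n + 1) / 2.0


def distributed_wire_delay(r_per_len, c_per_len, length):
    """Limit of the segmented delay for large N: R*C*L^2 / 2."""
    return r_per_len * c_per_len * length ** 2 / 2.0


one_segment = segmented_wire_delay(1.0, 1.0, 2.0, n=1)      # single lump
many_segments = segmented_wire_delay(1.0, 1.0, 2.0, n=1000)
limit = distributed_wire_delay(1.0, 1.0, 2.0)
print(one_segment, many_segments, limit)
```

With a single segment the estimate is twice the limit, which is one way to see why a single lumped RC is pessimistic for long wires.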
Advantages
It is simple
It is always situated between minimum and maximum bounds
Disadvantages
It is pessimistic and inaccurate for long interconnect wires.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Distributed RC model
The lumped RC model is always pessimistic, and the distributed RC model provides better accuracy than the lumped RC model.
But the distributed RC model is complex and no closed form solution exists. Hence the distributed RC line model is not suitable for Computer Aided Design tools. The behavior of the distributed RC line can be approximated by a lumped RC ladder network analyzed with the Elmore delay model, hence such approximations are used extensively in EDA tools.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Transmission Line Model
When the frequency of operation increases to the point where the rise (or fall) time of the signal becomes comparable to the time of flight of the net, inductive effects start to dominate over RC effects.
This inductive effect is modeled by transmission line models. The model assumes that the signal is a "wave" that propagates over the medium, the "net".
There are two types of transmission line models:
Lossless transmission line model: This is good for Printed Circuit Board level design.
Lossy transmission line model: This model is used for IC interconnect.
Transmission line effects should be considered when the rise or fall time of the input signal is smaller than the time of flight of the transmission line, or when the resistance of the wire is less than the characteristic impedance.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Wire Load Models
Extraction data from already routed designs are used to build a lookup table known as the wire load model (WLM). WLM is based on the statistical estimates of R and C based on “Net Fan-out”.
For fanouts greater than those specified in a wire load table, a “slope factor” is specified for linear extrapolation.
wire_load (“5KGATES”) {
resistance : 0.000271 -------------> R per unit length
capacitance : 0.00017 -------------> C per unit length
slope : 29.4005 ---------------------> Used for linear extrapolation
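Taking 135.98 as the net length at the largest fanout in the table (a fanout of 5, as implied by the two-step extrapolation), a net with a fanout of 7 gives:
Net length = 135.98 + 2 x 29.4005 (slope) = 194.78 ----------> length of net with fanout of 7
Resistance = 194.78 x 0.000271 = 0.05279 units
Capacitance = 194.78 x 0.00017 = 0.03311 units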
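The same calculation can be written as a small lookup routine. This is a sketch under assumptions: the fanout/length entries for fanouts 1 to 4 are hypothetical placeholders, the entry (5, 135.98) is inferred from the arithmetic above, and wlm_net_length is our own helper, not a tool command.

```python
# Fanout -> net length table. Entries for fanouts 1-4 are hypothetical;
# only the last entry (5 -> 135.98) is implied by the worked example.
FANOUT_LENGTH = {1: 39.27, 2: 66.67, 3: 91.05, 4: 113.70, 5: 135.98}
RESISTANCE_PER_LENGTH = 0.000271
CAPACITANCE_PER_LENGTH = 0.00017
SLOPE = 29.4005


def wlm_net_length(fanout, table=FANOUT_LENGTH, slope=SLOPE):
    """Net length from the table, extrapolated linearly beyond the last entry."""
    if fanout in table:
        return table[fanout]
    max_fanout = max(table)
    if fanout > max_fanout:
        return table[max_fanout] + (fanout - max_fanout) * slope
    raise ValueError("fanout not covered by this sketch")


length = wlm_net_length(7)
resistance = length * RESISTANCE_PER_LENGTH
capacitance = length * CAPACITANCE_PER_LENGTH
print(round(length, 2), round(resistance, 5), round(capacitance, 5))
```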
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Wire load models for synthesis
Wire load modeling allows us to estimate the effect of wire length and fanout on the resistance, capacitance, and area of nets. Synthesizer uses these physical values to calculate wire delays and circuit speeds. Semiconductor vendors develop wire load models, based on statistical information specific to the vendors’ process. The models include coefficients for area, capacitance, and resistance per unit length, and a fanout-to-length table for estimating net lengths (the number of fanouts determines a nominal length).
Selection of wire load models in the initial stage (before physical design) depends on the following factors:
1. User specification
2. Automatic selection based on design area
3. Default specification in the technology library
Once the final routing step is over in the physical design stage, wire load models are generated based on the actual routing in the design and synthesis is redone using those wire load models.
In hierarchical designs, we have to determine which wire load model to use for nets that cross hierarchical boundaries. There are three modes for determining which wire load model to use for nets that cross hierarchical boundaries:
Top:
The same wire load model is applied to all nets, as if the design had no hierarchy: the wire load model specified for the top level of the design hierarchy is used for all nets in the design and its sub-designs.
Enclosed:
The wire load model of the smallest design that fully encloses the net is applied. If the design enclosing the net has no wire load model, the tool traverses the design hierarchy upward until it finds a wire load model. Enclosed mode is more accurate than top mode when cells in the same design are placed in a contiguous region during layout.
Use enclosed mode if the design has similar logical and physical hierarchies.
Segmented:
The wire load model for each segment of a net is determined by the design encompassing the segment. Nets crossing hierarchical boundaries are divided into segments. For each net segment, the wire load model of the design containing the segment is used. If the design containing a segment has no wire load model, the tool traverses the design hierarchy upward until it finds a wire load model.
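The three modes differ only in which design is asked for its wire load model. The sketch below models that choice on a toy hierarchy; the Design class, the function names and the model names are all hypothetical and are not how any synthesis tool is implemented.

```python
class Design:
    """A node in the design hierarchy with an optional wire load model name."""

    def __init__(self, name, parent=None, wlm=None):
        self.name = name
        self.parent = parent
        self.wlm = wlm


def nearest_wlm(design):
    """Walk up the hierarchy until a design with a wire load model is found."""
    while design is not None:
        if design.wlm is not None:
            return design.wlm
        design = design.parent
    return None


def top_mode(design):
    """Top: always use the model of the top-level design."""
    while design.parent is not None:
        design = design.parent
    return design.wlm


def enclosed_mode(smallest_enclosing_design):
    """Enclosed: model of the smallest design that fully encloses the net."""
    return nearest_wlm(smallest_enclosing_design)


def segmented_mode(segment_designs):
    """Segmented: one model per net segment, from the design holding it."""
    return [nearest_wlm(d) for d in segment_designs]


top = Design("top", wlm="40KGATES")
mid = Design("mid", parent=top, wlm="20KGATES")
leaf_a = Design("leaf_a", parent=mid, wlm="5KGATES")
leaf_b = Design("leaf_b", parent=mid)  # no model of its own

# A net running leaf_a -> mid -> leaf_b is fully enclosed by "mid".
print(top_mode(leaf_a), enclosed_mode(mid), segmented_mode([leaf_a, mid, leaf_b]))
```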
The performance of deep submicron ICs is limited by the increasing interconnect loading effect. Long global clock networks account for the larger part of the power consumption in chips. Traditional CAD design methodologies are largely affected by interconnect scaling. The capacitance and resistance of interconnects have increased due to smaller wire cross sections, smaller wire pitch and longer lengths. This has resulted in increased RC delay. As technology advances, interconnect scaling also continues, and the increased RC delay is becoming a major bottleneck in improving the performance of advanced ICs.
In the accompanying graph, the gate delay and the interconnect delay are shown as functions of technology node, ranging from 180nm to 60nm. The interconnect delays shown assume a line in which repeaters are inserted optimally, and include the delay due to the repeaters. From the graph it can be observed that as technology shrinks, gate delay reduces but interconnect delay increases.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Limits of Cu/low-k interconnects
Copper with low-k dielectric was introduced at the 250 nm submicron level to reduce the effects of increasing interconnect delay. But below the 130 nm technology node, interconnect delays keep increasing despite the introduction of low-k dielectric. As scaling continues, new physical and technological effects such as resistivity and barrier thickness start to dominate, and interconnect delay increases. Introducing repeaters to shorten the interconnect length increases total area. The vias connecting repeaters to global layers can cause blockage in the lower metal layers. Thus, as technology advances, material limitations will become the dominant factor in interconnect delay. Increasing the metal width will lead to an increase in the number of metallization layers. This cannot be a solution to the problem, as it adds complexity and cost and raises reliability concerns.
Cu interconnects in low-k dielectric are fabricated by a special process known as the Damascene process. Cu adheres very poorly to dielectric materials. Under electrical bias, Cu atoms easily drift and cause shorts between metal layers. To avoid this problem, a barrier layer is deposited between the dielectric and the Cu trench. Even though it decreases the effective cross section of the interconnect compared to the drawn dimensions, it improves reliability. The barrier thickness becomes significant at the deep submicron level, and the effective resistance of the interconnect rises further. In addition, increased electron scattering and self-heating caused by the electron flow in interconnects, along with the rise in internal chip temperature, also contribute to higher interconnect resistance.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
References
[1] Jan M. Rabaey, Anantha Chandrakasan and Borivoje Nikolic, "Digital Integrated Circuits- A Design Perspective", Prentice Hall, Second Edition
[2] Design Compiler User Manual
We encounter several types of delays in ASIC design. They are as follows:
Gate delay or Intrinsic delay
Net delay or Interconnect delay or Wire delay or Extrinsic delay or Flight time
Transition or Slew
Propagation delay
Contamination delay
Wire delays or extrinsic delays are calculated using output drive strength, input capacitance and wire load models. Other delays are intrinsic properties of each and every gate.
Delays depend on several interrelated electrical properties [Nekoogar]:
Input capacitance of the logic gate is a function of output state, output loads and input slew rate.
Internal timing arcs and output slew rate is a function of switching input(s).
Capacitance of the wire is dependent on frequency.
Internal timing arcs are a function of input slew rates.
Output slew rate is a function of input slew rate on each input.
Wires exhibit RLC characteristics instead of lumped RC.
Gate Delay
Transistors within a gate take a finite time to switch. This means that a change on the input of a gate takes a finite time to cause a change on the output. [Magma]
Gate delay =function of (input transition (slew) time, Cnet+Cpin).
or
Gate delay =function of (input transition (slew) time, Cload).
where Cload=Cnet+Cpin
Cnet-->Net capacitance
Cpin-->pin capacitance of the driven cell
Cell delay is the same as gate delay.
How is gate delay calculated?
Cell or gate delay is calculated using Non-Linear Delay Models (NLDM). NLDM is highly accurate as it is derived from SPICE characterizations. The delay is a function of the input transition time (i.e. slew) of the cell, the wire capacitance and the pin capacitance of the driven cells. A slow input transition slows the rate at which the cell’s transistors can change state (logic 1 to logic 0 or logic 0 to logic 1), and so does a large output load Cload (Cnet + Cpin), thereby increasing the delay of the logic gate.
There is another NLDM table in the library to calculate output transition. Output transition of a cell becomes the input transition of the next cell down the chain.
Table models are usually two-dimensional to allow lookups based on the input slew and the output load (Cload). A sample table is given below.
Situation 1:
Input transition and output load values match with table index values
If both input transition and output load values match with table index values then corresponding delay value is directly picked up from the delay “values” table as highlighted by yellow shaded data.
Situation 2:
Output load value does not match with table index values
When the actual load capacitance value does not fall directly on one of the load-axis index points, the delay is determined by interpolation from the closest points. Note that to carry out this interpolation, the input transition point should match one of the table index values.
Determine the equation for the line segment connecting the two nearest points in the table.
To do this first we need to find the slope value.
Slope m = (y2-y1)/(x2-x1) where (y2-y1) is the delay segment (generally in ns) on the y-axis and (x2-x1) is the load segment (generally in pF) on the x-axis.
Solve for the delay at the load point of interest.
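The linear equation is:
y = mx+c
where
y-->delay (ns)
m-->slope
x-->load capacitance (pF)
c-->intercept, fixed by requiring the line to pass through the nearest table point (x1, y1), i.e. c = y1 - m*x1
i.e. delay = y1 + slope*(load point of interest - x1)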
Load point of interest means load capacitance value for which delay has to be calculated.
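Situation 2 can be sketched in a few lines. The line is written in point-slope form so that it passes through both neighbouring table points; the two (load, delay) points used here are assumed values, not taken from any library, and interpolate_delay is our own helper name.

```python
def interpolate_delay(load, load1, delay1, load2, delay2):
    """Linear interpolation of delay between two neighbouring load index points.

    slope m = (y2 - y1) / (x2 - x1); delay = y1 + m * (x - x1).
    """
    slope = (delay2 - delay1) / (load2 - load1)
    return delay1 + slope * (load - load1)


# Assumed table points: 0.01 pF -> 0.10 ns and 0.03 pF -> 0.20 ns.
delay_ns = interpolate_delay(0.02, 0.01, 0.10, 0.03, 0.20)
print(delay_ns)  # about 0.15 ns, halfway between the two table values
```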
Situation 3:
Both input transition and output load values do not match with table index values
If both input transition and load capacitance values do not match exactly with the look up table index values then bilinear interpolation is used.
Multiple linear interpolations (~3) are performed on multiple closest table data points (~4) as shown in highlighted violet color in the look up table.
Situation 4:
Output load value does not match with table index values and is outside the table boundary
When the load point is outside the boundary of the index, the delay is extrapolated from the closest known points.
Lookup value too far out of range of the given table value could lead to inaccuracy. [Cadence]
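All four situations can be handled by one lookup routine: pick the table segment around the point on each axis, interpolate twice along the load axis and once along the slew axis (three linear interpolations over four table points), and reuse the edge segment when the point lies outside the table. The sketch below assumes a small made-up table; it illustrates the idea and is not the algorithm of any particular tool.

```python
from bisect import bisect_right


def _segment(axis, x):
    """Indices of the axis segment used for x.

    Points outside the axis are clamped to the first or last segment,
    which makes the linear formula below extrapolate.
    """
    i = bisect_right(axis, x) - 1
    i = max(0, min(i, len(axis) - 2))
    return i, i + 1


def _lerp(x, x0, y0, x1, y1):
    """Value at x on the straight line through (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def nldm_lookup(slews, loads, values, slew, load):
    """Delay from a 2-D table; values[i][j] is the delay at slews[i], loads[j]."""
    i0, i1 = _segment(slews, slew)
    j0, j1 = _segment(loads, load)
    # Two interpolations along the load axis ...
    d0 = _lerp(load, loads[j0], values[i0][j0], loads[j1], values[i0][j1])
    d1 = _lerp(load, loads[j0], values[i1][j0], loads[j1], values[i1][j1])
    # ... and one along the input-transition axis.
    return _lerp(slew, slews[i0], d0, slews[i1], d1)


# Hypothetical table: slews in ns, loads in pF, delays in ns.
SLEWS = [0.1, 0.5, 1.0]
LOADS = [0.01, 0.05, 0.10]
DELAYS = [
    [0.10, 0.20, 0.30],
    [0.15, 0.25, 0.35],
    [0.25, 0.35, 0.45],
]

exact = nldm_lookup(SLEWS, LOADS, DELAYS, 0.5, 0.05)     # situation 1
bilinear = nldm_lookup(SLEWS, LOADS, DELAYS, 0.3, 0.03)  # situation 3
outside = nldm_lookup(SLEWS, LOADS, DELAYS, 0.1, 0.15)   # situation 4
print(exact, bilinear, outside)
```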
Intrinsic delay
Intrinsic delay is the delay internal to the gate. This is from input pin of the cell to output pin of the cell.
It is defined as the delay between an input and output pair of a cell, when a near zero slew is applied to the input pin and the output does not see any load condition. It is caused by the internal capacitance associated with its transistor.
This delay is largely dependent on the size of the transistors forming the gate, because increasing the size of the transistors increases the internal capacitances.
References
[Nekoogar] Farzad Nekoogar, “Timing Verification of Application Specific Integrated Circuits”, Prentice Hall
Timing analysis is integral part of ASIC/VLSI design flow. Anything else can be compromised but not timing! Timing analysis can be static or dynamic. Dynamic timing analysis verifies functionality of the design by applying input vectors and checking for correct output vectors whereas Static Timing Analysis checks static delay requirements of the circuit without any input or output vectors.
Dynamic timing analysis has to be completed and the functionality of the design must be verified before the design is subjected to Static Timing Analysis (STA). Dynamic Timing Analysis (DTA) and Static Timing Analysis (STA) are not alternatives to each other. The quality of Dynamic Timing Analysis (DTA) increases with the number of input test vectors, but more test vectors also increase simulation time. Dynamic timing analysis can be used for synchronous as well as asynchronous designs. Static Timing Analysis (STA) can’t be run on asynchronous designs, and hence Dynamic Timing Analysis (DTA) is the best way to analyze asynchronous designs. Dynamic Timing Analysis (DTA) is also best suited for designs with clocks crossing multiple domains.
Examples of Dynamic Timing Analysis (DTA) tools are Modelsim (from Mentor Graphics) and VCS (from Synopsys). DTA is also carried out on the post-layout netlist to verify that the functionality of the design has not changed. The test vectors remain the same for both.
SPICE Simulation
Device level timing analysis is carried out using SPICE simulation. SPICE simulation is essential for full custom designs to verify the electrical properties of the design. These are calculated from the mathematical equations that represent the electrical properties of the devices. Material properties and some of the electrical properties of the devices, which are represented by either variables or constants, are stored in model files. Examples are the threshold voltage of a MOSFET, electron density etc. SPICE characterized data is tabulated in technology libraries, which becomes the basic delay information for Static Timing Analysis. For example, let us consider an AND gate. Several electrical properties such as input and output transition, propagation delay, output capacitance etc. are evaluated by SPICE simulation. SPICE simulated data gives maximum accuracy compared to any other form of simulation. SPICE code is manually written and simulated. Hence for a larger design SPICE simulation is a cumbersome job. There are specific tools available for transistor level Static Timing Analysis (STA) (e.g. Pathmill from Synopsys), with SPICE simulation being the backbone of all these tools.
What is Static Timing Analysis (STA)?
In Static Timing Analysis (STA), static delays such as gate delays and net delays are considered in each path, and these delays are compared against their required maximum and minimum values. The circuit to be analyzed is broken into different timing paths consisting of gates, flip flops and their interconnections. Each timing path has to process the data within a clock period, which is determined by the maximum frequency of operation. Cell delays are available in the corresponding technology libraries. Cell delay values are tabulated based on input transition and fanout load, which are characterized by SPICE simulation. Net delays are calculated based on the Wire Load Models (WLM) or on extracted resistance R and capacitance C. Wire Load Models (WLM) are available in the Technology File. These values are Table Look Up (TLU) values calculated based on the net fanout length.
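The comparison of path delay against the clock period can be illustrated with a simple setup check on one register-to-register path. This is a deliberately simplified sketch: it ignores clock skew, clock uncertainty and hold checks, and all delay values and function names are assumed for the example.

```python
def path_delay(stages):
    """Sum of (cell delay + net delay) over all stages of a timing path."""
    return sum(cell + net for cell, net in stages)


def setup_slack(clock_period, clk_to_q, stages, setup_time):
    """Required time minus arrival time; negative slack is a violation."""
    arrival = clk_to_q + path_delay(stages)
    required = clock_period - setup_time
    return required - arrival


# Assumed values in ns: three gates, each with its output net delay.
stages = [(0.30, 0.10), (0.40, 0.15), (0.25, 0.05)]
slack = setup_slack(clock_period=2.0, clk_to_q=0.2, stages=stages, setup_time=0.1)
print(slack)  # about 0.45 ns of positive slack
```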
The static timing analyzer will report the following delays (or it can do following analysis):
Register to Register delays
Setup times of all external synchronous inputs
Clock to Output delays
Pin to Pin combinational delays
Different Analysis Modes-Best, Worst, Typical, On Chip Variation (OCV)
Data to Data Checks
Case Analysis
Multiple Clocks per Register
Minimum Pulse Width Checks
Derived Clocks
Clock Gating Checks
Netlist Editing
Report_clock_timing
Clock Reconvergence Pessimism
Worst-Arrival Slew Propagation
Path-Based Analysis
Debugging Delay Calculation
and many more......!!
The wide spread use of STA can be attributed to several factors [David]:
The basic STA algorithm is linear in runtime with circuit size, allowing analysis of designs in excess of 10 million instances.
The basic STA analysis is conservative in the sense that it will over-estimate the delay of long paths in the circuit and under-estimate the delay of short paths in the circuit. This makes the analysis “safe”, guaranteeing that the design will function at least as fast as predicted and will not suffer from hold-time violations.
The STA algorithms have become fairly mature, addressing critical timing issues such as interconnect analysis, accurate delay modeling, false or multi-cycle paths, etc.
Delay characterization for cell libraries is clearly defined, forms an effective interface between the foundry and the design team, and is readily available. In addition to this, the Static Timing Analysis (STA) does not require input vectors and has a runtime that is linear with the size of the circuit [Agarwal].
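The linear runtime mentioned above comes from visiting every node and edge of the timing graph once, in topological order, while keeping the latest and earliest arrival time at each node. The sketch below shows that traversal on a tiny made-up graph; it is an illustration of the idea, not the implementation of any STA tool, and the node names and delays are assumptions.

```python
from collections import defaultdict, deque


def arrival_times(edges):
    """Latest and earliest arrival times for a directed acyclic timing graph.

    edges is a list of (from_node, to_node, delay). Nodes without incoming
    edges are treated as start points with arrival time 0.
    """
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for u, v, delay in edges:
        successors[u].append((v, delay))
        indegree[v] += 1
        nodes.update((u, v))

    start = [n for n in nodes if indegree[n] == 0]
    latest = {n: 0.0 for n in start}
    earliest = {n: 0.0 for n in start}

    queue = deque(start)
    while queue:  # each node and edge is handled exactly once
        u = queue.popleft()
        for v, delay in successors[u]:
            latest[v] = max(latest.get(v, float("-inf")), latest[u] + delay)
            earliest[v] = min(earliest.get(v, float("inf")), earliest[u] + delay)
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return latest, earliest


edges = [("in", "a", 1.0), ("in", "b", 2.0), ("a", "out", 3.0), ("b", "out", 1.0)]
latest, earliest = arrival_times(edges)
print(latest["out"], earliest["out"])
```

The latest arrival time is what a setup (maximum delay) check uses, and the earliest arrival time is what a hold (minimum delay) check uses.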
Advantages of STA:
All timing paths are considered for the timing analysis. This is not the case in simulation.
Analysis times are relatively short when compared with event and circuit simulation.
Timing can be analyzed for worst case, best case simultaneously. This type of analysis is not possible in dynamic timing analysis.
Static Timing Analysis (STA) works with timing models. STA has more pessimism and thus gives maximum delay of the design. DTA performs full timing simulation. The problem associated with DTA is the computational complexity involved in finding the input patterns (vectors) that produce maximum delay at the output and hence it is slow.
Disadvantages of STA:
All paths in the design may not run always in worst case delay. Hence the analysis is pessimistic.
All clock related information has to be fed to the tool in the form of constraints.
Inconsistency or incorrectness or under constraining of these constraints may lead to disastrous timing analysis.
STA does not check for logical correctness of the design.
STA is not suitable for asynchronous circuits.
References
[David] David Blaauw, Kaviraj Chopra, Ashish Srivastava and Lou Scheffer, “Statistical Timing Analysis: From basic principles to state-of-the-art.”, Transactions on Computer-Aided Design of Integrated Circuits and Systems (T-CAD), IEEE.
[Agarwal] Agarwal, A., Blaauw, D., Zolotov, V., Sundareswaran, S., Zhao, M., Gala, K. and Panda, R., “Statistical delay computation considering spatial correlations,” Proceedings of the ASP-DAC 2003, pp. 271-276, Jan 2003.