Clock Tree Synthesis:
- Inputs of CTS
- Clock Tree Synthesis overview
- Clock Tree Synthesis Steps
- CTS Quality Checks
- H-Tree Algorithm
- ICG Cell and related concepts
- Timing Optimization
- Useful Skew
- Debug Timing
Clock: A periodic signal with a constant period and, ideally, equal high and low phases (50% duty cycle). It controls data propagation through the clocked elements such as flip-flops and latches. The clock source usually sits in the top-level design, and the clock propagates from there. Constant sources such as a PLL or an oscillator are normally used to generate the clock. Since the clock sets the frequency at which data moves through the design, we need to be very careful while creating clocks. We already saw in the SDC section of the physical design inputs how to create a clock. Once the clock is created, we need to propagate it so that all the clocked elements in the design switch at the same time. To achieve this we need to balance the clocks, and this is where Clock Tree Synthesis comes into the picture. Let’s talk about this topic in detail.
# Inputs of Clock Tree Synthesis:
#1. Placement DB
#2. CTS Spec File
- Placement DB:
- Placement DB contains the placement-completed netlist, DEF, LIB, LEF, SDC, UPF and the other files from the placement database. It can be a zipped file. This DB is also known as the PLACE EXIT db, which means we are not going to do any standard cell placement or related work from here onwards.
- CTS Spec File:
The CTS spec file contains the following information:
#1. Inverters or buffers to be defined which will be used to balance clock tree.
#2. CTS Exceptions. (End points of clock tree)
#3. Skew group information.
#4. Target skew, max target transition and other timing constraints as per clock tree.
#5. Top layer and bottom layer route info. Via info which will be used during clock route.
#6. Clock related info. (Generated clocks, e.g. clock divider, clock multiplier etc.)
#7. NDR rule definition.
#1. Inverters or buffers to be defined which will be used to balance clock tree:
When I was new to the VLSI industry, I always thought that the buffer was the only element used for clock balancing, but when I worked on a design, I saw that the inverter plays a very important role in balancing. So, we can use both buffers and inverters while balancing the clock tree, and normally more inverters are used in the design. Before understanding why an inverter is better than a buffer, we need to recall the difference between the two. (This might be an interview question, and I have asked it of many candidates.) A buffer is basically two inverters connected in series. So, wherever we need one buffer we can split it into two inverters, which balance better in terms of transition and consume less power and area. For example, if we need a chain of 30 buffers to balance a tree, we could use 60 inverters instead, but in practice we don’t need 60 cells; 30-40 inverters usually meet the requirement. So, we save power and area and also get a better clock transition.
#2. CTS Exceptions. (End points of clock tree):
There are many points in the design beyond which we don’t need clock tree propagation. To avoid unnecessary buffering, we can ask the tool not to balance beyond these points.
#3. Skew group information:
There are millions of sink pins in the design that we need to balance. When the design is huge, balancing all of them together leads to high latency, so to avoid this we create skew groups. We will talk in more detail about creating and executing skew groups later in this section.
#4. Contains target Skew, max target transition and other timing constraints as per clock tree.
The spec file defines the target skew values and the max and min transition of the clocks, along with other timing constraints.
#5. Top layer and bottom layer route info. Via info which will be used during clock route.
While creating the clock tree we need to route the clock, so we need to define the routing layers. Generally, we choose the upper metal layers for clock routing, as they have lower resistance than the lower metal layers.
#6. Clock related info. (Generated clocks, e.g. clock divider, clock multiplier etc.)
If we missed defining a generated clock in the SDC, we can define it here while balancing the clock.
#7. NDR rule definition
Clock nets are very sensitive, and even a small change on them impacts timing. The design already has a default routing rule, but if we route the clock with the default rule we might end up with issues like crosstalk, transition violations and min pulse width violations. To avoid these issues in later ECO stages, we define a Non-Default Rule (NDR) for the clock nets. An NDR is a user-defined routing rule other than the default rule, such as double width double spacing, single width double spacing etc.
#2. Clock Tree Synthesis Overview:
- Clocks are used to synchronize data communication. Before clock tree synthesis the clock path behaves as ideal, with equal delay from the clock source to every sink.
- The concept of clock tree synthesis (CTS) is the automatic insertion of buffers/inverters along the clock paths of the ASIC design to balance the clock delay to all clock inputs. Basically, clock gets evenly distributed throughout the design across all the sequential elements.
- There are a number of algorithms to build a clock tree:
- H Tree
- Clock Mesh
- Spine
- Fish bone
These days we use the H-Tree algorithm to complete the clock tree balancing. Let’s go into the details of the H-Tree algorithm.
Algorithm steps for the H-Tree:
- Find all the flops present.
- Find the center of all the flops.
- Trace from the clock port to the center point.
- Now divide the core into two parts and trace to the center of each part.
- From each of these centers, again divide the area into two and trace to the centers at both ends.
- Repeat until we reach the flop clock pins.
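The steps above can be sketched as a recursive bisection. This is a minimal illustration and not how any particular tool implements it: the "center" is assumed to be the mean of the sink coordinates, the split direction simply alternates between x and y, and the names `build_h_tree` and `leaf_levels` are made up for this example.

```python
from statistics import mean

def build_h_tree(sinks, depth=0, split_x=True):
    """Recursively bisect sink locations and return a nested tree of tap points."""
    # The tap point of this level is the center (mean) of the sinks it serves.
    cx = mean(x for x, _ in sinks)
    cy = mean(y for _, y in sinks)
    node = {"tap": (cx, cy), "level": depth, "children": []}
    if len(sinks) == 1:
        return node  # reached a flop clock pin
    # Divide the region into two halves, alternating the cut direction.
    axis = 0 if split_x else 1
    ordered = sorted(sinks, key=lambda p: p[axis])
    half = len(ordered) // 2
    for part in (ordered[:half], ordered[half:]):
        node["children"].append(build_h_tree(part, depth + 1, not split_x))
    return node

def leaf_levels(node):
    """Collect the tree depth at which every sink is reached."""
    if not node["children"]:
        return [node["level"]]
    return [lvl for child in node["children"] for lvl in leaf_levels(child)]

# Hypothetical flop clock-pin locations (eight sinks, a power of two).
flops = [(0, 0), (0, 4), (4, 0), (4, 4), (8, 0), (8, 4), (12, 0), (12, 4)]
tree = build_h_tree(flops)
print(tree["tap"], sorted(set(leaf_levels(tree))))
```

With a power-of-two number of sinks every sink is reached at the same depth, which is what "balanced by construction" means for the standard H-tree.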
- Standard H-Tree advantages:
- Very good cross-corner scaling behavior.
- Balanced by construction.
- Assumes shielded routes and that the effect of congestion is insignificant.
- Lower power than mesh.
- Standard H-tree disadvantages:
- Need to have power-of-two number of sinks (tap buffers).
- Need rectangular unblocked area.
- Higher power than ad-hoc CTS tree.
- Generalization of H-Tree
- Flexibility in sink placement.
- Non-rectangular floor-plans with multiple blockages supported.
- Intelligent tradeoffs made between skew and power.
- Intended to be used for top of the tree, not whole tree.
- Flexible H-tree – Limitations
- You will need to determine where to insert H-tree(s) in the clock architecture
● Requires clock architecture understanding
● User specified root pin
● User specified pre-existing leaf pins or sink grid area for new leaf buffers
● Newly inserted leaf buffers become tap buffers for multi-tap CTS
- The flexible H-tree only contains buffers or inverters
● No logic or gating
Before going for the clock tree design, we need to run sanity checks for the following:
- The design is placed; Place Exit should be complete.
- Clocks have been defined.
- Clock roots should not be on hierarchical pins.
#3. Clock Tree Synthesis Steps:
- The following steps are performed during Clock Tree Synthesis:
- Clustering
- DRV Fixing
- Insertion Delay reduction
- Power Reduction
- Balancing
- Post-Conditioning
- Clustering
Skew groups are created based on geometric location, as described in the spec file.
- DRV Fixing
At this stage the DRVs (max_tran, max_cap, max_length, max_fanout) are fixed.
- Insertion Delay Reduction
At this stage, insertion delay is minimized as much as possible, which is one of our main goals for Clock Tree Synthesis.
- Power Reduction
As we know, the clock is a major power consumer, so we need to analyze and fix the tree so that power consumption is lower.
- Balancing
The main balancing happens at this stage with the help of different clock buffers and inverters.
- Post-conditioning
At this stage, the DRVs are checked again and fixed if required.
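To make the DRV Fixing and Post-conditioning steps concrete, here is a small sketch of what a DRV check over clock nets could look like. The limit values, net names and the `NetReport` structure are hypothetical; in a real flow the limits come from the library and the CTS spec file, and the tool does this internally.

```python
from dataclasses import dataclass

@dataclass
class NetReport:
    name: str
    transition_ps: float
    cap_ff: float
    length_um: float
    fanout: int

# Hypothetical limits, for illustration only.
LIMITS = {"max_tran": 80.0, "max_cap": 50.0, "max_length": 300.0, "max_fanout": 32}

def drv_violations(net, limits=LIMITS):
    """Return the names of the design rules this net violates."""
    checks = {
        "max_tran": net.transition_ps,
        "max_cap": net.cap_ff,
        "max_length": net.length_um,
        "max_fanout": net.fanout,
    }
    return [rule for rule, value in checks.items() if value > limits[rule]]

nets = [
    NetReport("clk_trunk", 60.0, 42.0, 280.0, 8),
    NetReport("clk_leaf_17", 95.0, 55.0, 120.0, 40),
]
for net in nets:
    print(net.name, drv_violations(net))
```

Running the same check again after balancing mirrors the post-conditioning stage, where any DRV introduced by the added cells gets caught.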
#4. CTS Quality Checks:
- The following are the quality checks for Clock Tree Synthesis:
- Minimize Insertion Delay
- Skew Balancing
- Duty Cycle
- Pulse Width
- Clock Tree power consumption
- Signal Integrity and Crosstalk
Let’s discuss these topics in detail:
- Minimize Insertion Delay:
- Advantages of low latency:
- Fewer buffers, hence less power consumption. As we know, clock paths dissipate the most power.
- Cell area reduction, as there are fewer buffers in the clock path.
- Less runtime, since fewer buffers need to be inserted in the design, which also saves optimization effort.
Interview Question: If the design meets timing even though the insertion delay is high, what will be affected? Is it acceptable or not?
Answer: It will still affect power and area, and runtime increases. So we have an insertion delay target which is a must-meet.
- Skew Balancing
- Skew: Skew is the difference in clock arrival time between two clock paths (sinks). Let’s understand more through the picture.
One interview question I have seen asked even of people with 5-6 years of industry experience:
There are two designs, one with zero skew and another with some skew; which design will you choose? I will go for the one with some skew, because the clock transitions then happen at slightly different times, which helps lower IR drop.
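As a quick illustration of the skew definition, the sketch below computes skew from clock arrival times at a few sinks. The pin names and latency numbers are made up, and the helper names `global_skew` and `local_skew` are just labels chosen for this example.

```python
def global_skew(latencies_ps):
    """Largest difference in clock arrival time across all sinks."""
    return max(latencies_ps.values()) - min(latencies_ps.values())

def local_skew(latencies_ps, launch, capture):
    """Arrival-time difference between one launch/capture pair of sinks."""
    return latencies_ps[capture] - latencies_ps[launch]

# Hypothetical insertion delays (clock root to sink pin), in picoseconds.
latencies = {"ff_a/CK": 412.0, "ff_b/CK": 430.0, "ff_c/CK": 405.0}

print(global_skew(latencies))
print(local_skew(latencies, "ff_a/CK", "ff_b/CK"))
```

Global skew looks at the whole tree, while local skew only matters for sinks that actually talk to each other, which is usually the number that affects timing.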
Useful Skew:
Useful skew is a very important concept in CTS; let’s discuss it through an example.
In the above picture we can see that the first path has positive 15 ps of skew, the second path has negative 5 ps of skew and the third has positive 5 ps of skew. Among these three paths, the negative path can borrow skew from the first path, which has positive 15 ps, so that the skew is balanced between the paths. After borrowing, we can see in the below picture that the skew is positive for all three paths. This concept is called useful skewing, and the skew we borrowed is the useful skew.
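The borrowing in this example can be sketched numerically. This is only an illustration under stated assumptions: the three numbers are treated as setup margins of three consecutive register-to-register paths (R0 to R1, R1 to R2, R2 to R3), hold timing is ignored, and `apply_clock_shift` is a made-up helper. Shifting a register's clock later gives time to the path that ends at it and takes the same amount from the path that starts at it.

```python
def apply_clock_shift(margins_ps, shifts_ps):
    """margins_ps[i] is the margin of the path from register i to register i+1.
    shifts_ps[i] is extra clock delay at register i (negative means earlier)."""
    return [m + shifts_ps[i + 1] - shifts_ps[i] for i, m in enumerate(margins_ps)]

margins = [15.0, -5.0, 5.0]      # the three paths from the example
shifts = [0.0, -10.0, 0.0, 0.0]  # clock register R1 10 ps earlier

print(apply_clock_shift(margins, shifts))
```

Making R1's clock 10 ps earlier takes 10 ps from the first path and gives it to the second, so all three paths end up positive. In a real design the shift is limited by hold timing on the affected paths.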
- Duty Cycle
The basic definition of duty cycle is on_time/(on_time + off_time). In practice the on time and off time depend on the rise transition and fall transition.
Due to transition differences the duty cycle changes, and hence the timing calculation gets worse. Practically, rise and fall transitions are not the same. In the below picture we can see that the input rise and fall are 50%-50% (same), but after travelling through some elements they differ by almost 8%.
- Pulse Width
If the rise and fall transitions vary from those at the input, the crossing points of the 50% threshold shift and the pulse width can decrease. If the pulse width decreases, we might lose data that was about to be captured.
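The duty cycle formula and the pulse width effect can be put together in a short sketch. The numbers are hypothetical, and the model is deliberately simple: it assumes a cell delays the rising edge and the falling edge by different amounts and that nothing else changes.

```python
def duty_cycle(on_time, off_time):
    """Duty cycle as defined above: on_time / (on_time + off_time)."""
    return on_time / (on_time + off_time)

def output_high_time(period, input_duty, rise_delay, fall_delay):
    """High time after a cell that delays rising and falling edges differently."""
    return period * input_duty + (fall_delay - rise_delay)

# Hypothetical values in picoseconds.
period_ps = 1000.0
high_ps = output_high_time(period_ps, 0.5, rise_delay=120.0, fall_delay=80.0)
low_ps = period_ps - high_ps

min_pulse_width_ps = 450.0  # assumed library requirement, for illustration
print(duty_cycle(high_ps, low_ps))
print(min(high_ps, low_ps) >= min_pulse_width_ps)
```

Here the slower rising edge shrinks the high pulse from 500 ps to 460 ps. The more unbalanced cells the clock passes through, the more this accumulates, which is why symmetric clock cells are used in the tree.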
- Clock Tree power consumption
As we know, the clock network is the most heavily switching network in the design.
Clock tree power depends on the below two factors:
- Latency
- Transition
#1. Latency: If latency is lower, there are fewer buffers in the design, so power consumption is lower.
#2. Transition: If transition is good, power consumption is lower. Poor transition is usually what gets blamed for the power lost in this trade-off.
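The latency point can be illustrated with the standard switching power relation P = activity × C × Vdd² × f, where the clock toggles every cycle (activity of 1). The buffer counts, the capacitance per buffer, the voltage and the frequency below are all made-up numbers, and the sketch only covers switching power; the extra short-circuit power caused by slow transitions is not modeled.

```python
def switching_power_w(total_cap_f, vdd_v, freq_hz, activity=1.0):
    """Dynamic switching power: activity * C * Vdd^2 * f."""
    return activity * total_cap_f * vdd_v ** 2 * freq_hz

def clock_tree_cap_f(num_buffers, cap_per_buffer_f):
    """Very rough total switched capacitance: every buffer stage adds load."""
    return num_buffers * cap_per_buffer_f

# Hypothetical trees: 10 fF switched per buffer stage, 0.8 V, 1 GHz.
deep = switching_power_w(clock_tree_cap_f(1000, 10e-15), 0.8, 1e9)
shallow = switching_power_w(clock_tree_cap_f(700, 10e-15), 0.8, 1e9)

print(f"deep tree: {deep * 1e3:.2f} mW, shallow tree: {shallow * 1e3:.2f} mW")
```

Fewer buffers means less switched capacitance, so the lower-latency tree burns proportionally less power at the same voltage and frequency.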
- Signal Integrity and Crosstalk:
Let’s take an example. I am working from home, attending a meeting in my study room, and my family is sitting in the TV room watching the news. Even though I am sitting far away, I can still hear the TV sound, which is an extra, unnecessary element for me. I don’t want to listen to it, but I still have to. This extra element is NOISE for me. In VLSI we have the same situation with routed nets: even though each net is on its own track, it is impacted by noise from other nets. This unwanted coupling is called crosstalk, and signal integrity is the ability of a signal to stay intact in spite of it.
Signal integrity and crosstalk is one of the quality checks of the clock routes. If we have crosstalk, we might lose data or gain some extra data/logic which was not required. The below picture explains the RC extraction of 2 routed metal layers.
Basically, crosstalk results in two major issues:
#1. Functional Glitch
#2. Timing variations
The reasons for crosstalk are below:
- Increasing number of metal layers
- Routing congested Design
- Thin and long metal routes
- Faster waveform due to higher frequencies
- Low voltage design
Practically we can get rid of the crosstalk using below methods:
- Upsize driver of the net having crosstalk
- Layer promotion for the net
- Apply NDR to those nets
- Shielding
- Break the long nets to avoid long traverse
- Aggressor downsizing
- Aggressor rip up
- Guard Ring
# Runtime:
- Runtime is the time it takes to build the clock tree in a design, optimize for the QoR targets and route the clock nets.
- Basically, during optimization we lose more time if the constraints are tight, as more effort is needed to achieve the targets after the clock tree is built. Since both setup and hold need to be taken care of, timing optimization takes more time.
# Clock Tree Structure:
The clock tree is divided into three parts, Top, Trunk and Leaf, to understand the CTS quality and balancing better. Below is the picture explaining the clock tree structure.
#Clock Tree Network:
Clock tree information can be as per below:
- A clock tree is the transitive fanout of a root pin.
- A sink can belong to more than one clock tree.
Root pins: The starting point of the clock signal.
Internal Pins: The pins through which the clock propagates from root to sink.
Sink Pins: The terminal points of a clock signal: sequential element pins, stop pins or ignore pins.
#Clock Tree Exceptions:
The clock tree exceptions are as follows:
- Stop Pin – No buffer/inverter insertion beyond this point. (Don’t touch scenario)
- Ignore Pin (Float Pins) – No DRV, No Balance
- Exclude Pin – DRV Fixing but no balancing
- Through Pin – DRV Fixing as well as Balancing
# Timing Analysis and fixing:
As we know, the best approach is to look at the problem and root-cause it before trying to solve it, so let’s first see a few sets of issues which can cause timing problems after clock tree synthesis.
Below can be the reasons for broken timing; let’s investigate them in detail:
- Clock latencies, skews and uncertainties:
- What is the uncertainty vs. the clock period?
- Are the different clocks correctly balanced (unless there are false paths)?
- Cell distribution over the path:
- Are there suspiciously long buffering chains (>10 buffers back to back)?
- Is it a short (< 5 instances) or a deep logic level path (> 30 instances)?
- Are the drive strengths chosen correctly?
- Are the correct library cells being used (fast cells for timing critical paths)?
- Net load, slew, fanout and wire length:
- Are there unexpectedly large fanouts (> 50) or long nets (> 1000 um)?
- Are there nets with unexpectedly large load or slew compared to other nets?
- Instance and net delay
- Are there instances or nets with unexpectedly large delay (> 5x) compared to others?
- Net and Cell derating
- Are the derating values realistic? (between 0.8 and 1.2)
- Congested region: Do we have a congested region where a clock buffer/inverter could not be placed and so ended up far away? If yes, then we need to de-congest that area so that the clock buffer/inverter gets a proper physical location. Both placement and routing congestion can break timing.
- Is the placement compacted or widely spread over the floorplan?
- Are the instances correctly spread from the start point to the endpoint?
- We need to look at the path topology, whether it is straight or detoured.
- Are the channels we created during floorplan sized correctly?
- Does the path cross power domains or get detoured due to huge macros present in the design?
- Many times power domain shapes matter, so it is good to check that the shape of the power domain is neither too big nor too small. I have personally faced this issue many times, so we need to take care of the power domain shape to converge timing.
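The numeric thresholds in the checklist above can be collected into a small screening function. This is a hedged sketch: the dictionary field names are hypothetical and would have to be filled from your own timing report, and "5x compared to others" is interpreted here as 5 times the median stage delay, which is an assumption.

```python
from statistics import median

def path_red_flags(path):
    """Return the checklist items a timing path trips. Field names are hypothetical."""
    flags = []
    if path["longest_buffer_chain"] > 10:
        flags.append("long buffer chain")
    if path["logic_levels"] < 5:
        flags.append("short path")
    elif path["logic_levels"] > 30:
        flags.append("deep logic path")
    if path["max_fanout"] > 50:
        flags.append("large fanout")
    if path["max_net_length_um"] > 1000:
        flags.append("long net")
    delays = path["stage_delays_ps"]
    if max(delays) > 5 * median(delays):
        flags.append("outlier delay")
    if not all(0.8 <= d <= 1.2 for d in path["derates"]):
        flags.append("unrealistic derate")
    return flags

# Made-up numbers for one failing path.
example = {
    "longest_buffer_chain": 12,
    "logic_levels": 34,
    "max_fanout": 18,
    "max_net_length_um": 1450.0,
    "stage_delays_ps": [20, 22, 25, 21, 140],
    "derates": [0.95, 1.05],
}
print(path_red_flags(example))
```

The physical questions in the list (congestion, channels, power domain shape) still need a look at the layout; a script like this only narrows down which paths deserve that look.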
Now that we have looked into many reasons, let’s see the steps we can use to debug timing in the design at post-CTS:
- Look into the timing debug window of the tool for the worst negative slack path.
- Check the path and understand the driver and receiver cells and the cells present in the path.
- Look at the insertion delay and check whether some cells have a high value.
- Check the location of that particular cell in the design. Is it too far from the driver? Did the net get detoured?
- Did the leaf (driver) cell get stuck in a channel between macros or near a highly congested placement area?
- Find out why this cell got a high insertion delay and got detoured. If it is a channel between macros, block the area with a hard blockage if possible.
#Integrated Clock Gating (ICG) Cell and related concepts:
Our target in closing the design is always to meet PPA. The clock consumes most of the power, as it has high switching activity. To be specific, the clock consumes almost 20% to 40% of the dynamic power, and within the clock tree about 80% of that power is consumed by the last stage of the tree (the leaf cells and nearby). There are many ways to reduce dynamic power, and one that is used in almost every complex design is the Integrated Clock Gating cell. Without clock gating the clock has very high activity, and after using clock gating we see much less activity. Let’s discuss in detail what this cell is, how, where and when we use it, and how it affects the circuit. We will discuss the timing calculation further in the timing section.
There are basically two types of clock gating cell.
- Clock gating using AND gate
- Clock gating using Integrated clock gating cell
# Clock gating using AND gate
Normally the clock keeps toggling at the sequential elements even when there is no new data at the data pin, which we really don’t need. To stop the clock when there is no data, we need an additional element that controls clock propagation so that the clock propagates when data is there and stops when it is not. In the below circuit we can see that the flip-flop clock input is driven by a 2-input AND gate, where one input is the original clock signal and the other is the Enable signal. The property of an AND gate is that the output is logic 1 only if both inputs are logic 1, so the clock propagates only when the Enable signal is 1. Using this method we save dynamic power by stopping clock transitions when they are not required.
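A tiny discrete-time model makes the weakness of plain AND gating visible. The waveforms below are hypothetical: the clock is sampled four times per period (two samples high, two low) and the enable is assumed to rise in the middle of a high phase. The helper names are made up for this sketch.

```python
def and_gate_clock(clk, enable):
    """Gated clock from a plain 2-input AND gate."""
    return [c & e for c, e in zip(clk, enable)]

def high_pulse_widths(signal):
    """Lengths (in samples) of each run of consecutive 1s."""
    widths, run = [], 0
    for bit in signal:
        if bit:
            run += 1
        elif run:
            widths.append(run)
            run = 0
    if run:
        widths.append(run)
    return widths

clk = [1, 1, 0, 0] * 3        # three clock periods
enable = [0] * 5 + [1] * 7    # enable rises while the clock is high
gated = and_gate_clock(clk, enable)

print(gated)
print(high_pulse_widths(gated))
```

A healthy clock pulse is two samples wide here, so the one-sample pulse in the gated clock is a truncated pulse, that is, a glitch that the downstream flops may or may not see as a clock edge.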
# Clock gating using Integrated clock gating cell
We now know our motive for using clock gating. The problem with clock gating using an AND gate is that the circuit might glitch, so we need a solution that avoids this. Here integrated clock gating comes into the picture. The integrated clock gating cell is made up of a latch and an AND cell. Let’s look at the below circuit and understand.
The integrated clock gating cell uses an enable signal from the design, or an external signal can also control it. If we insert an ICG cell in the clock path, a new circuit and a new timing path group come into the picture, called the **clock gating path** group. We will discuss more about these in the timing section. ICG is a must for all low power designs, as it saves a huge amount of dynamic power.
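The latch plus AND structure can be modeled in a few lines to see why it removes the glitch. This is a behavioral sketch only, assuming the common arrangement where the latch is transparent while the clock is low and holds its value while the clock is high. The stimulus is hypothetical: four samples per clock period, with the enable rising in the middle of a high phase.

```python
def icg(clk, enable):
    """Behavioral ICG: low-transparent latch on enable, then AND with the clock."""
    latched, out = 0, []
    for c, e in zip(clk, enable):
        if c == 0:
            latched = e          # latch follows enable only while clock is low
        out.append(c & latched)  # enable cannot change while clock is high
    return out

def high_pulse_widths(signal):
    """Lengths (in samples) of each run of consecutive 1s."""
    widths, run = [], 0
    for bit in signal:
        if bit:
            run += 1
        elif run:
            widths.append(run)
            run = 0
    if run:
        widths.append(run)
    return widths

clk = [1, 1, 0, 0] * 3        # three clock periods
enable = [0] * 5 + [1] * 7    # enable rises while the clock is high

plain_and = [c & e for c, e in zip(clk, enable)]
print(high_pulse_widths(plain_and))
print(high_pulse_widths(icg(clk, enable)))
```

The plain AND gate lets through a truncated pulse, while the ICG model waits for the next low phase to pick up the new enable value, so only full-width pulses come out. The price is the half cycle check on the enable path, because the enable has to settle before the latch closes.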
A few more points on integrated clock gating:
- Until the final synthesis stage the ICG timing does not get noticed.
- Timing violations are seen only after Clock Tree Synthesis.
- ICG cells are not skew balanced with the registers.
- A half cycle path is introduced, which means the time available is less than the cycle time.
- Mostly affects timing critical blocks.
- The best place to insert ICG cells is near the leaf cells of the clock tree.
- The clock enable input of the ICG should be generated in a functionally related module or the same module.
- ICG cells can be cloned if too many leaf cell groups are driven by a single ICG. One disadvantage of cloning is that latency increases.
#Clock Tree Route and NDR
Once we are done with clock tree balancing and all the clock cells are placed, we do the final routing of the clock nets and fix it; whatever routing resource is left after clock routing is used for signal routing. Now the question is which metal layers to use and how to route to avoid issues in later stages. Generally, we use mid layers like M5-M8 in TSMC 7nm, where M0-M13 metal layers are available. We know that higher metal layers have lower resistance than the lower metal layers, so why do we stop at M5-M8? Normally we use M12, M11, M10 and M9 for power routing and the remaining layers for clock routing. Also, reaching the top layers needs a tall stack of vias, which is another issue. Hence M5-M8 is the best choice for clock routing.
There is always a defined metal width and spacing from the foundry which is used for routing, but if required we change the width and spacing of the metal layers to achieve our PPA target. We can increase the width and spacing but can’t go below what the foundry has specified. During clock tree building our targets are timing as well as power, since the clock is a continuously switching signal and hence consumes more power. So, to achieve good PPA we need a non-default routing rule. Normally we use double width and double spacing for clock routing; this might change depending on the requirement.
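A first-order calculation shows why double width and double spacing help. Wire resistance follows R = ρ·L / (W·t), so doubling the width halves the resistance. For the coupling to a neighbour, the sketch assumes a simple parallel-plate approximation in which the coupling capacitance is inversely proportional to spacing; real extraction is more involved. The dimensions and resistivity below are made-up illustration values, not foundry data.

```python
def wire_resistance_ohm(rho_ohm_um, length_um, width_um, thickness_um):
    """Resistance of a rectangular wire: rho * L / (W * t)."""
    return rho_ohm_um * length_um / (width_um * thickness_um)

def coupling_cap_ratio(default_spacing_um, ndr_spacing_um):
    """Coupling cap relative to default spacing (parallel-plate assumption)."""
    return default_spacing_um / ndr_spacing_um

# Hypothetical wire: 100 um long, default width 0.04 um, thickness 0.08 um.
r_default = wire_resistance_ohm(0.02, 100.0, 0.04, 0.08)
r_ndr = wire_resistance_ohm(0.02, 100.0, 0.08, 0.08)  # double width

print(r_ndr / r_default)
print(coupling_cap_ratio(0.04, 0.08))  # double spacing
```

Lower resistance improves the clock transition and delay, and lower coupling reduces crosstalk, at the cost of each clock net taking more routing tracks.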
#Few logical questions on CTS:
- Difference between normal buffer and clock buffer?
Ans- Clock buffers have equal rise time and fall time; therefore, pulse width violations are avoided. Normal buffers may not have equal rise and fall times. Clock buffers are usually designed such that an input signal with 50% duty cycle produces an output with 50% duty cycle.
- What if timing is met and insertion delay is still more. Is it okay to proceed? What can be impacted?
If timing is met but the insertion delay is still high, it means there is too much clock tree buffering, hence the entire clock tree will consume more power. To avoid this situation, we use multi-point CTS balancing and create local skew groups.
- Why do we do hold analysis after CTS only?
Before clock tree synthesis our clock propagation is ideal and the clock tree has not yet been built. Once the clock tree is built, we can go for hold analysis. (Skew is zero until we build the clock tree; let’s discuss more in the timing section.)
- If there are buffers and inverters present in the library to use to build clock tree which one you will prefer and why?
For clock tree balancing we need both buffers and inverters. As we know, a buffer is made up of an even number of inverters in series, hence using buffers we have more power consumption, so inverters are preferred wherever possible.