Area, Power, Delay Performance Analysis of Logic Element Using Various Architecture

Shilpa.B¹, Udaiya Kumar. R², Sankaranarayanan.K³

Department of ECE¹, Associate Professor and Head of ECE², ECET³

ABSTRACT: Field Programmable Gate Arrays (FPGAs) are considered power-hungry devices and are therefore not preferred for portable applications; reducing the power of the Logic Element (LE) would make FPGAs more suitable for portable devices. Recently, many techniques have been proposed for power reduction in the conventional LEs of reconfigurable computing devices such as FPGAs. The conventional logic element in an FPGA consists of a Look-Up Table (LUT), a D flip-flop as the storage element, and a multiplexer (MUX). The output of the LUT is connected to the inputs of the multiplexer and the storage element, and the clock input of the flip-flop is always connected to the clock signal. Understanding the suitability of flip-flops and multiplexers, and selecting the best topology for a given application, is important for meeting the requirements of a low-power, high-performance design. This paper presents a broad comparison in terms of area (transistor count), delay and power dissipation. The analysis of 16x1 multiplexers and D flip-flops for area, power dissipation and propagation delay is carried out at 16nm technology in TANNER SPICE TOOL V 13.0. The aim is to optimize the static and dynamic power in the logic element of the FPGA using the 16nm BSIM-4 Predictive Technology Model (PTM) and to study the impact.

Keywords: FPGA’s, LE’s, LUT, BSIM, PTM, ASIC

I. INTRODUCTION

A field-programmable gate array (FPGA) is an integrated circuit designed to be configured by the customer or designer after manufacturing. The FPGA configuration is generally specified using a hardware description language (HDL), similar to that used for an application-specific integrated circuit (ASIC). FPGAs can be used to implement any logical function that an ASIC could perform. FPGAs contain programmable logic components called "logic blocks", and a hierarchy of reconfigurable interconnects that allow the blocks to be "wired together".

While multiplexers are primarily thought of as “data selectors” because they select one of several inputs to be logically connected to the output, they can also be used to implement Boolean functions. A multiplexer has n inputs and an m-bit select word that connects one of the inputs to the output. To cover every input, m is chosen such that m = log₂(n), where m is the number of select lines.

A flip-flop is a bistable circuit that stores a logic state of 0 or 1 in response to a clock pulse and one or more data inputs. A large proportion of digital circuits are synchronous designs, which operate on a clock signal to reduce the complexity of the circuit design. In the design of sequential circuits, a major challenge is the design of an efficient D flip-flop (DFF). Several static and dynamic DFF architectures have been proposed [5],[6],[8],[10] and [14]. True-Single-Phase Clocking (TSPC), proposed in [11], is an efficient methodology for achieving very high-speed VLSI design. TSPC is safer and takes less clock signal routing area.
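
The relation between inputs and select lines, and the use of a multiplexer as a Boolean function generator, can be sketched behaviourally as follows. This is a minimal illustration only: the function names and the XOR example are assumptions introduced here and are not taken from the circuits analyzed in this paper.

```python
def select_line_count(n_inputs: int) -> int:
    """Number of select lines m needed to address n inputs (m = log2(n),
    rounded up when n is not a power of two)."""
    if n_inputs < 1:
        raise ValueError("a multiplexer needs at least one input")
    return (n_inputs - 1).bit_length()


def mux(inputs, select: int) -> int:
    """Behavioural multiplexer: connect inputs[select] to the output."""
    if not 0 <= select < len(inputs):
        raise ValueError("select word out of range")
    return inputs[select]


# A 4x1 multiplexer used as a function generator: the data inputs are tied
# to the truth table of XOR and the operands drive the select lines.
XOR_TRUTH_TABLE = [0, 1, 1, 0]


def xor_via_mux(a: int, b: int) -> int:
    return mux(XOR_TRUTH_TABLE, (b << 1) | a)
```

For a 16x1 multiplexer this gives four select lines, which matches the S0 to S3 lines used later in the paper.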

II. FPGA ARCHITECTURE & POWER DISSIPATION:

FPGA Architecture:

The most common FPGA architecture (Fig.1) consists of an array of logic blocks (called Configurable Logic Block, CLB, or Logic Array Block, LAB, depending on the vendor), I/O pads, and routing channels. Generally, all the routing channels have the same width (number of wires). Multiple I/O pads may fit into the height of one row or the width of one column in the array. Each logic block (CLB or LAB) consists of a few logic cells (called ALM, LE, slice, etc.).

III. POWER DISSIPATION:

Power dissipation is the rate at which energy is taken from the source and converted into heat. The total power dissipation is given by

\[ P_{\text{total}} = P_{\text{static}} + P_{\text{dynamic}} \]  

(1)

i. Dynamic power dissipation:

Dynamic power dissipation is caused by transient switching from one state to another, together with short-circuit power dissipation. In submicron technology, dynamic power consumption contributes significantly to overall power consumption. Signals in CMOS devices transition back and forth between two logic levels, charging and discharging the parasitic capacitance. Every time a capacitive node switches from ground to supply, an energy of \( C_L V_{DD}^2 \) is consumed, so the dissipation depends on the switching activity of the signal. This transient switching is the dominant factor of power dissipation. In addition, for a short instant during each transition both the PMOS and NMOS transistors are “on” simultaneously, so a direct path (short circuit) exists between \( V_{DD} \) and GND. The duration of this interval depends on the input and output transition (rise and fall) times. The dynamic power consumption is given by

\[ P_{\text{dyn}} = \frac{1}{2} C_L V_{DD}^2 f_c \]  

(2)

Where

- \( C_L \) = load capacitance
- \( V_{DD} \) = supply voltage
- \( f_c \) = clock frequency

These parameters are not completely orthogonal and cannot be optimized independently.

ii. Static power dissipation:

Static power dissipation is caused mainly by leakage current. When a CMOS integrated circuit is not switching, there should be no DC current path from \( V_{DD} \) to ground and the device should not draw any supply current at all. However, due to the inherent nature of semiconductors, a small amount of leakage current flows across all reverse-biased junctions on the integrated circuit. These leakages are caused by thermally generated charge carriers in the diode area; as the temperature of the diode increases, so does the number of these unwanted charge carriers. Leakage currents also increase with the reduction of threshold voltage, channel length and oxide thickness. Threshold voltage scaling results in a substantial increase in subthreshold leakage current. Oxide thickness has to be reduced in proportion to the channel length, and this reduction increases the electric field across the oxide.

### iii. Power Hungry Parts in FPGA

The power hungry parts in FPGA are Logic Block, Routing Channel, Input/output pads. One logic block is a cluster of lookup tables (LUTs) with the cluster size $N$ (i.e., the number of LUTs inside one cluster) and the LUT size $k$ (i.e., the number of inputs to the LUT) as the architectural parameters. Logic blocks are embedded into the routing resources as logic “islands” and segmented wires are used to connect these logic “islands”. There are also switches (called connection blocks) connecting the wire segments to the logic block inputs and outputs. The logic blocks are connected by a two-dimensional (2-D) mesh-like interconnect structure.

### iv. Need for low power FPGA:

- To reduce Device temperature
- To reduce Failure rate
- To reduce Cooling and packaging costs
- To increase Life of the battery
- To reduce System cost
- To reduce overall energy consumption

### IV. LOGIC BLOCK OF FPGA

The FPGA architecture is built from a large number of ‘simple’ programmable logic blocks embedded in a massive fabric of programmable interconnects: from 1,000 to more than 100,000 logic block ‘islands’ sit in a ‘sea’ of interconnect wires. Connections between wires are programmed by turning transistors at the wire junctions on or off, similar to how a programmable logic device (PLD) works (using a floating-gate CMOS transistor). Large numbers of PLBs or CLBs can be wired together using this technique. Input/output from the FPGA is handled via special I/O pads, which themselves also contain sequential logic circuitry. The logic block of the FPGA (Fig.2) consists of logic functions implemented in a lookup table (Fig.3), multiplexers (which select 1 of N inputs), and flip-flops, registers and other clocked storage elements. The LUT contains memory cells to implement small logic functions; each cell holds ‘0’ or ‘1’ [1],[7] and [5].

![Figure 2: Logic block of FPGA](image)

![Figure 3: LUT Table Implementation](image)
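
The way a LUT implements a small logic function with memory cells can be sketched as a behavioural model. The class and helper names below are hypothetical, and the bit ordering of the address (input a as the least significant bit) is an assumption made for this example.

```python
class Lut4:
    """4-input LUT: 16 one-bit memory cells addressed by the input word."""

    def __init__(self, config_bits):
        if len(config_bits) != 16 or any(b not in (0, 1) for b in config_bits):
            raise ValueError("expected 16 configuration bits, each 0 or 1")
        self.cells = list(config_bits)

    def evaluate(self, a: int, b: int, c: int, d: int) -> int:
        # The inputs form the address of the memory cell that is read out.
        address = (d << 3) | (c << 2) | (b << 1) | a
        return self.cells[address]


def program_lut(func) -> Lut4:
    """Fill the cells by enumerating the truth table of func(a, b, c, d)."""
    bits = [func(i & 1, (i >> 1) & 1, (i >> 2) & 1, (i >> 3) & 1)
            for i in range(16)]
    return Lut4(bits)


# Example: configure the LUT as a 4-input AND gate.
and4 = program_lut(lambda a, b, c, d: a & b & c & d)
```

Reading a cell by its address is exactly a 16x1 multiplexer operation, which is why the LUT is analyzed as a 16x1 multiplexer in the following section.
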
V. EXPERIMENTAL SETUP

Four variants of the 16x1 multiplexer are considered: a standard 16x1 multiplexer, a 16x1 multiplexer built from 8x1 multiplexers, a 16x1 multiplexer built from 4x1 multiplexers, and a 16x1 multiplexer using Pass Transistor Logic (PTL).

4.1. Standard 16 x 1 Multiplexer:

The standard 16x1 multiplexer (Fig. 4) consists of 16 inputs, 1 output and 4 select lines. The 16 inputs are A to P and the 4 select lines are S0, S1, S2 and S3. The combination of values on the select lines determines which one of the 16 inputs appears at the output.

4.2. 16 x 1 Multiplexer Using 8x1:

The 16x1 multiplexer using 8x1 (Fig. 5) consists of two 8x1 multiplexers and one 2x1 multiplexer. Each 8x1 multiplexer has 8 inputs, 3 select lines and one output. The first 8x1 multiplexer takes inputs A to H and the second takes inputs I to P; both use the select lines S0, S1 and S2. Their outputs are inverted and fed to the inputs of the 2x1 multiplexer, whose select line is S3. The output of the 2x1 multiplexer is inverted again to give the final output.
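
The decomposition described above can be checked with a small behavioural model. This is a sketch under stated assumptions: the helper names are hypothetical, S0 is taken as the least significant select bit, and the two inversions are modelled as ideal logic inverters rather than transistor-level stages.

```python
def mux(inputs, select: int) -> int:
    """Behavioural multiplexer: connect inputs[select] to the output."""
    return inputs[select]


def inv(bit: int) -> int:
    """Ideal logic inverter."""
    return 1 - bit


def mux16_from_mux8(data, s0: int, s1: int, s2: int, s3: int) -> int:
    """16x1 multiplexer built from two 8x1 stages and one 2x1 stage.

    data holds the 16 inputs A..P in order (data[0] is A).
    """
    low_select = (s2 << 2) | (s1 << 1) | s0
    first = inv(mux(data[0:8], low_select))    # inputs A..H, output inverted
    second = inv(mux(data[8:16], low_select))  # inputs I..P, output inverted
    # S3 picks one of the two inverted outputs; the final inversion
    # restores the original polarity.
    return inv(mux([first, second], s3))
```

Because the signal passes through two inversions, the model returns the selected input with its original polarity.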
4.3. 16 x 1 Multiplexer Using 4x1:

The 16x1 multiplexer using 4x1 (Fig.6) consists of five 4x1 multiplexers. Each 4x1 multiplexer has four inputs and two select lines. The first multiplexer takes inputs A to D, the second E to H, the third I to L and the fourth M to P; all four use the select lines S0 and S1. The fifth multiplexer takes the outputs of the other four multiplexers as its inputs and uses the select lines S2 and S3. The output of the fifth multiplexer is inverted to give the final output.

![Figure 6: 16x1 Multiplexer using 4x1](image)
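
A behavioural sketch of this two-level structure is given below. The function names are hypothetical and S0 is assumed to be the least significant select bit. The model is purely logical: the output inversion mentioned above is treated as a circuit-level detail and is not modelled, so the function returns the selected input directly.

```python
def mux4(d0: int, d1: int, d2: int, d3: int, sel_lo: int, sel_hi: int) -> int:
    """Behavioural 4x1 multiplexer with a two-bit select word."""
    return (d0, d1, d2, d3)[(sel_hi << 1) | sel_lo]


def mux16_from_mux4(data, s0: int, s1: int, s2: int, s3: int) -> int:
    """16x1 multiplexer built from five 4x1 multiplexers.

    data holds the 16 inputs A..P in order (data[0] is A).
    """
    # First level: four multiplexers share S0 and S1 (A..D, E..H, I..L, M..P).
    stage1 = [mux4(*data[i:i + 4], s0, s1) for i in range(0, 16, 4)]
    # Second level: the fifth multiplexer selects one group with S2 and S3.
    return mux4(*stage1, s2, s3)
```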

4.4. 16 x 1 Multiplexer Using Pass Transistor:

Pass Transistor Logic (PTL) describes several logic families used in the design of integrated circuits (Fig.7). It reduces the number of transistors used to make different logic gates by eliminating redundant transistors. Transistors are used as switches to pass logic levels between nodes of a circuit, instead of as switches connected directly to the supply voltages. This reduces the number of active devices, but has the disadvantage that the voltage difference between the high and low logic levels decreases at each stage.

![Figure 7: Pass transistor based on 16x1 Multiplexer](image)
4.5. Tabulation 1:

Table 1: Comparison of 16x1 multiplexer

<table>
<thead>
<tr>
<th>ARCHITECTURE</th>
<th>NO. OF TRANSISTORS</th>
<th>DYNAMIC POWER (nW)</th>
<th>STATIC POWER (nW)</th>
<th>DELAY (ns)</th>
<th>PDP (×10⁻¹⁸ J)</th>
<th>% OF REDUCTION</th>
</tr>
</thead>
<tbody>
<tr>
<td>STANDARD 16x1</td>
<td>162</td>
<td>0.506</td>
<td>5.6</td>
<td>25</td>
<td>15.15</td>
<td>-</td>
</tr>
<tr>
<td>USING 8x1</td>
<td>142</td>
<td>0.422</td>
<td>5.4</td>
<td>15</td>
<td>6.33</td>
<td>30.30</td>
</tr>
<tr>
<td>USING 4x1</td>
<td>130</td>
<td>0.511</td>
<td>4.7</td>
<td>28</td>
<td>14.33</td>
<td>15.67</td>
</tr>
<tr>
<td>USING PASS TRANSISTOR LOGIC (PTL)</td>
<td>35</td>
<td>0.129</td>
<td>0.26</td>
<td>20</td>
<td>2.58</td>
<td>78.70</td>
</tr>
</tbody>
</table>

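From the results in Table 1, it is observed that the 16x1 multiplexer using pass transistor logic has the lowest dynamic power, static power and Power Delay Product (PDP) of the four architectures. Its delay of 20 ns is lower than that of the standard (25 ns) and 4x1-based (28 ns) designs, although the 8x1-based design has the lowest delay (15 ns).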
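
The power delay product is the product of power and delay. As a hedged illustration, the sketch below recomputes it from the dynamic power and delay columns for the 8x1-based and PTL rows of Table 1; the function name is hypothetical, and the use of the dynamic power column is an assumption inferred from the tabulated values. A power in nW multiplied by a delay in ns gives an energy in units of 10⁻¹⁸ J.

```python
def power_delay_product(power_nw: float, delay_ns: float) -> float:
    """PDP in units of nW*ns, that is 1e-18 J."""
    return power_nw * delay_ns


# (dynamic power in nW, delay in ns) copied from two rows of Table 1.
rows = {
    "USING 8x1": (0.422, 15),
    "USING PASS TRANSISTOR LOGIC (PTL)": (0.129, 20),
}

pdp = {name: round(power_delay_product(p, d), 2) for name, (p, d) in rows.items()}
for name, value in pdp.items():
    print(f"{name}: PDP = {value}")
```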

Figure 8: Architecture Vs dynamic, static and power delay product

Fig. 8 compares the power of the various 16x1 multiplexer architectures. Compared to all the other architectures, Pass Transistor Logic (PTL) is the most efficient: it has the lowest static power, dynamic power and Power Delay Product.

4.6. 11 TSPC Data Flip flop:

A True Single Phase Clock (TSPC) dynamic CMOS circuit is operated with one clock signal that is never inverted. Therefore no clock skew exists, apart from clock delay problems, and higher clock frequencies can be achieved. TSPC is used for high-speed and low-power operation. The flip-flop consists of eleven transistors, as shown in (Fig. 8), where the clock-switching transistors are placed closer to power/ground for high-speed operation. The state transition of the flip-flop occurs at the rising edge of the clock signal.
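
The rising-edge behaviour described above can be captured in a small behavioural model. This sketch only reproduces the input/output behaviour of an edge-triggered D flip-flop; it does not model the eleven-transistor TSPC circuit itself, and the class name is hypothetical.

```python
class RisingEdgeDFF:
    """Behavioural D flip-flop that samples D on the rising clock edge."""

    def __init__(self):
        self.q = 0
        self._prev_clk = 0

    def step(self, d: int, clk: int) -> int:
        # A rising edge is a 0 -> 1 transition of the single clock signal.
        if self._prev_clk == 0 and clk == 1:
            self.q = d
        self._prev_clk = clk
        return self.q
```

Between rising edges the stored value is held regardless of changes on D.
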
4.7. Nand Based D-Flip flop:

In the NAND-based D flip-flop, the outputs can change state when the clock input falls to logic 0, and the Q output always takes on the state of the D input at the moment of the clock edge. It is constructed from master and slave latches: the output of the master latch is the input of the slave latch, and the output of the slave latch is the output of the flip-flop. The input data D is captured on the clock edge using the clock signals CLK and CLKB, as shown in (Fig 9).
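
The master-slave arrangement can be sketched behaviourally with two level-sensitive latches. The class names are hypothetical, and the choice of which latch is transparent on which clock phase (master on CLK, slave on CLKB) is an assumption made so that the output changes on the falling edge, as described above. The NAND gates themselves are not modelled.

```python
class DLatch:
    """Level-sensitive latch: transparent while enable is 1."""

    def __init__(self):
        self.q = 0

    def step(self, d: int, enable: int) -> int:
        if enable:
            self.q = d
        return self.q


class MasterSlaveDFF:
    """D flip-flop built from a master latch followed by a slave latch."""

    def __init__(self):
        self.master = DLatch()
        self.slave = DLatch()

    def step(self, d: int, clk: int) -> int:
        clkb = 1 - clk  # complementary clock CLKB
        master_q = self.master.step(d, clk)      # master follows D while CLK = 1
        return self.slave.step(master_q, clkb)   # slave updates while CLK = 0
```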

4.8 Tabulation II:

Table 2: comparison of D-Flip flop

<table>
<thead>
<tr>
<th>ARCHITECTURE</th>
<th>NO. OF TRANSISTORS</th>
<th>DYNAMIC POWER (mW)</th>
<th>STATIC POWER (μW)</th>
<th>DELAY (ns)</th>
<th>PDP (×10⁻¹⁰ J)</th>
<th>% OF REDUCTION</th>
</tr>
</thead>
<tbody>
<tr>
<td>NAND BASED DFF</td>
<td>32</td>
<td>9.15</td>
<td>5.65</td>
<td>24.7</td>
<td>2.26</td>
<td>-</td>
</tr>
<tr>
<td>11 TSPC</td>
<td>11</td>
<td>4.29</td>
<td>2.62</td>
<td>2.12</td>
<td>0.09</td>
<td>53</td>
</tr>
</tbody>
</table>

From the results in Table 2, it is observed that the 11-transistor TSPC flip-flop has lower dynamic power, static power, delay and Power Delay Product (PDP) than the NAND-based DFF.
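
The figures in Table 2 can be cross-checked with a short calculation. This is a sketch under two assumptions that the text does not state explicitly: that the "% of reduction" column refers to dynamic power relative to the NAND-based DFF, and that the PDP is the product of dynamic power and delay. The function names are hypothetical.

```python
def percent_reduction(baseline: float, improved: float) -> float:
    """Reduction of `improved` relative to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


def pdp_joules(power_mw: float, delay_ns: float) -> float:
    """Power delay product in joules from power in mW and delay in ns."""
    return (power_mw * 1e-3) * (delay_ns * 1e-9)


# Dynamic power (mW) and delay (ns) copied from Table 2.
NAND_DFF = {"dynamic_mw": 9.15, "delay_ns": 24.7}
TSPC_DFF = {"dynamic_mw": 4.29, "delay_ns": 2.12}

dynamic_saving = percent_reduction(NAND_DFF["dynamic_mw"], TSPC_DFF["dynamic_mw"])
pdp_nand = pdp_joules(NAND_DFF["dynamic_mw"], NAND_DFF["delay_ns"])
pdp_tspc = pdp_joules(TSPC_DFF["dynamic_mw"], TSPC_DFF["delay_ns"])

print(f"dynamic power reduction: {dynamic_saving:.1f} %")
print(f"PDP NAND DFF: {pdp_nand:.3e} J, PDP 11 TSPC: {pdp_tspc:.3e} J")
```

Under these assumptions the dynamic power reduction rounds to the 53 % listed in the table.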

**4.9 Standard 2 x 1 Multiplexer:**

A two-to-one multiplexer is a combinational circuit that uses one control switch (S) to connect one of two input data lines (A or B) to a single output as shown in (Fig.10). Only one of the input data lines can be aligned to the output of the multiplexer at any given time. It’s like sharing ice-cream on a date with one spoon.

![Figure 10: Standard 2x1 Multiplexer](image)
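
The selection behaviour of the two-to-one multiplexer can be written as a Boolean expression. Assuming, as a convention not fixed by the text, that S = 0 selects A and S = 1 selects B, and writing Y for the output (a symbol introduced here for illustration):

\[ Y = \overline{S}\,A + S\,B \]

With S = 0 the second term vanishes and Y = A; with S = 1 the first term vanishes and Y = B, so only one data line reaches the output at any given time.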

**4.10 PTL Of 2x1 Multiplexer:**

Pass transistor logic is potentially very effective and uses fewer transistors. As shown in (Fig.11), it reduces the number of transistors used to make different logic gates by eliminating redundant transistors. If several devices are chained in series in a logic path, a conventionally constructed gate may be required to restore the signal voltage to its full value.

![Figure 11: PTL of 2x1 Multiplexer](image)

**VI. OPTIMIZED FPGA LOGIC BLOCK:**

The optimized logic block consists of the pass-transistor-based 16x1 multiplexer, the 2x1 multiplexer and the 11-transistor TSPC D flip-flop. Each individual block is analyzed first, and for each block the most efficient implementation among the alternatives is identified. Combining the most efficient implementation of each block gives the optimized FPGA logic block (Fig.12), which consumes less power and has less delay.

Figure 12: Optimized FPGA logic block diagram using 16nm technology. A-16x1 multiplexer using Pass Transistor Logic (PTL), B-clock signal, C-11 True Single Phase Clock (TSPC) D-flip-flop, D, E-select lines, F-output waveform for optimized FPGA logic block
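
How the three blocks work together can be sketched with a behavioural model of one logic element. This is an illustrative assumption rather than the simulated circuit: the class name is hypothetical, the flip-flop is modelled as rising-edge triggered, and the 2x1 multiplexer is assumed to select between the direct LUT output and the registered output.

```python
class LogicElement:
    """Behavioural logic element: 4-input LUT, D flip-flop, 2x1 output mux."""

    def __init__(self, lut_bits, registered: bool):
        if len(lut_bits) != 16:
            raise ValueError("a 4-input LUT needs 16 configuration bits")
        self.lut_bits = list(lut_bits)
        self.registered = registered  # select line of the 2x1 multiplexer
        self.q = 0
        self._prev_clk = 0

    def step(self, a: int, b: int, c: int, d: int, clk: int) -> int:
        # 16x1 multiplexer: the inputs address one configuration bit.
        lut_out = self.lut_bits[(d << 3) | (c << 2) | (b << 1) | a]
        # D flip-flop: capture the LUT output on the rising clock edge.
        if self._prev_clk == 0 and clk == 1:
            self.q = lut_out
        self._prev_clk = clk
        # 2x1 multiplexer: choose the registered or the combinational path.
        return self.q if self.registered else lut_out


# Example configuration: the LUT computes a XOR b and ignores c and d.
XOR_AB = [(i & 1) ^ ((i >> 1) & 1) for i in range(16)]
combinational = LogicElement(XOR_AB, registered=False)
sequential = LogicElement(XOR_AB, registered=True)
```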

VII. CONCLUSION

The logic block of the FPGA is considered as a 4-input LUT, a D flip-flop and a 2x1 multiplexer, and each block is analyzed individually. First, the 4-input LUT is analyzed as a conventional 16x1 multiplexer, a 16x1 multiplexer using 8x1, a 16x1 multiplexer using 4x1, and a Pass Transistor Logic (PTL) multiplexer. Second, the D flip-flop is analyzed as an 11-transistor TSPC flip-flop and a NAND-based DFF. Third, the 2x1 multiplexer is analyzed as a standard 2x1 multiplexer and a pass-transistor multiplexer. The optimized FPGA logic block, built from the most efficient variant of each block, has lower power and delay.


REFERENCES

[1] Amit Singh and Malgorzata Marek Sadowska, “Efficient Circuit Clustering for Area and Power Reduction in FPGAs”, University of California.
[5] Jason H. Anderson and Farid N. Najm, “Active Leakage Power Optimization for FPGAs”, University of Toronto, Canada [Accessed on 2004].
[12] Thomas Marconi, “Efficient Runtime Management of Reconfigurable Hardware Resources”, Magister Teknik in Electrical Engineering.