Energy Efficient VLSI: ALU Design

2019, Mar 28    

Introduction

As a student in ECE 471: Energy Efficient VLSI, I was tasked with designing a 6-bit ALU with a minimal Energy Delay Product (EDP). This post covers the process that went into creating a functional ALU: design, schematic-level simulation, layout, and post-layout simulation.

The assignment requires the ALU to take two 6-bit operands and a carry-in bit as inputs. The ALU has four output flags: carry out, overflow, zero, and parity. The assignment required eight operations, and I added more because I was interested in giving my ALU extra capability. All of the ALU's operations are listed below. Operations marked with a * are part of the required functionality; the rest are ones I added. Although several operations were added, this post focuses on the required ones.

CTRL  Operation                    CTRL  Operation
0000  Disable ALU *                1000  Addition with Carry
0001  Bitwise AND *                1001  Subtraction without borrow
0010  Bitwise OR *                 1010  Invert
0011  Bitwise XOR *                1011  Increment
0100  Rotate Left without carry *  1100  Rotate Right without Carry
0101  Rotate Left with Carry       1101  Rotate Right with Carry
0110  Logical Shift Left *         1110  Logical Shift Right
0111  Addition without Carry *     1111  Decrement
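
The four output flags can be illustrated with a small behavioural model in Python. This is only a sketch and says nothing about the transistor-level design: the name `add_with_flags` is made up for illustration, overflow is assumed to mean signed (two's complement) overflow, and the parity flag is assumed to be 1 when the result contains an odd number of ones, since the polarity is not stated here.

```python
WIDTH = 6
MASK = (1 << WIDTH) - 1      # 0b111111
SIGN = 1 << (WIDTH - 1)      # most significant bit of a 6-bit word


def add_with_flags(a, b, carry_in=0):
    """Add two 6-bit operands and return (result, flags)."""
    a &= MASK
    b &= MASK
    total = a + b + (carry_in & 1)
    result = total & MASK
    flags = {
        "carry": total >> WIDTH,
        # Signed overflow: operands share a sign and the result's sign differs.
        "overflow": int(bool(~(a ^ b) & (a ^ result) & SIGN)),
        "zero": int(result == 0),
        # Assumption: parity is 1 for an odd number of ones in the result.
        "parity": bin(result).count("1") & 1,
    }
    return result, flags
```

For example, `add_with_flags(0b011111, 0b000001)` overflows in the signed sense (31 + 1 does not fit in six signed bits) without producing a carry out, while `add_with_flags(0b111111, 0b000001)` produces a carry out and a zero result.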

Design

Research and Design

To choose a base transistor size, I simulated an inverter with different NMOS-to-PMOS size ratios. The process node sets the smallest transistor size I could use; for NMOS this was 220 nm. For each ratio I judged whether it would suit the design based on the transition time. I started at 2x and increased the ratio by 0.5x each test (i.e. 2x, 2.5x, 3x, …). The high-to-low and low-to-high propagation times were equal at 4x. To keep the layout smaller, I chose 3x: it performs better than 2x but does not need as much space as 4x.

The first design decision was whether the architecture would be slice based or block based. I used a slice architecture because it lets the design scale easily to more input bits. Because the adder block needs every other input bit inverted, I used a two-bit slice. Routing was challenging because some signals need to reach the other side of the ALU (e.g. for shifting). Taking everything into consideration, the benefits outweighed the drawbacks.

Design of Adder Block

In Lab 6 two adders were assessed: the ripple-carry adder and the carry look-ahead adder. That lab compared the size and speed of the two. While the carry look-ahead adder gave better performance, it would have taken more time to implement in the final project. In Lab 5 the mirror adder was designed, simulated, and laid out, so I was able to use it as the base of my ripple-carry adder. Being able to reuse a past design was the reason for picking the ripple-carry adder for the final project.
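
To make the "every other input bit inverted" idea concrete, here is a bit-level Python sketch of a ripple-carry adder built from inverting full-adder cells. It is a behavioural model under assumptions, not the schematic used in this design: it assumes each mirror adder cell produces an inverted carry and an inverted sum, and it relies on the fact that a full adder fed inverted inputs produces inverted outputs. The names `mirror_adder_cell` and `ripple_add` are hypothetical.

```python
WIDTH = 6


def mirror_adder_cell(a, b, c):
    """Inverting full-adder cell: returns (NOT carry_out, NOT sum) as 0/1 bits."""
    carry = (a & b) | (c & (a | b))
    total = a ^ b ^ c
    return carry ^ 1, total ^ 1


def ripple_add(a, b, carry_in=0):
    """Ripple-carry add with no inverter in the carry path."""
    carry = carry_in & 1          # true polarity entering bit 0
    result = 0
    for i in range(WIDTH):
        a_bit = (a >> i) & 1
        b_bit = (b >> i) & 1
        if i % 2 == 0:
            # Even bit: true inputs in, inverted carry and sum out.
            carry, n_sum = mirror_adder_cell(a_bit, b_bit, carry)
            s_bit = n_sum ^ 1     # one inverter restores the sum
        else:
            # Odd bit: inverted operands plus the already-inverted carry in,
            # so the cell's outputs come out in true polarity.
            carry, s_bit = mirror_adder_cell(a_bit ^ 1, b_bit ^ 1, carry)
        result |= s_bit << i
    return result, carry
```

Because the width is even, the carry leaving the last bit is back in true polarity. Pairing one even and one odd bit is one way to read the choice of a two-bit slice.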

Figure: Transistor level schematic of a mirror adder with sizing used in the design.

Figure: Schematic of the Adder Block.

Design of Logic Block

The logic block is able to perform bitwise AND, OR, XOR, and INVERT.

Figure: The Schematic for the Logic Block. This is for one bit of the design.

Design of Shifting Block

The shifting block is able to shift left or right. The block is made using a MUX.
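
A MUX-based shifter can be pictured as each output bit selecting one of its neighbours. The Python sketch below models that selection for the shift and rotate operations without carry. It is an illustration only: the function name `shift_block` and its arguments are assumptions, and the rotate-with-carry operations are left out because their exact behaviour is not described here.

```python
WIDTH = 6


def shift_block(a, direction, rotate=False):
    """Shift or rotate a 6-bit value by one position.

    direction: "left" or "right".
    rotate: if True, the bit shifted out wraps around; otherwise 0 is shifted in.
    """
    bits = [(a >> i) & 1 for i in range(WIDTH)]
    out = 0
    for i in range(WIDTH):
        # Each output bit picks its lower neighbour (left) or upper neighbour (right).
        src = i - 1 if direction == "left" else i + 1
        if 0 <= src < WIDTH:
            bit = bits[src]
        elif rotate:
            bit = bits[src % WIDTH]   # wrap around for rotates
        else:
            bit = 0                   # logical shift fills with zero
        out |= bit << i
    return out
```

With the input `0b100001`, a logical shift left gives `0b000010` and a rotate left gives `0b000011`.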

Figure: Schematic for the shifting unit.

Design of Output Buffers

One of the requirements for this project was to drive a 2 pF load. A buffer with the minimum transistor sizes would not be able to drive this load. As a solution to the problem I calculated the optimal number of inverters and the scale factor for them using the equations below.

F = CL/Cin
N = ln(F)
f = F^(1/N), i.e. the N-th root of F

I found the input capacitance of a base inverter in HSpice using .option captab, which gave 1 fF. Plugging this number into the equations above, I found the optimal number of inverters to be 7.4 with a scale factor of 3.5.
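
The sizing arithmetic can be reproduced with a few lines of Python. This is a sketch that assumes the per-stage scale factor is the N-th root of the overall fan-out F; the function names are made up for illustration. With the rounded 1 fF input capacitance, the results land near, not exactly on, the figures quoted above, so treat them as illustrative.

```python
import math


def fanout_and_optimal_stages(c_load, c_in):
    """Return (F, N) with F = CL/Cin and N = ln(F)."""
    fanout = c_load / c_in
    return fanout, math.log(fanout)


def stage_ratio(fanout, n_stages):
    """Scale factor per stage for a chain of n_stages inverters: F^(1/N)."""
    return fanout ** (1.0 / n_stages)


F, n_opt = fanout_and_optimal_stages(2e-12, 1e-15)   # 2 pF load, 1 fF input
for n in (4, 6):
    print(n, round(stage_ratio(F, n), 2))
```

Since a real chain needs a whole number of stages, it helps to compare the scale factor for a few integer values of N, such as the 4-stage and 6-stage chains considered in this design.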

I first considered 6 inverters with a scale factor of 3 for my design. During layout I found that the last inverter was very large, so I removed it. With that inverter gone, the resulting 5-inverter chain no longer buffered the output; it inverted it. I therefore also removed the first inverter so that the chain acts as a buffer again. After simulating 4-inverter and 6-inverter chains, I found they had similar performance. I used 4 inverters in my design because the chain takes less space and the propagation time was still acceptable.

Design of the ALU

The first step in integrating the operational blocks (adder, logic, shifting) was to connect the ALU inputs to the adders. For the ALU to execute all operations, the adder's input values need to be modified. The table below shows how the inputs need to be modified and for which operations. To accomplish this, I used MUXs to switch the inputs. The adder needs one MUX to switch the Carry In input and another MUX to switch the second operand (input B). The logic for the MUX select signals was derived from the table.

The next step was to select the correct output of the operational blocks based on the opcode. The MUXs at the outputs of the three blocks select what is sent out of the ALU. The selected signal is then buffered.

Operation                Control  Adder Input A  Adder Input B  Adder Carry In
Add without Carry        0111     A              B              0
Add with Carry           1000     A              B              Carry In
Subtract without Borrow  1001     A              ~B             1
Increment                1011     A              0              1
Decrement                1111     A              11 1110        1
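
The table can be checked with a short behavioural model in Python in which a single adder serves all five operations and only its B and carry-in inputs are switched. This is a sketch of the input selection, not the MUX select logic itself; the function names `adder_inputs` and `run_adder` are hypothetical.

```python
WIDTH = 6
MASK = (1 << WIDTH) - 1


def adder_inputs(ctrl, a, b, carry_in):
    """Return (adder A, adder B, adder carry in) for an adder opcode."""
    if ctrl == 0b0111:                      # add without carry
        return a, b, 0
    if ctrl == 0b1000:                      # add with carry
        return a, b, carry_in
    if ctrl == 0b1001:                      # subtract without borrow: A + ~B + 1
        return a, ~b & MASK, 1
    if ctrl == 0b1011:                      # increment: A + 0 + 1
        return a, 0, 1
    if ctrl == 0b1111:                      # decrement: A + 111110 + 1 = A - 1
        return a, 0b111110, 1
    raise ValueError("not an adder operation")


def run_adder(ctrl, a, b=0, carry_in=0):
    """6-bit result of the shared adder for the given opcode."""
    x, y, c = adder_inputs(ctrl, a & MASK, b & MASK, carry_in & 1)
    return (x + y + c) & MASK
```

For instance, `run_adder(0b1001, 10, 3)` returns 7, and `run_adder(0b1111, 0)` wraps around to 63. The decrement row works because 11 1110 is -2 in 6-bit two's complement, so adding it with a carry in of 1 subtracts one.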

Figure: Block diagram of the ALU.

Schematic Level Simulation

Once the schematic design was complete, I verified that all the blocks worked. I wanted to write a script that would simulate every input to make sure the ALU was working. I used vector files to have HSpice apply inputs to the circuit and check the values generated by the ALU. This was challenging and took a lot of time. The script would work for one block but then needed a lot of rework when that block was integrated into a larger block (e.g. the AND block into the logic operational block). This method only worked for the operational blocks, and a different method was needed for final integration.

In hindsight, I don’t think simulating every input is the best approach. For the 6-bit ALU, approximately 1 ms of simulated time is needed. This is unrealistic: 5 us took 30 seconds to simulate, so 1 ms would have taken almost 2 hours per run. A better approach would have been to cover common cases and edge cases.
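
The runtime estimate and the idea of a reduced test set can be sketched in Python. The extrapolation assumes run time scales linearly with simulated time, based on the single 5 us / 30 s data point above. The list of edge-case operands is my own suggestion of typical corner values, not a set taken from the assignment, and all names are hypothetical.

```python
from itertools import product


def wall_clock_seconds(sim_seconds, ref_sim=5e-6, ref_wall=30.0):
    """Linear extrapolation from the measured point: 5 us simulated in 30 s."""
    return sim_seconds / ref_sim * ref_wall


# Assumed corner values for a 6-bit operand: all zeros, one, largest positive,
# sign bit only, all ones, and the two alternating patterns.
EDGE_OPERANDS = [0b000000, 0b000001, 0b011111, 0b100000,
                 0b111111, 0b101010, 0b010101]


def edge_case_vectors():
    """Every (A, B, carry in) combination of the edge operands."""
    return list(product(EDGE_OPERANDS, EDGE_OPERANDS, (0, 1)))


exhaustive_count = 2 ** (6 + 6 + 1)                 # two 6-bit operands + carry in
hours = round(wall_clock_seconds(1e-3) / 3600, 2)   # roughly 1.67 hours
```

The reduced set has 98 vectors per operation, compared with 8192 for an exhaustive sweep of both operands and the carry in.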

The next step was to simulate the ALU manually. In the .spi file, pulse functions were used to generate inputs for the ALU. The simulation was run for 500 ns and the values were checked. If the values were as expected, the ALU was deemed able to perform that operation. The ALU was able to perform all operations listed in the table in the Introduction.

Layout

The size of the final design ended up being 200 um x 130 um. The ALU contains approximately 1550 transistors.

Layout of ALU Slice

Layout of a 2-bit ALU slice. From left to right: the adder, two logic blocks, two shifting blocks, and the output MUXs.

Layout of the ALU

Layout of the ALU. At the far left are the input MUXs to the adders. Moving right, there are three copies of the ALU slice stacked vertically, followed by the output buffers for the ALU operation result. At the far right are the flag logic and the output buffers for the flags.

HiRes ALU Layout

Post-Layout Simulation

Post-layout simulation is used to get a more accurate understanding of how the circuit will operate in real life, because it includes the parasitics of wires and components. From the testing done, I found the maximum propagation time for each operation and used it to calculate the maximum operating frequency of the ALU: 520 MHz.
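
The step from propagation time to operating frequency can be sketched in Python, assuming the maximum frequency is simply the reciprocal of the slowest operation's propagation time. The per-operation delays in the example are hypothetical placeholders, not measured values from this design.

```python
def max_frequency_hz(delays_s):
    """Maximum clock frequency given per-operation worst-case delays in seconds."""
    return 1.0 / max(delays_s.values())


# Hypothetical delays, for illustration only.
example_delays = {"add": 1.9e-9, "and": 0.8e-9, "shift": 0.9e-9}
f_max = max_frequency_hz(example_delays)

# Going the other way: 520 MHz corresponds to a worst-case delay near 1.92 ns.
period_at_520mhz = 1.0 / 520e6
```

Under this assumption the slowest operation alone sets the clock rate, which is why the worst-case input for each operation matters.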

Energy Delay Product (EDP)

Once the entire ALU was laid out, it was simulated with parasitics. Each operation was tested with its worst-case input: one that causes a flag to change state, resulting in a longer propagation time. This technique was used to find the largest EDP because EDP is proportional to propagation time. HSpice calculated the dynamic power of the ALU by integrating the supply rail power from when the input reached 10% of its final value to when the output was at 90% of its settling value.

Each time an operation is carried out, every block in the ALU computes a value. Because of this, I expected the dynamic power to be almost the same for each operation, and this ended up being the case: most operations used ~20 mW.

What makes it Energy Efficient?

When creating this design, I tried to make the layout small and not have wires going everywhere. The main purpose of this was to reduce parasitics. The resistance of the wires creates power loss and the capacitance slows down the ALU.

I also simulated the ALU at different operating voltages to find where it had the best Energy Delay Product. I chose to optimize around voltage because EDP is proportional to voltage squared: (CL * VDD^2 * tp)/2. A small change in voltage can cause a large change in EDP. The ALU was simulated at two operating voltages, 1.7 V and 1.8 V. Neither had a clear performance benefit over the other: some operations performed better at 1.8 V while others performed better at 1.7 V. However, the ALU consistently had lower dynamic power at 1.7 V. For these reasons I picked 1.7 V as the operating voltage of the ALU. Given more time, I would have simulated a wider range of voltages.
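
The voltage dependence can be illustrated with the EDP expression (CL * VDD^2 * tp)/2 in Python. This sketch holds CL and tp fixed to isolate the VDD^2 term. In a real circuit tp also changes with VDD, which is why simulation is still needed. The function name `edp` and the 1.9 ns delay are hypothetical; the 2 pF load is the one from the output buffer requirement.

```python
def edp(c_load, vdd, t_p):
    """Energy Delay Product: (CL * VDD^2 / 2) * tp."""
    energy = 0.5 * c_load * vdd ** 2
    return energy * t_p


c_load = 2e-12      # 2 pF load
t_p = 1.9e-9        # hypothetical propagation time, held constant

ratio = edp(c_load, 1.7, t_p) / edp(c_load, 1.8, t_p)
print(round(ratio, 3))
```

With the delay held constant, dropping from 1.8 V to 1.7 V scales EDP by (1.7/1.8)^2, a reduction of roughly 11%.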

Future Improvements

For the second generation of my ALU, I would implement more ways to make it power efficient. One technique I would want to use is power gating, which would turn off operational blocks that the current operation isn’t using. This would reduce both dynamic and static power. Another option is to keep the inputs from reaching an operational block when it is not needed, which would reduce dynamic power. Having multiple voltage domains is another way to save power, but I think the design is too small for multiple voltage domains to be effective.