CPU Design (school project)

Goal: design a 5-stage, 32-bit CPU capable of executing 43 MIPS/DLX assembly instructions and write an RTL description of it (the design is based on the classic RISC pipeline).

Tools:
  • SystemVerilog
  • Synopsys VCS
  • Quartus II + Cyclone II FPGA
  • Perl

Note: If anyone is interested in improving this CPU, contact me and I will gladly provide all the code along with all the details. I will also try to allocate some time for documenting this design.

Completed the following:
  • wrote synthesizable SystemVerilog RTL and synthesized it with Quartus II for the Cyclone II FPGA (without any timing constraints)
  • partially verified using Synopsys VCS (placed instructions in the instruction memory, initialized CPU registers, and ran for a few cycles)
  • wrote an assembler (Perl) to generate R-type instructions for verification
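
To give a rough idea of what an R-type assembler has to do, here is a minimal sketch in Python (not the original Perl). The bit layout (opc1 in [31:26], Rs1 in [25:21], Rs2 in [20:16], Rd in [15:11], opc2 in [5:0]) is an assumption borrowed from the MIPS R-format and may differ from this design. The function codes are the opc2 values from the instruction set table. The names `FUNC`, `reg_num` and `assemble_rtype` are made up for the example.

```python
# Hypothetical R-type encoder; the bit layout is an ASSUMED MIPS-style R-format.
FUNC = {"ADD": 0x20, "SUB": 0x22, "AND": 0x24, "OR": 0x25, "XOR": 0x26}  # opc2 values

def reg_num(name):
    """Turn 'R5' into 5, rejecting anything outside R0..R31."""
    n = int(name.strip().upper().lstrip("R"))
    if not 0 <= n <= 31:
        raise ValueError(f"bad register: {name}")
    return n

def assemble_rtype(line):
    """Assemble e.g. 'ADD R3,R1,R2' into a 32-bit word (opc1 is 0 for R-type)."""
    mnemonic, operands = line.split(None, 1)
    rd, rs1, rs2 = (reg_num(r) for r in operands.split(","))
    func = FUNC[mnemonic.upper()]
    return (0 << 26) | (rs1 << 21) | (rs2 << 16) | (rd << 11) | func

print(f"{assemble_rtype('ADD R3,R1,R2'):08X}")  # 00221820
```

The resulting words can then be loaded into the instruction memory of the simulation.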

Possible improvements include:
  • adding a branch predictor (with a branch target buffer)
  • implementing Tomasulo's algorithm in SystemVerilog (or Verilog)
  • making the CPU superscalar
  • building a verification environment using the UVM library (partially done)



Instruction set (floating-point instructions are excluded):

inst   example             opc1  opc2  description
SLL    SLL Rd,Rs1,sa       00    04    Rd = Rs1 << sa
SRL    SRL Rd,Rs1,sa       00    06    Rd = Rs1 >> sa
SRA    SRA Rd,Rs1,sa       00    07    Rd = Rs1 >> sa, pad with Rs1(msb)
MULT   MUL Rd,Rs1,Rs2      00    10    Rd = Rs1 * Rs2
MULTU  MULU Rd,Rs1,Rs2     00    11    Rd = Rs1 * Rs2
DIV    DIV Rd,Rs1,Rs2      00    12    Rd = Rs1 / Rs2
DIVU   DIVU Rd,Rs1,Rs2     00    13    Rd = Rs1 / Rs2
ADD    ADD Rd,Rs1,Rs2      00    20    Rd = Rs1 + Rs2
ADDU   ADDU Rd,Rs1,Rs2     00    21    Rd = Rs1 + Rs2
SUB    SUB Rd,Rs1,Rs2      00    22    Rd = Rs1 - Rs2
SUBU   SUBU Rd,Rs1,Rs2     00    23    Rd = Rs1 - Rs2
AND    AND Rd,Rs1,Rs2      00    24    Rd = Rs1 & Rs2
OR     OR Rd,Rs1,Rs2       00    25    Rd = Rs1 | Rs2
XOR    XOR Rd,Rs1,Rs2      00    26    Rd = Rs1 ^ Rs2
SEQ    Sc Rd,Rs1,Rs2       00    28    Rd = (Rs1 == Rs2) ? 1 : 0
SNE    Sc Rd,Rs1,Rs2       00    29    Rd = (Rs1 != Rs2) ? 1 : 0
SLT    Sc Rd,Rs1,Rs2       00    2A    Rd = (Rs1 < Rs2) ? 1 : 0
SGT    Sc Rd,Rs1,Rs2       00    2B    Rd = (Rs1 > Rs2) ? 1 : 0
SLE    Sc Rd,Rs1,Rs2       00    2C    Rd = (Rs1 <= Rs2) ? 1 : 0
SGE    Sc Rd,Rs1,Rs2       00    2D    Rd = (Rs1 >= Rs2) ? 1 : 0
J      J dst               02    00    PC = (PC+4) + se(dst)
JAL    JAL dst             03    00    R31 = (PC+4); PC = (PC+4) + se(imm)
BEQZ   BEQZ Rs1,dst        04    00    PC = (Rs1 == 0) ? (se(imm) + PC+4) : PC+4
BNEQ   BNEZ Rs1,dst        05    00    PC = (Rs1 != 0) ? (se(imm) + PC+4) : PC+4
SRLI   SRLI Rs2,Rs1,#imm   06    00    Rs2 = Rs1 >> (imm)
ADDI   ADDI Rs2,Rs1,#imm   08    00    Rs2 = Rs1 + se(imm)
ADDUI  ADDUI Rs2,Rs1,#imm  09    00    Rs2 = Rs1 + use(imm)
SUBI   SUBI Rs2,Rs1,#imm   0A    00    Rs2 = Rs1 - se(imm)
SUBUI  SUBUI Rs2,Rs1,#imm  0B    00    Rs2 = Rs1 - use(imm)
ANDI   ANDI Rs2,Rs1,#imm   0C    00    Rs2 = Rs1 & se(imm)
ORI    ORI Rs2,Rs1,#imm    0D    00    Rs2 = Rs1 | se(imm)
XORI   XORI Rs2,Rs1,#imm   0E    00    Rs2 = Rs1 ^ se(imm)
JR     JR Rs1              12    00    PC = Rs1
SLLI   SLLI Rs2,Rs1,#imm   14    00    Rs2 = Rs1 << (imm)
SRAI   SRAI Rs2,Rs1,#imm   17    00    Rs2 = Rs1 >> (imm), pad with Rs1(msb)
SEQI   ScI Rs2,Rs1,#imm    18    00    Rs2 = (Rs1 == se(imm)) ? 1 : 0
SNEI   ScI Rs2,Rs1,#imm    19    00    Rs2 = (Rs1 != se(imm)) ? 1 : 0
SLTI   ScI Rs2,Rs1,#imm    1A    00    Rs2 = (Rs1 < se(imm)) ? 1 : 0
SGTI   ScI Rs2,Rs1,#imm    1B    00    Rs2 = (Rs1 > se(imm)) ? 1 : 0
SLEI   ScI Rs2,Rs1,#imm    1C    00    Rs2 = (Rs1 <= se(imm)) ? 1 : 0
SGEI   ScI Rs2,Rs1,#imm    1D    00    Rs2 = (Rs1 >= se(imm)) ? 1 : 0
LW     LW Rn,src           23    00    Rs2 = mem[Rs1 + se(imm)]
SW     SW Rn,dst           2B    00    mem[Rs1 + se(imm)] = Rs2
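
The se()/use() notation and the two kinds of right shift in the table are easy to mix up, so here is a small behavioral sketch in Python. It assumes that use() means zero extension and that shift amounts are taken modulo 32. Neither assumption is confirmed by the RTL, and the function names simply mirror the table.

```python
MASK32 = 0xFFFFFFFF

def se(imm, bits=16):
    """Sign-extend a `bits`-wide immediate to 32 bits (the table's se())."""
    imm &= (1 << bits) - 1
    if imm >> (bits - 1):          # msb set: the value is negative
        imm -= 1 << bits
    return imm & MASK32

def use(imm, bits=16):
    """Zero-extend an immediate (ASSUMED meaning of the table's use())."""
    return imm & ((1 << bits) - 1)

def srl(x, sa):
    """Logical right shift: vacated bits are filled with 0."""
    return (x & MASK32) >> (sa & 31)   # sa masked to 5 bits (assumption)

def sra(x, sa):
    """Arithmetic right shift: vacated bits are filled with the msb of x."""
    x &= MASK32
    signed = x - (1 << 32) if x >> 31 else x
    return (signed >> (sa & 31)) & MASK32

print(hex(se(0xFFFF)))           # 0xffffffff
print(hex(use(0xFFFF)))          # 0xffff
print(hex(srl(0x80000000, 4)))   # 0x8000000
print(hex(sra(0x80000000, 4)))   # 0xf8000000
```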



Instruction formats. From a hardware design standpoint, instruction formats are the primary concern because they dictate how the controller will work.
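
As an illustration of why the formats matter, the sketch below slices a 32-bit instruction word into fields in Python. The field positions are an assumption based on the usual MIPS-style R/I/J split (6-bit opc1, 5-bit register numbers, 16-bit and 26-bit immediates) and are not taken from this design's format diagrams. `decode_fields` is a name made up for the example.

```python
def decode_fields(word):
    """Split a 32-bit instruction word into candidate fields (ASSUMED layout)."""
    return {
        "opc1":  (word >> 26) & 0x3F,   # primary opcode, common to all formats
        "rs1":   (word >> 21) & 0x1F,   # R- and I-type source register
        "rs2":   (word >> 16) & 0x1F,   # R-type source / I-type destination
        "rd":    (word >> 11) & 0x1F,   # R-type destination
        "sa":    (word >> 6) & 0x1F,    # R-type shift amount (assumed position)
        "opc2":  word & 0x3F,           # R-type function code (assumed 6 bits)
        "imm16": word & 0xFFFF,         # I-type immediate
        "imm26": word & 0x3FFFFFF,      # J-type offset
    }

# 0x00221820 would be ADD R3,R1,R2 under the assumed layout.
f = decode_fields(0x00221820)
print(f["opc1"], f["rs1"], f["rs2"], f["rd"], hex(f["opc2"]))  # 0 1 2 3 0x20
```

All fields are extracted in parallel. The controller then decides, based on the format, which of them are meaningful for the current instruction.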

The following schematic was used as a reference when writing the top module (Version 1.0, 13 June 2013).

Building blocks:
  • Flip-flops (registers dividing the design into 5 stages)
  • Multiplexers
  • Adders
  • Memory blocks 
    • CPU registers ( reg [31:0] cpu_regs [0:31]; )
    • Instruction memory (supports up to 2^32 addresses, 4 GB of RAM)
    • Data memory (supports up to 2^32 addresses, 4 GB of RAM)
  • ALU
  • Sign extension blocks
    • SE16
    • SE26
    • UNS
  • Controller (implemented as a Moore FSM with 9 states)
/* ALL INSTRUCTIONS CLASSIFIED BY FORMAT (9 formats correspond to 9 states)
----------------------------------------------------------------------------------------
R1:   MULT MULTU DIV DIVU ADD ADDU SUB SUBU AND OR XOR SEQ SNE SLT SGT SLE SGE 
R2:   SLL SRL SRA
I1:   SLLI SRAI SEQI SNEI SLTI SGTI SLEI SGEI SRLI ADDI ADDUI SUBI SUBUI ANDI ORI XORI
I2:   BEQZ BNEQ
I3:   SW
I4:   LW 
I5:   JR
J1:   J
J2:   JAL
*/
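
The classification in the comment above can be written as a lookup table, which also serves as a quick check that the 9 formats cover all 43 instructions. This Python sketch works on mnemonics for readability. The actual controller would presumably decode opc1/opc2 instead, and the names `FORMATS` and `STATE_OF` are made up for the example.

```python
# Format names and mnemonics copied from the classification comment.
FORMATS = {
    "R1": "MULT MULTU DIV DIVU ADD ADDU SUB SUBU AND OR XOR SEQ SNE SLT SGT SLE SGE",
    "R2": "SLL SRL SRA",
    "I1": "SLLI SRAI SEQI SNEI SLTI SGTI SLEI SGEI SRLI ADDI ADDUI SUBI SUBUI ANDI ORI XORI",
    "I2": "BEQZ BNEQ",
    "I3": "SW",
    "I4": "LW",
    "I5": "JR",
    "J1": "J",
    "J2": "JAL",
}

# Invert the table: mnemonic -> format (one controller state per format).
STATE_OF = {m: fmt for fmt, names in FORMATS.items() for m in names.split()}

print(len(FORMATS), len(STATE_OF))          # 9 43
print(STATE_OF["SRAI"], STATE_OF["JAL"])    # I1 J2
```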



Buses (the documentation format is not settled yet). Buses are 32 bits wide unless otherwise specified; multiplexer select lines are either 1 or 2 bits wide, depending on the number of mux inputs:
  • b1   - program counter
  • b2   - instruction [internally connected to b6]
  • b3   - program count incremented to the next instruction [internally connected to b5]
  • b4   - next program count
  • b5   - program count incremented to the next instruction
  • b6   - instruction
  • b7   - destination register for writing data back to CPU registers
  • b8   - data from source register 1
  • b9   - sign extension [16-bit to 32-bit]
  • b10 - data from source register 2 [internally connected to b14]
  • b11 - ALU input B [internally connected to b13]
  • b12 - ALU input A [internally connected to b8]
  • b13 - ALU input B
  • b14 - data from source register 2 [internally connected to b20]
  • b15 - ALU op code
  • b16 - ALU output [internally connected to b19]
  • b17 - ALU input A
  • b18 - program counter [used for jump instructions] (! check jump instruction datapath)
  • b19 - ALU output [internally connected to b29]
  • b20 - data memory address
  • b21 - data memory output [internally connected to b30]
  • b22 - __unused__
  • b23 - __unused__
  • b24 - __unused__
  • b25 - __unused__
  • b26 - sign extension [26-bit to 32-bit]
  • b27 - sign extension [16-bit to 32-bit]
  • b28 - ALU input B
  • b29 - ALU output
  • b30 - data memory output
  • b31 - __unused__
  • b32 - __unused__
  • b33 - data written back to CPU registers
  • b34 - destination register for writing data back to CPU registers (either current or delayed)
  • b35 - write enable for CPU registers (either current or delayed)
  • b36 - data (or return/link address) written back to CPU registers
Control signals (a numeric suffix marks a delayed copy of a signal, e.g. we - write enable, we2 - the same signal, but delayed, we3 - the same signal, delayed again):
  • wb_sel - select line for write back mux 
  • we - write enable for CPU registers
  • mem_write - write enable for data memory
  • mem_read - read enable for data memory
  • alu_func - select line for ALU function mux
  • pc_sel - select lines for the PC mux which chooses how to update PC
  • bypass - select lines for the bypass mux which chooses which data goes into ALU input A
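
To show what a select signal like pc_sel has to choose between, here is a behavioral sketch of the PC mux in Python, following the PC update rules from the instruction set table. The four-way encoding (`PC_PLUS4`, `PC_BRANCH`, `PC_JUMP`, `PC_REG`) is hypothetical; the real select values of this design are not documented here.

```python
MASK32 = 0xFFFFFFFF

def se(value, bits):
    """Sign-extend a `bits`-wide value to a (possibly negative) Python int."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

# HYPOTHETICAL pc_sel encoding, chosen only for this example.
PC_PLUS4, PC_BRANCH, PC_JUMP, PC_REG = range(4)

def next_pc(pc_sel, pc, imm16=0, imm26=0, rs1=0):
    """Model of the PC mux: pick the next PC as the instruction table describes."""
    pc4 = (pc + 4) & MASK32
    if pc_sel == PC_BRANCH:              # BEQZ/BNEQ, branch taken
        return (pc4 + se(imm16, 16)) & MASK32
    if pc_sel == PC_JUMP:                # J/JAL
        return (pc4 + se(imm26, 26)) & MASK32
    if pc_sel == PC_REG:                 # JR
        return rs1 & MASK32
    return pc4                           # sequential fetch, or branch not taken

print(hex(next_pc(PC_PLUS4, 0x100)))                  # 0x104
print(hex(next_pc(PC_BRANCH, 0x100, imm16=0xFFF8)))   # 0xfc
print(hex(next_pc(PC_JUMP, 0x100, imm26=0x40)))       # 0x144
```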



Sample test: the CPU registers were initialized with the values shown below, instructions were placed in the instruction memory, and the design was run in the Synopsys VCS simulator to display how data propagates through the CPU pipeline. Although the result matches the expected output, this is by no means enough to conclude that the CPU works properly.

A linear testbench was written to monitor the following nodes: (note that there is a 1-cycle delay between adjacent stages)
  • IF stage:       
    • b1   - program counter
  • ID stage: (labeled DE, for DECODE, in the picture; the stage names are not used consistently)
    • b6   - 32-bit instruction fetched from instruction memory
  • EXE stage:    
    • b15 - ALU opcode
    • b13 - ALU input B
    • b17 - ALU input A
  • MEM stage:   
    • b19 - ALU output (data in)
    • b20 - address
    • b21 - data out
  • WB stage:     
    • b33  - result written back to the CPU registers
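
The 1-cycle delay between adjacent stages means that the monitored nodes show the same instruction at different times. A minimal Python model of that timing is sketched below. It deliberately ignores stalls, hazards and branches, and the `trace` helper is made up for the example.

```python
from collections import deque

STAGES = ["IF", "ID", "EXE", "MEM", "WB"]

def trace(instructions, cycles):
    """Return, per cycle, which instruction occupies each stage (None = empty)."""
    pipe = deque([None] * len(STAGES), maxlen=len(STAGES))
    feed = iter(instructions)
    history = []
    for _ in range(cycles):
        # A new instruction enters IF; everything else moves one stage to the
        # right, and whatever was in WB drops out (deque maxlen behavior).
        pipe.appendleft(next(feed, None))
        history.append(dict(zip(STAGES, pipe)))
    return history

for cycle, snapshot in enumerate(trace(["ADD", "SUB", "XOR"], 7), start=1):
    print(cycle, snapshot)
```

In this model an instruction fetched in cycle 1 reaches WB in cycle 5, so a value seen on b1 should be compared against b33 four cycles later.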
