00xnor Sergey Ostrikov: CPU Design (school project)

Goal: design a 5-stage, 32-bit CPU capable of executing 43 MIPS/DLX assembly instructions and write an RTL description of it. The design is based on the classic RISC pipeline.

Tools:

SystemVerilog
Synopsys VCS
Quartus II + Cyclone II FPGA
Perl

Note: If anyone is interested in improving this CPU, contact me and I will gladly provide all the code along with all the details. I will also try to allocate some time for documenting this design.

Completed the following:

wrote synthesizable SystemVerilog RTL (synthesized with Quartus II for the Cyclone II FPGA, without any timing constraints)
partially verified the design using Synopsys VCS (placed instructions in the instruction memory, initialized the CPU registers, and ran the simulation for a few cycles)
wrote an assembler in Perl to generate R-type instructions for verification

Possible improvements include:

adding a branch predictor (with a branch target buffer)
implementing Tomasulo's algorithm in SystemVerilog (or Verilog)
making the CPU superscalar
building a verification environment using the UVM library (partially done)

Instruction set (floating point instructions are excluded).

inst	example	opc1	opc2	description
SLL	SLL Rd,Rs1,sa	00	04	Rd = Rs1 << sa
SRL	SRL Rd,Rs1,sa	00	06	Rd = Rs1 >> sa
SRA	SRA Rd,Rs1,sa	00	07	Rd = Rs1 >> sa, pad with Rs1(msb)
MULT	MULT Rd,Rs1,Rs2	00	10	Rd = Rs1 * Rs2
MULTU	MULTU Rd,Rs1,Rs2	00	11	Rd = Rs1 * Rs2
DIV	DIV Rd,Rs1,Rs2	00	12	Rd = Rs1 / Rs2
DIVU	DIVU Rd,Rs1,Rs2	00	13	Rd = Rs1 / Rs2
ADD	ADD Rd,Rs1,Rs2	00	20	Rd = Rs1 + Rs2
ADDU	ADDU Rd,Rs1,Rs2	00	21	Rd = Rs1 + Rs2
SUB	SUB Rd,Rs1,Rs2	00	22	Rd = Rs1 - Rs2
SUBU	SUBU Rd,Rs1,Rs2	00	23	Rd = Rs1 - Rs2
AND	AND Rd,Rs1,Rs2	00	24	Rd = Rs1 & Rs2
OR	OR Rd,Rs1,Rs2	00	25	Rd = Rs1 | Rs2
XOR	XOR Rd,Rs1,Rs2	00	26	Rd = Rs1 ^ Rs2
SEQ	Sc Rd,Rs1,Rs2	00	28	Rd = (Rs1 == Rs2) ? 1 : 0
SNE	Sc Rd,Rs1,Rs2	00	29	Rd = (Rs1 != Rs2) ? 1 : 0
SLT	Sc Rd,Rs1,Rs2	00	2A	Rd = (Rs1 < Rs2) ? 1 : 0
SGT	Sc Rd,Rs1,Rs2	00	2B	Rd = (Rs1 > Rs2) ? 1 : 0
SLE	Sc Rd,Rs1,Rs2	00	2C	Rd = (Rs1 <= Rs2) ? 1 : 0
SGE	Sc Rd,Rs1,Rs2	00	2D	Rd = (Rs1 >= Rs2) ? 1 : 0
J	J dst	02	00	PC = (PC+4) + se(dst)
JAL	JAL dst	03	00	R31 = (PC+4); PC = (PC+4) + se(imm)
BEQZ	BEQZ Rs1,dst	04	00	PC = (Rs1 == 0) ? (se(imm) + PC+4 ) : PC+4
BNEQ	BNEZ Rs1,dst	05	00	PC = (Rs1 != 0) ? (se(imm) + PC+4 ) : PC+4
SRLI	SRLI Rs2,Rs1,#imm	06	00	Rs2 = Rs1 >> (imm)
ADDI	ADDI Rs2,Rs1,#imm	08	00	Rs2 = Rs1 + se(imm)
ADDUI	ADDUI Rs2,Rs1,#imm	09	00	Rs2 = Rs1 + use(imm)
SUBI	SUBI Rs2,Rs1,#imm	0A	00	Rs2 = Rs1 - se(imm)
SUBUI	SUBUI Rs2,Rs1,#imm	0B	00	Rs2 = Rs1 - use(imm)
ANDI	ANDI Rs2,Rs1,#imm	0C	00	Rs2 = Rs1 & se(imm)
ORI	ORI Rs2,Rs1,#imm	0D	00	Rs2 = Rs1 | se(imm)
XORI	XORI Rs2,Rs1,#imm	0E	00	Rs2 = Rs1 ^ se(imm)
JR	JR Rs1	12	00	PC = Rs1
SLLI	SLLI Rs2,Rs1,#imm	14	00	Rs2 = Rs1 << (imm)
SRAI	SRAI Rs2,Rs1,#imm	17	00	Rs2 = Rs1 >> (imm), pad with Rs1(msb)
SEQI	ScI Rs2,Rs1,#imm	18	00	Rs2 = (Rs1 == se(imm)) ? 1 : 0
SNEI	ScI Rs2,Rs1,#imm	19	00	Rs2 = (Rs1 != se(imm)) ? 1 : 0
SLTI	ScI Rs2,Rs1,#imm	1A	00	Rs2 = (Rs1 < se(imm)) ? 1 : 0
SGTI	ScI Rs2,Rs1,#imm	1B	00	Rs2 = (Rs1 > se(imm)) ? 1 : 0
SLEI	ScI Rs2,Rs1,#imm	1C	00	Rs2 = (Rs1 <= se(imm)) ? 1 : 0
SGEI	ScI Rs2,Rs1,#imm	1D	00	Rs2 = (Rs1 >= se(imm)) ? 1 : 0
LW	LW Rn,src	23	00	Rs2 = mem[Rs1 + se(imm)]
SW	SW Rn,dst	2B	00	mem[Rs1 + se(imm)] = Rs2
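The description column can be read as a small behavioral model. The following Python sketch is illustrative only and is not the RTL: it assumes that `se` is sign extension, that `use` is unsigned (zero) extension, that shift amounts use only the low 5 bits, and that the set-on-compare instructions compare signed values. All function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def se(value, bits=16):
    """Sign-extend the low `bits` bits of value to a 32-bit pattern."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value & MASK32

def use(value, bits=16):
    """Unsigned (zero) extension: keep the low `bits` bits (assumed meaning of use())."""
    return value & ((1 << bits) - 1)

def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def sll(rs1, sa):
    return (rs1 << (sa & 31)) & MASK32

def srl(rs1, sa):
    return (rs1 & MASK32) >> (sa & 31)

def sra(rs1, sa):
    # Python's >> on a negative int pads with the sign bit, like "pad with Rs1(msb)".
    return (to_signed(rs1) >> (sa & 31)) & MASK32

def slt(rs1, rs2):
    # Assumption: signed comparison.
    return 1 if to_signed(rs1) < to_signed(rs2) else 0

def addi(rs1, imm):
    return (rs1 + se(imm)) & MASK32

def beqz_next_pc(pc, rs1, imm):
    """PC = (Rs1 == 0) ? (se(imm) + PC+4) : PC+4"""
    fallthrough = (pc + 4) & MASK32
    return (fallthrough + se(imm)) & MASK32 if rs1 == 0 else fallthrough
```

For example, `sra(0x80000000, 4)` keeps the sign bit while `srl(0x80000000, 4)` shifts in zeros, and `beqz_next_pc(0x100, 0, 0xFFFC)` branches back to `0x100` because the immediate sign-extends to -4.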

Instruction formats. From a hardware design standpoint, instruction formats are the primary concern because they dictate how the controller will work.
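To make the idea of a format concrete, here is a minimal Python sketch that packs and unpacks instruction fields. The bit positions are an assumption, not taken from this design: they follow the common MIPS-style layout (opc1 in bits 31:26, Rs1 in 25:21, Rs2 in 20:16, Rd in 15:11, sa in 10:6, opc2 in 5:0, and a 16-bit immediate in 15:0 for I-type). The function names are hypothetical.

```python
def encode_r(opc2, rd, rs1, rs2=0, sa=0, opc1=0x00):
    """Pack an R-type word (assumed MIPS-style field positions)."""
    return ((opc1 & 0x3F) << 26 | (rs1 & 0x1F) << 21 | (rs2 & 0x1F) << 16
            | (rd & 0x1F) << 11 | (sa & 0x1F) << 6 | (opc2 & 0x3F))

def encode_i(opc1, rs2, rs1, imm):
    """Pack an I-type word; Rs2 is the destination, as in the table above."""
    return ((opc1 & 0x3F) << 26 | (rs1 & 0x1F) << 21 | (rs2 & 0x1F) << 16
            | (imm & 0xFFFF))

def decode(word):
    """Split a 32-bit word into every field; the controller picks the ones it needs."""
    return {
        "opc1": (word >> 26) & 0x3F,
        "rs1": (word >> 21) & 0x1F,
        "rs2": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "sa": (word >> 6) & 0x1F,
        "opc2": word & 0x3F,
        "imm16": word & 0xFFFF,
        "imm26": word & 0x3FFFFFF,
    }

# ADD R3,R1,R2 (opc1 = 00, opc2 = 20) and ADDI R2,R1,#-1 (opc1 = 08)
add_word = encode_r(0x20, rd=3, rs1=1, rs2=2)
addi_word = encode_i(0x08, rs2=2, rs1=1, imm=-1)
```

Under these assumptions the decoder always extracts the same bit ranges, and only the interpretation of the fields changes with the format, which is why the format determines the controller's behavior.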

The following schematic was used as a reference when writing the top module (Version 1.0, 13 June 2013).

Building blocks:

Flip-flops (registers dividing the design into 5 stages)
Multiplexers
Adders
Memory blocks

CPU registers ( reg [31:0] cpu_regs [0:31]; )
Instruction memory (supports up to 2³² addresses, 4 GB of RAM)
Data memory (supports up to 2³² addresses, 4 GB of RAM)

ALU
Sign extension blocks

SE16
SE26
UNS

Controller (implemented as a Moore FSM with 9 states)

/* ALL INSTRUCTIONS CLASSIFIED BY FORMAT (9 formats correspond to 9 states)
----------------------------------------------------------------------------------------
R1:   MULT MULTU DIV DIVU ADD ADDU SUB SUBU AND OR XOR SEQ SNE SLT SGT SLE SGE 
R2:   SLL SRL SRA
I1:   SLLI SRAI SEQI SNEI SLTI SGTI SLEI SGEI SRLI ADDI ADDUI SUBI SUBUI ANDI ORI XORI
I2:   BEQZ BNEQ
I3:   SW
I4:   LW 
I5:   JR
J1:   J
J2:   JAL
*/
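The classification above is effectively a lookup from mnemonic to controller state. A minimal Python sketch of that lookup follows; the dictionary contents are copied from the comment block, while the names `FORMATS`, `FORMAT_OF`, and `format_of` are hypothetical.

```python
# Mnemonics grouped by format, copied from the classification above.
FORMATS = {
    "R1": "MULT MULTU DIV DIVU ADD ADDU SUB SUBU AND OR XOR SEQ SNE SLT SGT SLE SGE",
    "R2": "SLL SRL SRA",
    "I1": "SLLI SRAI SEQI SNEI SLTI SGTI SLEI SGEI SRLI ADDI ADDUI SUBI SUBUI ANDI ORI XORI",
    "I2": "BEQZ BNEQ",
    "I3": "SW",
    "I4": "LW",
    "I5": "JR",
    "J1": "J",
    "J2": "JAL",
}

# Invert the table: one entry per mnemonic, pointing at its format (FSM state).
FORMAT_OF = {m: fmt for fmt, names in FORMATS.items() for m in names.split()}

def format_of(mnemonic):
    """Return the format (one of the 9 states) for a mnemonic, case-insensitively."""
    return FORMAT_OF[mnemonic.upper()]
```

A quick sanity check is that the inverted table has 43 entries, one per instruction in the instruction set table.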

Buses: buses are 32 bits wide unless otherwise specified; multiplexer select lines are either 1-bit or 2-bit, depending on the number of mux inputs. (I haven't settled on a documentation format yet.)

b1 - program counter
b2 - instruction [internally connected to b6]
b3 - program count incremented to the next instruction [internally connected to b5]
b4 - next program count
b5 - program count incremented to the next instruction
b6 - instruction
b7 - destination register for writing data back to CPU registers
b8 - data from source register 1
b9 - sign extension [16-bit to 32-bit]
b10 - data from source register 2 [internally connected to b14]
b11 - ALU input B [internally connected to b13]
b12 - ALU input A [internally connected to b8]
b13 - ALU input B
b14 - data from source register 2 [internally connected to b20]
b15 - ALU op code
b16 - ALU output [internally connected to b19]
b17 - ALU input A
b18 - program counter [used for jump instructions] (! check jump instruction datapath)
b19 - ALU output [internally connected to b29]
b20 - data memory address
b21 - data memory output [internally connected to b30]
b22 - __unused__
b23 - __unused__
b24 - __unused__
b25 - __unused__
b26 - sign extension [26-bit to 32-bit]
b27 - sign extension [16-bit to 32-bit]
b28 - ALU input B
b29 - ALU output
b30 - data memory output
b31 - __unused__
b32 - __unused__
b33 - data written back to CPU registers
b34 - destination register for writing data back to CPU registers (either current or delayed)
b35 - write enable for CPU registers (either current or delayed)
b36 - data (or return/link address) written back to CPU registers

Control signals: (a numeric suffix denotes a delayed copy of the same signal, e.g. we - write enable, we2 - the same signal delayed, we3 - likewise)

wb_sel - select line for write back mux
we - write enable for CPU registers
mem_write - write enable for data memory
mem_read - read enable for data memory
alu_func - select line for ALU function mux
pc_sel - select lines for the PC mux which chooses how to update PC
bypass - select lines for the bypass mux which chooses which data goes into ALU input A
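The delayed copies exist because a control signal is produced early in the pipeline but must take effect when its instruction reaches a later stage. The Python sketch below models such a chain of pipeline registers. It assumes each suffix adds exactly one clock cycle of delay (we2 is one cycle behind we, we3 two cycles behind), which may not match the actual design; the class name is hypothetical.

```python
class DelayedSignal:
    """Chain of flip-flops that carries a control signal alongside its instruction."""

    def __init__(self, depth, reset=0):
        # taps[0] models we2, taps[1] models we3, and so on (assumed naming).
        self.taps = [reset] * depth

    def clock(self, value):
        """Model one rising clock edge: each register captures its predecessor."""
        self.taps = [value] + self.taps[:-1]
        return list(self.taps)

we_chain = DelayedSignal(depth=2)
history = [we_chain.clock(we) for we in (1, 0, 0)]
# history holds [we2, we3] after each clock edge
```

In this run the single `we` pulse appears on we2 after the first edge and on we3 after the second, then drains out of the chain.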

Sample test: initializing the CPU registers with the values shown below, placing instructions in the instruction memory, running the Synopsys VCS simulator, and displaying how data propagates through the CPU pipeline. Although the result matches the expected output, this is by no means enough to conclude that the CPU works properly.

A linear testbench was written to monitor the following nodes: (note that there is a 1-cycle delay between adjacent stages)

IF stage:

b1 - program counter

ID stage: (labeled DE, for DECODE, in the picture; the stage names are not used consistently)

b6 - 32-bit instruction fetched from instruction memory

EXE stage:

b15 - ALU opcode
b13 - ALU input B
b17 - ALU input A

MEM stage:

b19 - ALU output (data in)
b20 - address
b21 - data out

WB stage:

b33 - result written back to the CPU registers
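Because of the 1-cycle delay between adjacent stages, the monitored values for one instruction show up on different cycles at different nodes. The following Python sketch computes which instruction occupies each stage on each cycle. It assumes an ideal pipeline with no stalls or flushes, one instruction fetched per cycle starting at cycle 0; the function names are hypothetical.

```python
STAGES = ["IF", "ID", "EXE", "MEM", "WB"]

def cycle_in_stage(instr_index, stage):
    """Cycle on which instruction `instr_index` is in `stage` (no stalls assumed)."""
    return instr_index + STAGES.index(stage)

def occupancy(num_instrs):
    """One row per cycle; each row lists the instruction index in IF, ID, EXE, MEM, WB."""
    total_cycles = num_instrs + len(STAGES) - 1
    return [
        [c - s if 0 <= c - s < num_instrs else None for s in range(len(STAGES))]
        for c in range(total_cycles)
    ]

for cycle, row in enumerate(occupancy(3)):
    cells = ["-" if i is None else f"i{i}" for i in row]
    print(f"cycle {cycle}: " + "  ".join(f"{name}={cell}" for name, cell in zip(STAGES, cells)))
```

When reading a simulation log, this means the write-back value for an instruction should be looked up four cycles after that instruction's program counter value was observed.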
