Tools:
- SystemVerilog
- Synopsys VCS
- Quartus II + Cyclone II FPGA
- Perl
Note: If anyone is interested in improving this CPU, contact me and I will gladly provide all the code along with all the details. I will also try to allocate some time for documenting this design.
Completed the following:
- wrote synthesizable SystemVerilog RTL (synthesized with Quartus II for the Cyclone II FPGA, without any timing constraints)
- partially verified using Synopsys VCS (placed instructions in the instruction memory, initialized CPU registers, and ran for a few cycles)
- wrote an assembler (Perl) to generate R-type instructions for verification
Possible improvements include:
- adding a branch predictor (with a branch target buffer)
- implementing Tomasulo's algorithm in SystemVerilog (or Verilog)
- making the CPU superscalar
- building a verification environment using the UVM library (partially done)
Instruction set (floating point instructions are excluded).
inst | example | opc1 (hex) | opc2 (hex) | description |
SLL | SLL Rd,Rs1,sa | 00 | 04 | Rd = Rs1 << sa |
SRL | SRL Rd,Rs1,sa | 00 | 06 | Rd = Rs1 >> sa |
SRA | SRA Rd,Rs1,sa | 00 | 07 | Rd = Rs1 >> sa, pad with Rs1(msb) |
MULT | MUL Rd,Rs1,Rs2 | 00 | 10 | Rd = Rs1 * Rs2 |
MULTU | MULU Rd,Rs1,Rs2 | 00 | 11 | Rd = Rs1 * Rs2 |
DIV | DIV Rd,Rs1,Rs2 | 00 | 12 | Rd = Rs1 / Rs2 |
DIVU | DIVU Rd,Rs1,Rs2 | 00 | 13 | Rd = Rs1 / Rs2 |
ADD | ADD Rd,Rs1,Rs2 | 00 | 20 | Rd = Rs1 + Rs2 |
ADDU | ADDU Rd,Rs1,Rs2 | 00 | 21 | Rd = Rs1 + Rs2 |
SUB | SUB Rd,Rs1,Rs2 | 00 | 22 | Rd = Rs1 - Rs2 |
SUBU | SUBU Rd,Rs1,Rs2 | 00 | 23 | Rd = Rs1 - Rs2 |
AND | AND Rd,Rs1,Rs2 | 00 | 24 | Rd = Rs1 & Rs2 |
OR | OR Rd,Rs1,Rs2 | 00 | 25 | Rd = Rs1 | Rs2 |
XOR | XOR Rd,Rs1,Rs2 | 00 | 26 | Rd = Rs1 ^ Rs2 |
SEQ | Sc Rd,Rs1,Rs2 | 00 | 28 | Rd = (Rs1 == Rs2) ? 1 : 0 |
SNE | Sc Rd,Rs1,Rs2 | 00 | 29 | Rd = (Rs1 != Rs2) ? 1 : 0 |
SLT | Sc Rd,Rs1,Rs2 | 00 | 2A | Rd = (Rs1 < Rs2) ? 1 : 0 |
SGT | Sc Rd,Rs1,Rs2 | 00 | 2B | Rd = (Rs1 > Rs2) ? 1 : 0 |
SLE | Sc Rd,Rs1,Rs2 | 00 | 2C | Rd = (Rs1 <= Rs2) ? 1 : 0 |
SGE | Sc Rd,Rs1,Rs2 | 00 | 2D | Rd = (Rs1 >= Rs2) ? 1 : 0 |
J | J dst | 02 | 00 | PC = (PC+4) + se(dst) |
JAL | JAL dst | 03 | 00 | R31 = (PC+4); PC = (PC+4) + se(imm) |
BEQZ | BEQZ Rs1,dst | 04 | 00 | PC = (Rs1 == 0) ? (se(imm) + PC+4 ) : PC+4 |
BNEQ | BNEZ Rs1,dst | 05 | 00 | PC = (Rs1 != 0) ? (se(imm) + PC+4 ) : PC+4 |
SRLI | SRLI Rs2,Rs1,#imm | 06 | 00 | Rs2 = Rs1 >> (imm) |
ADDI | ADDI Rs2,Rs1,#imm | 08 | 00 | Rs2 = Rs1 + se(imm) |
ADDUI | ADDUI Rs2,Rs1,#imm | 09 | 00 | Rs2 = Rs1 + use(imm) |
SUBI | SUBI Rs2,Rs1,#imm | 0A | 00 | Rs2 = Rs1 - se(imm) |
SUBUI | SUBUI Rs2,Rs1,#imm | 0B | 00 | Rs2 = Rs1 - use(imm) |
ANDI | ANDI Rs2,Rs1,#imm | 0C | 00 | Rs2 = Rs1 & se(imm) |
ORI | ORI Rs2,Rs1,#imm | 0D | 00 | Rs2 = Rs1 | se(imm) |
XORI | XORI Rs2,Rs1,#imm | 0E | 00 | Rs2 = Rs1 ^ se(imm) |
JR | JR Rs1 | 12 | 00 | PC = Rs1 |
SLLI | SLLI Rs2,Rs1,#imm | 14 | 00 | Rs2 = Rs1 << (imm) |
SRAI | SRAI Rs2,Rs1,#imm | 17 | 00 | Rs2 = Rs1 >> (imm), pad with Rs1(msb) |
SEQI | ScI Rs2,Rs1,#imm | 18 | 00 | Rs2 = (Rs1 == se(imm)) ? 1 : 0 |
SNEI | ScI Rs2,Rs1,#imm | 19 | 00 | Rs2 = (Rs1 != se(imm)) ? 1 : 0 |
SLTI | ScI Rs2,Rs1,#imm | 1A | 00 | Rs2 = (Rs1 < se(imm)) ? 1 : 0 |
SGTI | ScI Rs2,Rs1,#imm | 1B | 00 | Rs2 = (Rs1 > se(imm)) ? 1 : 0 |
SLEI | ScI Rs2,Rs1,#imm | 1C | 00 | Rs2 = (Rs1 <= se(imm)) ? 1 : 0 |
SGEI | ScI Rs2,Rs1,#imm | 1D | 00 | Rs2 = (Rs1 >= se(imm)) ? 1 : 0 |
LW | LW Rn,src | 23 | 00 | Rs2 = mem[Rs1 + se(imm)] |
SW | SW Rn,dst | 2B | 00 | mem[Rs1 + se(imm)] = Rs2 |
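The description column can be turned into a small software reference model for checking simulation results. The following is a minimal Python sketch, not part of the RTL: it covers only the shift, add/subtract, logic, and set-conditional rows, keyed by the opc2 values from the table. The names `alu`, `R_TYPE`, and `to_signed` are made up for this illustration. It also assumes that shifts use only the low 5 bits of the shift amount, that ADD/SUB wrap on overflow, and that the comparisons are signed; none of this is stated in the table.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

# opc2 (hex, as listed in the table) -> function computing Rd from (a, b),
# where a = Rs1 and b = Rs2 (or the shift amount sa for the shifts).
R_TYPE = {
    0x04: lambda a, b: (a << (b & 31)) & MASK32,             # SLL
    0x06: lambda a, b: (a & MASK32) >> (b & 31),             # SRL: pad with zeros
    0x07: lambda a, b: (to_signed(a) >> (b & 31)) & MASK32,  # SRA: pad with Rs1(msb)
    0x20: lambda a, b: (a + b) & MASK32,                     # ADD (wraps: assumption)
    0x22: lambda a, b: (a - b) & MASK32,                     # SUB (wraps: assumption)
    0x24: lambda a, b: (a & b) & MASK32,                     # AND
    0x25: lambda a, b: (a | b) & MASK32,                     # OR
    0x26: lambda a, b: (a ^ b) & MASK32,                     # XOR
    0x28: lambda a, b: int((a & MASK32) == (b & MASK32)),    # SEQ
    0x29: lambda a, b: int((a & MASK32) != (b & MASK32)),    # SNE
    # Signed comparisons are an assumption of this sketch.
    0x2A: lambda a, b: int(to_signed(a) < to_signed(b)),     # SLT
    0x2B: lambda a, b: int(to_signed(a) > to_signed(b)),     # SGT
    0x2C: lambda a, b: int(to_signed(a) <= to_signed(b)),    # SLE
    0x2D: lambda a, b: int(to_signed(a) >= to_signed(b)),    # SGE
}

def alu(opc2, a, b):
    """Return the 32-bit value that would be written to Rd."""
    return R_TYPE[opc2](a, b)

print(hex(alu(0x07, 0x80000000, 4)))  # SRA -> 0xf8000000
print(hex(alu(0x06, 0x80000000, 4)))  # SRL -> 0x8000000
```

The two printed values show the only difference between SRA and SRL: which bit is shifted in from the left.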
Instruction formats. From a hardware design standpoint, instruction formats are the primary concern because they dictate how the controller works.
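To make the idea of a format concrete, here is a minimal Python sketch of packing and unpacking a 32-bit instruction word. The bit layout is an assumption, not taken from this design: it is the classic DLX-style arrangement (6-bit opc1 in bits 31:26, Rs1 in 25:21, Rs2 in 20:16, Rd in 15:11, 6-bit opc2 in bits 5:0, with 16-bit and 26-bit immediates in the low bits). The 32 CPU registers and the SE16/SE26 blocks are consistent with these field widths, but the actual field order may differ. The names `encode_r`, `decode`, and `sign_extend` are hypothetical.

```python
def sign_extend(value, bits):
    """Sign-extend the low `bits` bits of value to a Python int."""
    sign = 1 << (bits - 1)
    value &= (1 << bits) - 1
    return (value ^ sign) - sign

def encode_r(opc2, rd, rs1, rs2):
    """Pack a three-register R-type instruction (opc1 = 00). Assumed layout."""
    return (0x00 << 26) | (rs1 << 21) | (rs2 << 16) | (rd << 11) | (opc2 & 0x3F)

def decode(word):
    """Split a 32-bit word into every field any format might use."""
    return {
        "opc1": (word >> 26) & 0x3F,
        "rs1": (word >> 21) & 0x1F,
        "rs2": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "opc2": word & 0x3F,
        "imm16": sign_extend(word, 16),  # what an SE16-style block would output
        "imm26": sign_extend(word, 26),  # what an SE26-style block would output
    }

word = encode_r(0x20, rd=3, rs1=1, rs2=2)  # ADD R3,R1,R2
print(hex(word))  # 0x221820 under the assumed layout
print(decode(word)["rd"])  # 3
```

Decoding all fields unconditionally mirrors what the hardware does: every field is extracted in parallel, and the controller decides which ones matter for the current format.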
The following schematic was used as a reference when writing the top module (Version 1.0, 13 June 2013).
Building blocks:
- Flip-flops (registers dividing the design into 5 stages)
- Multiplexers
- Adders
- Memory blocks
  - CPU registers ( reg [31:0] cpu_regs [0:31]; )
  - Instruction memory (supports up to 2^32 addresses, 4 GB of RAM)
  - Data memory (supports up to 2^32 addresses, 4 GB of RAM)
- ALU
- Sign extension blocks
  - SE16
  - SE26
  - UNS
- Controller (implemented as a Moore FSM with 9 states)
/* ALL INSTRUCTIONS CLASSIFIED BY FORMAT (9 formats correspond to 9 states)
----------------------------------------------------------------------------------------
R1: MULT MULTU DIV DIVU ADD ADDU SUB SUBU AND OR XOR SEQ SNE SLT SGT SLE SGE
R2: SLL SRL SRA
I1: SLLI SRAI SEQI SNEI SLTI SGTI SLEI SGEI SRLI ADDI ADDUI SUBI SUBUI ANDI ORI XORI
I2: BEQZ BNEQ
I3: SW
I4: LW
I5: JR
J1: J
J2: JAL
*/
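The classification above is essentially a lookup from mnemonic to controller state. A minimal Python sketch of that lookup follows. The format names and instruction lists are copied from the comment; the names `FORMATS` and `FORMAT_OF` are made up for this illustration, and the real controller decodes opcode bits rather than strings.

```python
# Format name -> mnemonics, copied from the classification comment.
FORMATS = {
    "R1": "MULT MULTU DIV DIVU ADD ADDU SUB SUBU AND OR XOR "
          "SEQ SNE SLT SGT SLE SGE",
    "R2": "SLL SRL SRA",
    "I1": "SLLI SRAI SEQI SNEI SLTI SGTI SLEI SGEI "
          "SRLI ADDI ADDUI SUBI SUBUI ANDI ORI XORI",
    "I2": "BEQZ BNEQ",
    "I3": "SW",
    "I4": "LW",
    "I5": "JR",
    "J1": "J",
    "J2": "JAL",
}

# Inverted view: mnemonic -> format (one controller state per format).
FORMAT_OF = {
    mnemonic: fmt
    for fmt, names in FORMATS.items()
    for mnemonic in names.split()
}

print(len(FORMATS))      # 9 formats, matching the 9 FSM states
print(len(FORMAT_OF))    # 43 instructions, matching the instruction table
print(FORMAT_OF["SRA"])  # R2
```

A table like this is also a cheap consistency check: the instruction count should match the number of rows in the instruction set table.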
Buses (no documentation format has been settled on yet): buses are 32 bits wide unless otherwise specified; multiplexer select lines are either 1 or 2 bits wide, depending on the number of mux inputs.
- b1 - program counter
- b2 - instruction [internally connected to b6]
- b3 - program count incremented to the next instruction [internally connected to b5]
- b4 - next program count
- b5 - program count incremented to the next instruction
- b6 - instruction
- b7 - destination register for writing data back to CPU registers
- b8 - data from source register 1
- b9 - sign extension [16-bit to 32-bit]
- b10 - data from source register 2 [internally connected to b14]
- b11 - ALU input B [internally connected to b13]
- b12 - ALU input A [internally connected to b8]
- b13 - ALU input B
- b14 - data from source register 2 [internally connected to b20]
- b15 - ALU op code
- b16 - ALU output [internally connected to b19]
- b17 - ALU input A
- b18 - program counter [used for jump instructions] (TODO: check the jump instruction datapath)
- b19 - ALU output [internally connected to b29]
- b20 - data memory address
- b21 - data memory output [internally connected to b30]
- b22 - __unused__
- b23 - __unused__
- b24 - __unused__
- b25 - __unused__
- b26 - sign extension [26-bit to 32-bit]
- b27 - sign extension [16-bit to 32-bit]
- b28 - ALU input B
- b29 - ALU output
- b30 - data memory output
- b31 - __unused__
- b32 - __unused__
- b33 - data written back to CPU registers
- b34 - destination register for writing data back to CPU registers (either current or delayed)
- b35 - write enable for CPU registers (either current or delayed)
- b36 - data (or return/link address) written back to CPU registers
- wb_sel - select line for write back mux
- we - write enable for CPU registers
- mem_write - write enable for data memory
- mem_read - read enable for data memory
- alu_func - select line for ALU function mux
- pc_sel - select lines for the PC mux which chooses how to update PC
- bypass - select lines for the bypass mux which chooses which data goes into ALU input A
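What the PC mux has to choose between follows from the jump and branch rows of the instruction table. The Python sketch below models only that choice. It ignores pipeline timing (the stage in which a branch is resolved) and the actual pc_sel encoding, neither of which is documented here. The function name `next_pc` is hypothetical, and `imm` is assumed to be already sign-extended.

```python
MASK32 = 0xFFFFFFFF

def next_pc(mnemonic, pc, rs1=0, imm=0):
    """Return the next PC for one instruction, per the instruction table."""
    seq = pc + 4  # program count incremented to the next instruction
    if mnemonic in ("J", "JAL"):
        target = seq + imm                        # PC-relative jump
    elif mnemonic == "BEQZ":
        target = seq + imm if rs1 == 0 else seq   # taken when Rs1 == 0
    elif mnemonic in ("BNEQ", "BNEZ"):            # both spellings appear in the table
        target = seq + imm if rs1 != 0 else seq   # taken when Rs1 != 0
    elif mnemonic == "JR":
        target = rs1                              # absolute jump to register value
    else:
        target = seq                              # everything else falls through
    return target & MASK32

print(hex(next_pc("ADD", 0x100)))                 # 0x104
print(hex(next_pc("BEQZ", 0x100, rs1=0, imm=-8))) # 0xfc
print(hex(next_pc("JR", 0x100, rs1=0x40)))        # 0x40
```

There are three distinct sources for the next PC here (PC+4, PC+4 plus an offset, and a register value), which is consistent with a select line that is 2 bits wide.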
Sample test: the CPU registers were initialized with the values shown below, instructions were placed in the instruction memory, and the design was run in the Synopsys VCS simulator to display how data propagates through the CPU pipeline. Although the result matches the expected output, this is by no means enough to conclude that the CPU works properly.
A linear testbench was written to monitor the following nodes (note that there is a 1-cycle delay between adjacent stages):
- IF stage:
  - b1 - program counter
- ID stage (labeled DE, for DECODE, in the picture; the stage names are used inconsistently):
  - b6 - 32-bit instruction fetched from instruction memory
- EXE stage:
  - b15 - ALU opcode
  - b13 - ALU input B
  - b17 - ALU input A
- MEM stage:
  - b19 - ALU output (data in)
  - b20 - address
  - b21 - data out
- WB stage:
  - b33 - result written back to the CPU registers
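The 1-cycle delay between adjacent stages determines where each instruction should show up in the monitored output. As a rough aid for reading such a trace, the Python sketch below lists which instruction occupies which stage in every cycle. It assumes an ideal pipeline with no stalls, hazards, or taken branches, which the real design may not satisfy. The name `pipeline_trace` is made up for this illustration.

```python
STAGES = ["IF", "ID", "EXE", "MEM", "WB"]

def pipeline_trace(instructions):
    """Return one dict per cycle mapping stage name -> instruction or None."""
    total_cycles = len(instructions) + len(STAGES) - 1
    trace = []
    for cycle in range(total_cycles):
        row = {}
        for depth, stage in enumerate(STAGES):
            # An instruction fetched in cycle i reaches stage `depth`
            # in cycle i + depth (1-cycle delay per stage).
            idx = cycle - depth
            row[stage] = instructions[idx] if 0 <= idx < len(instructions) else None
        trace.append(row)
    return trace

for cycle, row in enumerate(pipeline_trace(["ADD", "SUB", "AND"])):
    print(cycle, row)
```

With three instructions the trace is 7 cycles long, and the first result reaches WB in cycle 4 (counting from 0), so a value seen on b33 belongs to the instruction that was on b1 four cycles earlier.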