Over the years I've done a number of experiments using Verilog, a hardware modeling language. In several of these experiments I have attempted to recreate old CPU designs like the MIT CADR lisp machine and the DEC PDP-8/I. My latest experiment is to recreate the PDP-11, in modern Verilog, using modern simulation techniques.
Note that this has been done before. I know of at least two or three old microcoded versions, and more recently three other groups have done this, but in every case the code is either not in Verilog or is proprietary and closed. Not very helpful.
I have not (yet) delved into SystemC, but I have done some fun work with co-simulation. Most recently I wired my RTL simulation of the pdp-11 in almost-verilog to a "known good" pdp-11 instruction set simulator. The idea is that both the RTL simulation and the instruction set simulator run the same code and at the end of each instruction cycle the results are compared. The "results" are the internal register values, the processor status word and the list of bus operations which occurred (address, type, data).
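The heart of that lock-step comparison can be sketched in a few lines. This is a hypothetical illustration in Python, not the actual harness: the real setup wires the Verilog RTL simulation to an existing PDP-11 instruction set simulator, and the state layout here is assumed for clarity.

```python
def compare_step(rtl_state, iss_state):
    """Compare one instruction's worth of architectural state.

    Each state is assumed to be a dict with:
      'regs' : eight 16-bit register values (R0-R7)
      'psw'  : the processor status word
      'bus'  : ordered list of (address, type, data) bus transactions
    Returns a list of mismatch descriptions; an empty list means the two
    simulations are still in lock step.
    """
    mismatches = []
    for i, (a, b) in enumerate(zip(rtl_state['regs'], iss_state['regs'])):
        if a != b:
            mismatches.append(f"R{i}: rtl={a:06o} iss={b:06o}")
    if rtl_state['psw'] != iss_state['psw']:
        mismatches.append(
            f"PSW: rtl={rtl_state['psw']:06o} iss={iss_state['psw']:06o}")
    if rtl_state['bus'] != iss_state['bus']:
        mismatches.append(
            f"bus: rtl={rtl_state['bus']} iss={iss_state['bus']}")
    return mismatches
```

The driver simply steps both simulations one instruction, calls this, and stops on the first non-empty result, which is what makes a divergence so easy to localize.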
In a perfect world the two simulations will run in lock step and any deviation is a bug. And this is mostly true. The comparison turns out to be extremely helpful and very valuable.
Again, however, this is not new. I learned this technique from others who are smarter than I am.
While attempting to recreate the PDP-11 I ran into a number of interesting problems. The instruction set is fairly simple, but it is not RISC. The effective address computations are complex and in many cases must be done twice, once for the source and once for the destination. Let me supply an example. Here is a list of the 8 addressing modes. A complex instruction can have a source operand (with one of these addressing modes) and a destination operand (with one of these addressing modes). So, in the worst case you need to compute the effective address and do one or more fetches for both the source and the destination.
mode  symbol  ea1   ea2        ea3          data           side-effect
0     R       x     x          x            R              x
1     (R)     R     x          x            M[R]           x
2     (R)+    R     x          x            M[R]           R<-R+2
3     @(R)+   R     M[R]       x            M[M[R]]        R<-R+2
4     -(R)    R-2   x          x            M[R-2]         R<-R-2
5     @-(R)   R-2   M[R-2]     x            M[M[R-2]]      R<-R-2
6     X(R)    PC    M[PC]+R    x            M[M[PC]+R]     x
7     @X(R)   PC    M[PC]+R    M[M[PC]+R]   M[M[M[PC]+R]]  x

Seems complex, yes? Each M[] is a memory read. The basic register indirect is simple. But modes 6 & 7 add the side effect of reading additional operand data from the next instruction location. This increments the PC as well as fetching an offset which gets added to the result of a previous EA calculation.
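The table above can be restated as a behavioral model. This is a sketch in Python for readability, not the Verilog implementation; mem is a hypothetical word-addressed dict standing in for memory, regs is the eight registers with regs[7] as the PC, and word operands are assumed (byte operations change the increment size, which is ignored here).

```python
def operand(mode, r, regs, mem):
    """Fetch the operand for (mode, register), applying side effects."""
    if mode == 0:                          # R      register
        return regs[r]
    if mode == 1:                          # (R)    register deferred
        return mem[regs[r]]
    if mode == 2:                          # (R)+   autoincrement
        ea = regs[r]; regs[r] += 2
        return mem[ea]
    if mode == 3:                          # @(R)+  autoincrement deferred
        ea = regs[r]; regs[r] += 2
        return mem[mem[ea]]
    if mode == 4:                          # -(R)   autodecrement
        regs[r] -= 2
        return mem[regs[r]]
    if mode == 5:                          # @-(R)  autodecrement deferred
        regs[r] -= 2
        return mem[mem[regs[r]]]
    if mode == 6:                          # X(R)   indexed: X fetched at PC
        x = mem[regs[7]]; regs[7] += 2     # the extra fetch bumps the PC
        return mem[(x + regs[r]) & 0xffff]
    if mode == 7:                          # @X(R)  indexed deferred
        x = mem[regs[7]]; regs[7] += 2
        return mem[mem[(x + regs[r]) & 0xffff]]
    raise ValueError("bad addressing mode")
```

Counting the M[] accesses per branch makes the cost of modes 3, 5, 6 and 7 obvious: up to three memory reads before the instruction has even done any work.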
So, how to implement this? My first thought was a complex state machine. After a while I got frustrated and thought it might be easier just to make a machine which recodes the old PDP-11 instructions into new "risc-like" instructions on the fly, a sort of just-in-time binary recompilation. I believe this is how modern-day x86 machines work. The fun idea would be to have several "machines" running ahead, converting the PDP-11 CISC instructions into simple RISC instructions and filling several FIFOs. The RISC engine could then use modern ideas like a multi-stage pipeline, speculative execution and branch prediction. While very cool, I quickly decided that was more complexity than I wanted at this stage.
I do think, however, that it might make sense initially to build a simple "recoding engine" and a simple "risc pipeline". I want to do it and compare the gate count to a state machine version.
So, I set out to do a simple state machine version. I tried to compress the states as much as possible, but I currently feel there has to be a decode state, four states for each operand, an execute state and a write-back state. The four states for each operand can be reduced to as few as one, depending on how the instruction decodes. I tried to eliminate the single EA state for each operand, but instructions like:
mov @(R5)+,@(R5)+
cause problems. Why? Because the value of R5 is incremented twice, once after each EA calculation. If I did the EA and post-increment in one state I would have to special-case the increment (to be 2x) when the source and destination registers were the same. It got to be a big mess. I capitulated, added a state, and reduced the complexity.
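The aliasing can be replayed step by step. This is a hypothetical trace in Python of what mov @(R5)+,@(R5)+ must do; the addresses and values are made up, but the ordering of the increments is exactly the semantics the state machine has to honor.

```python
regs = [0] * 8
regs[5] = 0o1000
mem = {0o1000: 0o2000,   # first pointer, fetched through R5
       0o1002: 0o2002,   # second pointer, fetched through R5 after increment
       0o2000: 0o1234}   # source data

# source EA: read the pointer at M[R5], then post-increment R5
src_ptr = mem[regs[5]]; regs[5] += 2   # R5: 1000 -> 1002
src = mem[src_ptr]                     # M[M[1000]] = M[2000]

# destination EA: R5 has already moved, so a *different* pointer is read
dst_ptr = mem[regs[5]]; regs[5] += 2   # R5: 1002 -> 1004
mem[dst_ptr] = src                     # write lands at M[2002], not M[2000]
```

Folding both post-increments into a single shared state would have to detect that the same register appears on both sides and bump it by 4 at once, which is exactly the special case that got messy.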
I should note here that all pdp-11's, except one, are microcoded. And I can see why.
At some point I do want to try an experiment: adding a pre-fetch unit, keeping at least 3 words available and doing the EA calculations in parallel. The EA calculations will stack up (i.e. stall) queuing for memory reads, but the approach has the potential to be more efficient, especially if there is a cache which does burst reads and the line size is at least 8 bytes.
I know this might all sound crazy, but I've learned a lot in the process and almost everything I have learned has been useful in my day job.