Mickey
In UML’s Computer Architecture course, we learn how CPUs work by examining the MIC-1 architecture invented by Andrew Tanenbaum. From what I understand, the most popular educational architecture of this kind is the LC-3, and the MIC-1 seems to be used very rarely in comparison. In fact, the MIC-1 Wikipedia page mentions the simulator and assembler our professors developed.
When I took COMP3080 a year ago, I was forced to deal with the tedium of writing assembly and even microcode by hand. I began to wonder if it was possible to write a programming language for such an underpowered architecture, but I didn’t really know enough about compilers at the time to make much progress on this idea.
A while ago, I started work on Mickey, a language which compiles to MIC-1 assembly. It is meant to be similar to C, but with more modern syntax. Here is a code sample:
fun factorial(x: int): int
begin
    if x <= 0 then
        return 1
    else
        return x * factorial(x - 1)
    end
end

fun main(): int
begin
    return factorial(8)
end
Mickey features roughly what you’d expect from a compiled language like this: functions, stack-allocated local variables, static typing, type casting, arithmetic operations (multiplication and division are done in software), comparisons, logical operations, pointers, even a memory allocator. Strings are a work in progress. Hopefully coming soon are structs, type definitions, and maybe even floating-point math. Once those are done, I’ll consider the language complete. In the meantime, let’s explore how exactly this language works.
Lexing and parsing
The whole language is implemented in OCaml. We start with a lexer and parser, which are generated by ocamllex and Menhir respectively. I’m not a huge fan of Menhir: it has a crazy learning curve and might as well not have error messages. Seriously: when you have a syntax error, it will always just give you Fatal error: exception Mickey.Parser.MenhirBasics.Error. No filenames, no line numbers, not even the grammar rule that was violated. “There was an error somewhere, you figure out the rest.”
The parser takes in text from a file and produces an AST. The AST is an algebraic data type, with each constructor representing one kind of statement. Expressions and statements are segregated into two separate types. I’m not sure if this is necessary or even a good idea, but that’s how I set it up. My main programming language, Oops, has only one type for ASTs even though it also has a statement/expression distinction (though its only statements are assignments and definitions).
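To make that split concrete, here’s roughly what such a pair of types could look like in OCaml. This is an illustrative sketch with made-up constructor names, not Mickey’s actual AST:

(* Purely illustrative sketch of a split statement/expression AST;
   the constructor names are made up and do not match the real source. *)
type typ =
  | TInt
  | TBool
  | TVoid
  | TPtr of typ
  | TFun of typ list * typ   (* argument types and return type *)
  | TNoReturn                (* "this code does not return" *)

type exp =
  | IntLit of int
  | BoolLit of bool
  | Var of string
  | BinOp of string * exp * exp        (* e.g. "+", "*", "<=" *)
  | Cast of typ * exp
  | CallExp of string * exp list       (* calls show up as expressions... *)

type stmt =
  | Assign of string * exp
  | If of exp * stmt list * stmt list
  | While of exp * stmt list
  | Return of exp option
  | CallStmt of string * exp list      (* ...and as statements, hence the duplication *)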
Type checking
The next step is type checking the AST. First, the definitions present in the tree are processed to create a type environment, which maps global variable and function names to their types. Then the actual type checking happens. The function type_check checks statements while type_of checks expressions. Both take an AST node and a type environment, and pattern match on the node to figure out what its type is supposed to be. For most statements, this type is TNoReturn, which is the type of a function body that doesn’t return anything. We don’t want a function body to end up with that type: even functions that return void should return explicitly. This pass also checks that values assigned to variables have the types they’re declared with, and that if and while conditions are booleans.
For expressions: literal booleans, integers, and strings have obvious types. Variables are whatever type they’re declared to be. Type casts let you claim that any value has any type, which probably isn’t good, but I need it for the memory allocator. type_of_call determines the type of a function call; it exists to avoid duplicating code, because function calls are both statements and expressions. See why separating them might not have been a good idea? Most of the time, determining a function call’s type is trivial: it’s whatever the return type is. But we need to make sure that whatever is returned actually lines up with what the programmer says it is. We also need to make sure that the arguments are of the correct types, and that the thing being called is a function in the first place. Like in C, functions in Mickey can only be referred to by name: there are no lambdas or closures. However, I store functions and normal variables in the same type environment, which means you could theoretically call anything if you cast it to a function type (though that’s syntactically impossible).
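Putting the last two paragraphs together, a heavily simplified checker could look something like this (again building on the made-up AST sketch above, not the compiler’s real code):

(* Simplified sketch: the type environment maps names to types. *)
module Env = Map.Make (String)

(* type_of: what type does an expression have? *)
let rec type_of (env : typ Env.t) (e : exp) : typ =
  match e with
  | IntLit _ -> TInt
  | BoolLit _ -> TBool
  | Var x -> Env.find x env
  | Cast (t, _) -> t                                   (* casts are trusted *)
  | BinOp (("<" | "<=" | "=="), a, b) ->
      expect env a TInt; expect env b TInt; TBool
  | BinOp (_, a, b) ->
      expect env a TInt; expect env b TInt; TInt
  | CallExp (f, args) -> type_of_call env f args

and expect env e t =
  if type_of env e <> t then failwith "type error"

(* type_of_call: shared by call statements and call expressions. *)
and type_of_call env f args =
  match Env.find_opt f env with
  | Some (TFun (param_types, ret)) ->
      if List.length args <> List.length param_types then failwith "arity mismatch";
      List.iter2 (fun a t -> expect env a t) args param_types;
      ret
  | Some _ -> failwith (f ^ " is not a function")
  | None -> failwith ("unknown function " ^ f)

(* type_check: most statements end up with type TNoReturn. *)
and type_check env (s : stmt) : typ =
  match s with
  | Return (Some e) -> type_of env e
  | Return None -> TVoid
  | Assign (x, e) -> expect env e (Env.find x env); TNoReturn
  | If (c, then_, else_) ->
      expect env c TBool;
      List.iter (fun s -> ignore (type_check env s)) (then_ @ else_);
      TNoReturn
  | While (c, body) ->
      expect env c TBool;
      List.iter (fun s -> ignore (type_check env s)) body;
      TNoReturn
  | CallStmt (f, args) -> ignore (type_of_call env f args); TNoReturn

The real checker also has to deal with locals blocks, the explicit-return rule for void functions, pointers, and so on, but this is the general shape.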
Compilation
After everything has been type checked and verified to be okay, compilation happens, split across 4 separate functions: compile_global_init, compile_globals, compile_stmt, and compile_exp. In the final .asm file produced by compilation, things have to be laid out in a very specific order: first the code that initializes the global variables; then a call to the main function; then the prelude, which includes definitions for operators, the memory allocation functions alloc and free, and functions to get and set local variables and function arguments (why these are needed will be discussed shortly); then space for the global variables; then the user’s code; and finally space for the heap.
compile_global_init compiles only the code which is responsible for initializing global variables. If such code is present, it must be inside a globals block at the beginning of a file:
globals
    x: int # all variable declarations and definitions are separate, and new variables can only be declared in a globals block or in the locals block of a function definition
begin
    x = 5
end
If a file imports another file, these global init blocks will all be placed one after the other.
compile_globals just emits assembler directives to allocate space for global variables.
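Since every Mickey value is one word, this amounts to one labelled, zero-initialized word per variable. A minimal sketch, assuming the assembler accepts a label followed by a literal value on the next line (the same layout the heap label uses later on); the real function may differ:

(* Minimal sketch: each global just becomes a label followed by a single
   zero-initialized word of storage. *)
let compile_globals (names : string list) : string =
  names |> List.map (Printf.sprintf "%s:\n0\n") |> String.concat ""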
compile_stmt and compile_exp are where most of the work happens. These functions take an AST and a context, which stores the stack offsets of local variables and function arguments. Local variables were absolute hell to get working. The MIC-1 has exactly one general-purpose register, the accumulator, so I decided to use the stack for all operands. However, the MIC-1 has no frame pointer register, so in order to have a consistent offset for locals and function arguments, I had to dedicate a special memory address to storing one, along with assembly routines called getlocal and setlocal that load and store data relative to the frame pointer.
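As a purely illustrative sketch of how that might fit together (the real calling convention for getlocal is whatever the prelude defines, which may well differ from this):

(* Sketch: the compilation context records the frame-pointer-relative
   offset of every local variable and function argument. *)
type context = { locals : (string * int) list }

(* Loading a local, under an assumed convention: put the offset in the
   accumulator, then call the getlocal helper from the prelude, which adds
   the frame pointer (kept at its fixed memory address) and loads that
   word back into the accumulator. *)
let compile_var_load (ctx : context) (name : string) : string =
  let offset = List.assoc name ctx.locals in
  Printf.sprintf "LOCO %d\nCALL getlocal\n" offset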
Aside from that, some other interesting stuff has to happen for constants. The MIC-1 only allows loading immediate values in the range 0 <= x < 4096, so any numbers outside that range need labels devoted to them. Functions also get a special prologue to adjust the frame pointer.
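For constants, the emitted code might look something like the following sketch, using the textbook Mac-1 mnemonics LOCO (load constant) and LODD (load direct); the exact mnemonics and label scheme in Mickey’s real output may differ:

(* Sketch: constants that fit in the 12-bit immediate are loaded with LOCO;
   anything else gets a dedicated label whose word holds the value, loaded
   with LODD. Returns the load code plus any data word to emit alongside
   the globals. *)
let const_counter = ref 0

let load_constant (n : int) : string * string =
  if n >= 0 && n < 4096 then
    (Printf.sprintf "LOCO %d\n" n, "")
  else begin
    incr const_counter;
    let label = Printf.sprintf "const%d" !const_counter in
    (Printf.sprintf "LODD %s\n" label, Printf.sprintf "%s:\n%d\n" label n)
  end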
When all is said and done, the final output of compilation is 3 strings containing MIC-1 assembly. Remember that they have to be arranged in a particular order with some stuff inserted between them.
Printf.fprintf output "%s" global_init;
Printf.fprintf output "%s\n" (In_channel.open_text "prelude.asm" |> In_channel.input_all);
Printf.fprintf output "%s" globals;
Printf.fprintf output "%s" code;
Printf.fprintf output "\nheap:\n1\n0\n0";
From that point, you can assemble it with the tools our professors created and then run it in the simulator.
Conclusions and the future
I’ve had the idea for this project for a long time, but ultimately I think it serves mostly as a chance to get some practical experience writing a compiler before I take on the much larger, longer-term project of converting Oops, the main language I’m working on, from a tree-walking interpreter into a compiler.
I actually wrote most of this post, up to the Compilation section, around a month ago, but since then I’ve run into quite a big problem while working on it some more. With the way the compiler and type checker are set up, it’s going to be extremely hard to add support for user-defined types. Other limitations, like every value being limited to one word in size (either an integer or a pointer), will certainly cause problems in that regard too.
The limitations of the MIC-1 architecture also caused a lot of headaches in this project: the lack of a frame pointer, the lack of any arithmetic instructions aside from addition, the complete lack of logical instructions, the fact that I had to write my own memory allocator… Part of me wonders if I should have compiled to x86 or MIPS assembly, which I’m also somewhat familiar with, or to virtual machine bytecode that could then be run on the MIC-1. But then I’d have to write a virtual machine in assembly.
Despite lacking user-defined types, strings, and whatever else I wanted to achieve, Mickey is still somewhat usable as an actual programming language. It has functions, local variables, pointers, dynamic memory allocation, and static type checking. That’s good enough for me, and it’s probably also good enough to solve some of the Computer Architecture assignments. Don’t tell Moloney that, though.