So I said in my last post about Oops that it “ballooned in scope significantly.” Originally, Mickey was just a compiler to MIC-1 assembly. Turning assembly into bytes wasn’t within the scope of the project.

This caused 2 problems. Firstly, I had no unit tests, but needed some. I could test for some desired assembly to be output, but that felt like it was focusing too much on the details and not enough on the big picture. I don’t care what specific instructions are used to multiply 2 numbers together, what I care about is making sure 6 * 7 = 42. Secondly, the reason why I quit working on the original iteration of Mickey: I didn’t know how to deal with structs.

When you compile something like x.value, you need to calculate the offset that the value field has within whatever type x is. But we’re compiling, and we’re done with type checking, we don’t need that info any more right? But then how do we figure out what x’s type is so that we can find the field offset?

How type checking worked in the original Mickey: the AST is put through a function called type_check which type checks statements, making sure functions return the right type, global variables have values of the right type assigned to them, etc. There’s also a function called type_of which determines the type of an expression. Statements don’t have types, but expressions do. And after type checking, all that expression type info is thrown away… that was my biggest mistake.

I didn’t understand this at the time though. I was relatively pleased with my progress on Mickey, as in my mind creating it was mainly an exercise in learning how compilers work and not anything serious, so I shelved the project for like 2 months.

What motivated me to come back was actually a question posed by a friend. “Do you think it’d be possible to write Evil Hangman for MIC-1?”

Evil Hangman is a name that makes any UML CS student shudder. It is the culminating project of the Computing II Lab, where you have to write a Hangman program that constantly changes the word being guessed… in C, with no libraries. This entails things like writing implementations of AVL trees and variable-length strings, all by yourself. I think I remember my Computing II professor saying that on average around 40% of students don’t finish the project on time.

With data structures being such a big focus of that project, if I wanted to make a language that could implement it, custom user-defined data types were an absolute must. Mickey in its 1st implementation could have probably supported it, but it would have taken so much effort and restructuring of the code I decided it would be easier to just start over again. Plus, as I said earlier, I needed unit tests.

But if I want to test the behavior of a compiled program and not just its implementation, I’d need to write a MIC-1 simulator of my own. Mickey was written in OCaml and the MIC-1 simulator developed at UML was written in C. Maybe I could have tried to dig through the source code and figure out what functions I’d need to FFI, but once again, I figured doing it myself would probably be easier. Plus, I’ve written assembly-level CPU simulators before, so it wouldn’t be a big deal.

OK, so I’d now have the ability to easily run assembled code from my tests. I could start with some Mickey code, compile it to assembly, and then… oh wait, I’d have to turn the assembly into bytes.

The MIC-1 tools we used during Computer Architecture did include an assembler, but again, it was written in C. Didn’t want to bother dealing with that, so fine, I’ll write an assembler too.

We aren’t done yet! In the original Mickey implementation, we awkwardly insert an assembly “prelude” at the beginning of every file after compiling it to assembly. This sets up stuff like the frame pointer, global variables, and calls the main function. Compiling this code that rarely ever changes multiple times sounds wasteful, right? Guess we’ll need a linker too…

So, Mickey now consists of 4 mostly separate tools: the compiler (Mickey -> assembly), the assembler (assembly -> object file), the linker (object files -> executable), and the simulator (runs the executable). I’ll now go through the pipeline in more detail and try to explain what actually happens at each step.

Lexing and parsing

During lexing, the Mickey source code is split up into tokens, which are then parsed to create an abstract syntax tree (or sometimes a very unhelpful error message).

Between the original Mickey implementation and this one, some syntax changed. The factorial function would have previously been:

fun factorial(x: int): int
begin
	if x <= 0 then
		return 1
	else
		return x * factorial(x - 1)
	end
end

But is now:

fun factorial(x: int): int {
	if(x <= 0)
		1
	else
		x * factorial(x - 1)
}

A bit more C-like, with braces around function bodies and parentheses around if/while conditions. Do notice the lack of returns now though: explicit returns in the original caused a lot of headaches during type checking. You’d have to search through the function body for every return and make sure they were all the right type. And what if a function returned nothing, i.e. had no return statement at all? The “return the last expression” approach is definitely a bit more restrictive, but it makes things easier overall (for me, probably not any potential language users though).

Preprocessing

Mickey has a preprocessor, but at the moment it’s a little underused. Which is to say, all it does right now is handle importing other files. When an import statement is encountered, it just replaces the SImport node in the AST with the entire AST of the file it’s referencing. I want to add macros too (and unlike in C they’d be AST-level replacements and not just text-level), even if only to give this module a bit more usage, but they don’t feel like a very important feature to me. They’d probably go unused.

Type checking

In the beginning, type checking was actually the same as it was in the original implementation. Type info was thrown away after type checking expressions. This was fine, until I tried to add custom types.

I don’t exactly like the way that I solved it, but apparently it’s the standard way to do it: annotated ASTs.

There are actually 2 types for expression ASTs:

type 'a t = 'a exp * 'a
and 'a exp =
| EInt of int
| ECall of string * 'a t list
| EVar of string
| EUnary of unary_op * 'a t
| EBinary of 'a t * binary_op * 'a t
| ESet of 'a t * 'a t
| EBreak of 'a t
| EBool of bool
| EIf of 'a t * 'a t * 'a t
| EWhile of 'a t * 'a t
| EAs of 'a t * Type.t
| EUnit
| EString of string
| EBlock of 'a t list
| EIndex of 'a t * 'a t
| EChar of string (* string since it's unescaped; the assembler will escape it *)
| EStruct of string * (string * 'a t) list
| EDot of 'a t * string
| EArray of 'a t list
| EAddrOf of 'a t

Exp.exp is just a normal expression AST. 'a Exp.t is an expression and… something else. The parser creates something of type unit Exp.t, so each AST node is effectively bundled with nothing. No new info is added.

During type checking however, after an expression is type checked, its type is added to the tree as an annotation of that node. The function to type check an entire program (Type_check.annotate) takes a unit Stmt.t (and of course statement ASTs have to carry the annotation type of the expressions it depends on) and returns a Type.t Stmt.t. The quite annoying thing is that 99% of the time, the non-annotation, actual AST part of an Exp.t doesn’t change during type checking. The only time it does when we have type casts: 0 as bool turns from (EAs (EInt 0, TBool), ()) to (EInt 0, TBool) after type checking. 0, of course, isn’t actually a boolean, but it’s good enough to just say it is. The annotate function also takes as arguments a type environment, mapping defined names to types, and type definitions, mapping defined type names to their definitions. When new variables/types are declared, they are added to these structures. I think what I call the “type definitions” is usually called the “type environment” but I’m not sure what else to call it.

“Program”

I have no idea what to call this stage. In the original Mickey I had this awkward part during code generation:

Printf.fprintf output "%s" global_init;
Printf.fprintf output "%s\n" (In_channel.open_text "prelude.asm" |> In_channel.input_all);
Printf.fprintf output "%s" globals;
Printf.fprintf output "%s" code;
Printf.fprintf output "\nheap:\n1\n0\n0";

There’s different “sections” of the code we have to consider: code for initializing the globals, then there’s the prelude, then space for the globals, the actual code, and then a label for the start of the heap.

To solve this, I created a type called Program.t which lets you add instructions, global constants, new labels, etc. without having to worry about where in the code all that stuff will be positioned. So, you can call Program.add_constant program "x" (CInt 5) and know that when the assembly is generated, there will be a global int constant called x declared at the top of the program. I think in LLVM this kind of construct is called a “builder.”

Compiling

The “Program” is used as a space to hold compiled code. Compilation works pretty much as you’d expect, recursing down the AST and generating code. The compilation functions also take the type definitions generated during the type checking so that we can solve the struct problem faced by the original implementation. Now, when we compile x.field, the left hand side’s type is right there in the AST. We can then look up its type in the type definitions, which will give us a map from fields to their offsets.

When string and array literals are compiled, their values have to be put into global data since they’re pointers. This leads to the somewhat awkward restriction that array literals can only contain “simple” values like ints, chars, booleans, etc. and not even do things like reference variables or use function calls.

After everything is fully compiled, a function is used to turn the code contained within the Program to a string containing assembly code.

Assembling

Assembling works mostly as you’d expect, taking the instruction mnemonics and arguments and turning them into bytes. But the result of assembling is not yet a runnable executable, instead just an object file with the labels left unresolved until linking.

Linking

Linking objects together was something very hard for me to figure out how to do as there are very few tutorials on the Internet or in books about how to write a linker. Even something like object file formats are not discussed very much, and looking at real formats doesn’t help since they’re all way too complex and contain features I don’t need.

This comment from common/object.ml describes the relatively simple object format I ended up using:

uint16 number of strings
uint16 length of symbol table
uint16 length of relocation table

strings:
uint8 length
uint8 char
...
length
char
...
length
char
...

symbol table:
uint16 index (into the strings)
uint16 offset (where in the code is it defined, what is its actual value?)
index
offset
index
offset
...

relocation table:
uint16 offset (in the code where the label is referenced)
uint16 index
...

code:
desp 1
loco 0
stol 0
lodl 2
jzer mulEnd (all labels are replaced with 0; if an address genuinely is a number it just wont have a table entry)
subd c1
...

It starts with 3 numbers that give the number of labels, which are stored as strings in a Pascal-like length-prefixed format in order to make reading them from bytes easier, the size of the symbol table, which stores the definitions of labels and their addresses, and then the size of the relocation table, which gives the addresses where those labels are used or referenced. After the strings, symbol table, and relocation table comes the assembled machine code. Whenever a label is referenced in the code, its value is replaced here with all 0s, which will be patched in when the final executable is produced.

When linking 2 objects, we create a new object that’s initially equal to the 1st object given (so order does matter). For each (label, address) pair in the 2nd object’s symbol table, we simply add the size of the 1st object’s code to the address and then add it to the shared table. This has the consequence of overwriting any earlier defined labels with the same name, making object-local labels possible. For the relocation table, we do a similar thing: each (address, label) pair gets the size of the 1st object’s code adding to the address. As for the code itself, the bytes of the 1st object’s code and 2nd object’s are just concatenated together. After the combined object is created, we patch in the possible relocations. When a label has been defined, its address is patched into all the instructions that use it. Then, those entries are removed from the relocation table so that they don’t get patched again.

To link n objects, we essentially just fold over the list of objects: link obj1 obj2 obj3 ... objn = link2 (link2 (link2 (link2 obj1 obj2) obj3) ...) objn. When the final object is produced, if there are any entries left in the relocation table, that means there are labels used that haven’t been defined. That’s an error, and we’ll point out to the user that the object wasn’t able to be fully resolved due to undefined labels.

When running the Mickey script, it’ll automatically link the object you generated with the prelude, which contains assembly functions necessary for the whole thing to work, the standard library, which is actually written entirely in Mickey and includes things like a memory allocator, I/O functions, some primitive error handling via abort, a function for converting ints to strings, and a neat function called heap_report that lets you see the status of all the blocks in the heap, including how big they are, whether they’re being used or not, and what data they contain. The free function even consolidates unused blocks together so that you don’t have to worry about fragmentation.

Simulating

The simulator reads in the executable and runs it instruction by instruction. Unlike the UML simulator, it’s not microcode-level and has no support for custom microcode. When the HALT instruction is encountered, it puts the user into a debugger where they can do things like get values from registers and read from/write to memory.

To support I/O, the simulator state has user-addable read/write callbacks, which are functions that get called when certain addresses are read to or written from. A buffer of inputted characters can be used to simulate asynchronous, non-blocking input to some degree, and that’s what this simulator uses.

Evil Hangman: is it possible?

The entire goal of this project, as I said previously, was to see if Evil Hangman could be ported to MIC-1. As of writing this post, I haven’t done that yet, and honestly I’m a little skeptical if it is, quite simply because the MIC-1 is limited to 4 KB of memory. Writing a simple program that uses a binary search tree to sort a list of numbers already takes up more than half of that. Of course, there’s also the fact that Evil Hangman would need a long list of strings for the guessable words, as well as heap space for the various allocated structures like AVL trees and vectors of strings. With the generated code taking up so much space, clearly I would need to add some heavy optimizations to Mickey to make this feasible. In conclusion, Evil Hangman is theoretically possible to implement in Mickey, but I feel like the way to implement it with the greatest chance of fitting in 4 KB would be via purely hand-written assembly. Writing it in C was already bad enough, I’m definitely not doing that.

I think Mickey is truly finished now. With the previous implementation, I got to a point that I thought was good enough at the time but wasn’t truly pleased with it overall: I really wanted custom data types. I’ve added that and more, and I think Mickey is now an actually usable language with most of the things you’d expect from a low-level language like this… except for anything to do with bit manipulation, because the MIC-1 architecture doesn’t have any instructions for those and they’re very hard to do purely arithmetically. In case you want to see the insanity involved in literally just trying to flip 1 bit, take a look at the alloc and free functions in stdlib.mky.