Created a C Compiler, in Java from scratch. This included lexer, parser, AST generator, semantic analyzer, and regalloc. This was an individual project. Some boilerplate code by Christophe Dubach.
Sadly, as this is for a class at McGill, and the course content is static, it is not open source.
A compiler is actually just a massive collection of very straightforward things. There weren't too many crazy algorithms or bits of Java syntax required.
Writing such a large scale program myself gave me practice with abstracting and focusing on different parts of a program at a time, and learning where to look when debugging. In contrast to a larger, group project where it's probably someone else's fault if something was broken outside of the section I'm currently working in, here I had to switch gears if I discovered a bug in something previously written.
Since register allocation is really just graph building and coloring, I got lots of hands-on experience with graph related algorithms. It was also nice to apply theoretical syntax, semantic, and typing rules I've encountered in other classes (although they can be much trickier than they are here).
This project was a big confidence builder in my coding abilities.
Obviously, I learned a lot about how Compilers function behind the scenes.
Surprisingly, I did not learn anything new about C, since I was not actually writing the compiler in C.
By far the most simple part of the compiler, the Lexer generates tokens from input text.
The Tokenizer was basically a long switch statement with checks: whitespace had to be ignored, and comment strings recognized.
The Token object contains additional information, like position and character type (char, int).
A simple Scanner was created, feeding data into the Tokenizer, which then created Token objects.
Verify the syntax of the EBNF Grammar representing a subset of C - "Mini C"
# ::=
# () grouping
# [] optional
# {} zero or more
# + one or more
# | alternative
program ::= (include)* (structdecl)* (vardecl)* (funcdecl)* EOF
include ::= "#include" STRING_LITERAL
structdecl ::= structtype "{" (vardecl)* "}" ";" # structure declaration
vardecl ::= type IDENT ";"
| type IDENT "[" INT_LITERAL "]" ";" # array declaration, e.g. int a[2];
funcdecl ::= type IDENT "(" params ")" block # function declaration
type ::= ("int" | "char" | "void" | structtype) ["*"]
structtype ::= "struct" IDENT
params ::= [ type IDENT ("," type IDENT)* ]
stmt ::= block
| "while" "(" exp ")" stmt # while loop
| "if" "(" exp ")" stmt ["else" stmt] # if then else
| "return" [exp] ";"
| exp "=" exp ";"
| exp ";"
block ::= "{" (vardecl)* (stmt)* "}"
exp ::= "(" exp ")"
| (IDENT | INT_LITERAL)
| ("-" | "!") exp
| CHARLITERAL
| STRING_LITERAL
| exp ("*" | "/" | "%" | "+" | "-" | "<" | ">" | "<=" | ">=" | "==" | "!=" | "&&" | "||") exp # binary operators
| arrayaccess | fieldaccess | valueat | addressof | funcall | sizeof | typecast
funcall ::= IDENT "(" [ exp ("," exp)* ] ")" # function call
arrayaccess ::= exp "[" exp "]" # array access
fieldaccess ::= exp "." IDENT # structure field member access
valueat ::= "*" exp # Value at operator (pointer indirection)
addressof ::= "&" exp # Address-of operator
sizeof ::= "sizeof" "(" type ")" # size of type
typecast ::= "(" type ")" exp # type casting
| Precedence | Operator | Description | Associativity |
|------------|----------|-------------------------------------|-----------------|
| 1 | () | Function call | Left-to-right |
| 1 | [] | Array subscripting | Left-to-right |
| 1 | . | Structure member access | Left-to-right |
| 2 | + | Unary plus | Right-to-left |
| 2 | - | Unary minus | Right-to-left |
| 2 | (type) | Type cast | Right-to-left |
| 2 | * | Indirection | Right-to-left |
| 2 | & | Address of | Right-to-left |
| 3 | * / % | Multiplication, division, remainder | Left-to-right |
| 4 | + - | Addition, subtraction | Left-to-right |
| 5 | < <= > >=| Relational operators | Left-to-right |
| 6 | == != | Relational operators | Left-to-right |
| 7 | && | Logical AND | Left-to-right |
| 8 | \|\| | Logical OR | Left-to-right |
A recursive descent parser was used...
private Program parseProgram() {
parseIncludes();
// THESE ARE ALL LOOPS
List stds = parseStructDecls();
List vds = parseVarDeclsRep();
List fds = parseFunDecls();
expect(TokenClass.EOF);
return new Program(stds, vds, fds);
}
The parser checks syntax and stores parsed info in an AST...
// The program AST node (a list of struct type declarations, list of variable declarations, list of FunDecl definitions)
Program ::= StructTypeDecl* VarDecl* FunDecl*
// Types
Type ::= BaseType | PointerType | StructType | ArrayType
BaseType ::= INT | CHAR | VOID
PointerType ::= Type "*" // use to represent pointers to other types
StructType ::= String // represent a struct type (the String is the name of the declared struct type)
ArrayType ::= Type int // Type represents the element type, int represents the number of elements (number of elements)
// Struct declaration
StructTypeDecl ::= StructType VarDecl*
// Variable declaration
VarDecl ::= Type String
// FunDecl definition (the String is the name of the FunDecl)
FunDecl ::= Type String VarDecl* Block
// Expressions
Expr ::= IntLiteral | StrLiteral | CharLiteral | VarExpr | FunCallExpr
| BinOp | ArrayAccessExpr | FieldAccessExpr | ValueAtExpr
| AddressOfExpr | SizeOfExpr | TypecastExpr
// Literals
IntLiteral ::= int // int stores the value of the integer
StrLiteral ::= String // String stores the value of the String
CharLiteral ::= char // char stores the value of the character
// Variable (the String is the name of the variable)
VarExpr ::= String
// Function call (the String corresponds to the name of the function to call and the Expr* is the list of arguments)
FunCallExpr ::= String Expr*
// Binary operations
BinOp ::= Expr Op Expr
Op ::= ADD | SUB | MUL | DIV | MOD | LT | GT | LE | GE | NE | EQ | OR | AND
// Array access expression : Expr[Expr] (e.g. a[10])
ArrayAccessExpr ::= Expr Expr // the first Expr is the array, the second one the index
// Field access expression : Expr.String (e.g. a.b)
FieldAccessExpr ::= Expr String // the Expr represents the structure, the String represents the name of the field
// Value at expression : *Expr (e.g. *p)
ValueAtExpr ::= Expr
// Address of an expression : &Expr (e.g. &a[1])
AddressOfExpr ::= Expr
// Sizeof expression : sizeof(Type) (e.g. sizeof(int*))
SizeOfExpr ::= Type
// Typecast expression : (Type)Expr (e.g. (int*)malloc(4))
TypecastExpr ::= Type Expr
// Statements
Stmt ::= Block | While | If | Assign | Return | ExprStmt
// An expression statement (e.g. x=2;)
ExprStmt ::= Expr
// While loop statement : while (Expr) Stmt;
While ::= Expr Stmt
// If statement: if (Expr) Stmt1 else Stmt2; (if the second Stmt is null, this means there is no else part)
If ::= Expr Stmt [Stmt]
// Assignment statement : Expr = Expr; (e.g. x[5] = 2;)
Assign ::= Expr Expr
// Return statement : (the Expr is optional)
Return ::= [Expr]
// Block statement (starts with { and ends with } in the source code)
Block ::= VarDecl* Stmt*
All AST nodes are represented with Java classes...
ast
├── AddressOfExpr.java
├── ArrayAccessExpr.java
├── ArrayType.java
├── Assign.java
├── ASTNode.java
├── ASTPrinter.java
├── ASTVisitor.java
├── BaseType.java
├── BinOp.java
├── Block.java
├── ChrLiteral.java
├── Expr.java
├── ExprStmt.java
├── FieldAccessExpr.java
├── FunCallExpr.java
├── FunDecl.java
├── If.java
├── IntLiteral.java
├── Op.java
├── PointerType.java
├── Program.java
├── Return.java
├── SizeOfExpr.java
├── Stmt.java
├── StringLiteral.java
├── StructType.java
├── StructTypeDecl.java
├── Type.java
├── TypecastExpr.java
├── ValueAtExpr.java
├── VarDecl.java
├── VarExpr.java
└── While.java
The following analysis was done on the AST:
Name Analysis: Ensure that the scoping and visibility rules of the language are respected.
void print_s(char* s);
void print_i(int i);
void print_c(char c);
char read_c();
int read_i();
void* mcmalloc(int size);
After verifying the semantics, code needs to be generated to execute the program. The program is compiled into MIPS in order to use MARS, a cross platform MIPS simulator used for educational purposes.
We will now produce temporary MIPS code, using virtual registers. Virtual registers are representations of real registers, but there is no limit to how many can be used. They can be naively interacted with by saving/loading their contents onto the stack every time they are used, however this is very slow.
Global variables are statically stored in memory using the MIPS .data directive.
Local variables are stored in memory on the stack, using $sp manipulation.
Data padding is an important consideration, as sizeof(char) == 1, sizeof(int) == 4, sizeof(int*) == 4, but all fields must align at a 4 byte (32-bit) boundary.
Branching is done by converting conditionals into a combination of j, b, and bne instructions.
Struct/Array access and assignment is done using lw and sw instructions and considering an offset.
Function calls are done using a j label instruction, and pushing $s and $rp onto the stack so they are not overwritten.
Stdlib functions are performed using MIPS syscall.
Local, static variables (that never have an & operation called on them), are represented as virtual registers to allow them to be optimized in register allocation. Only the variables that need to be in memory are ever in memory, the ones that can be handled through registers are, and the stack is avoided when possible.
Now that we have a complete MIPS program using virtual registers, we need to allocate real registers to our program. MIPS provides 18 variable registers $s0–s9, $t0–t7, where s-registers are guaranteed to be saved after a function call. MIPS also includes system registers $sp, $ip, $mul, etc.
In order to keep track of what is being executed when, a Control-Flow Graph (CFG) is created.
Liveness Analysis is performed on the CFG, with each node of the CFG keeping a liveIn and liveOut set. This information is then used to create an Interference Graph, where edges represent a point in the program execution where both virtual registers are alive at the same time.
Note that the CFG nodes are instructions, the Interference Graph nodes are virtual registers. Big difference!
Finally I implemented Chaitin's Algorithm for Graph Colouring, which includes a spill function, where if we were unable to properly colour the interference graph, spill is saved onto the stack.