C Compiler

McGill University 2021

Created a C Compiler, in Java from scratch. This included lexer, parser, AST generator, semantic analyzer, and regalloc. This was an individual project. Some boilerplate code by Christophe Dubach.

Sadly, as this is for a class at McGill, and the course content is static, it is not open source.

Takeaways

A compiler is actually just a massive collection of very straightforward things. There weren't too many crazy algorithms or bits of Java syntax required.

Writing such a large scale program myself gave me practice with abstracting and focusing on different parts of a program at a time, and learning where to look when debugging. In contrast to a larger, group project where it's probably someone else's fault if something was broken outside of the section I'm currently working in, here I had to switch gears if I discovered a bug in something previously written.

Since register allocation is really just graph building and coloring, I got lots of hands-on experience with graph related algorithms. It was also nice to apply theoretical syntax, semantic, and typing rules I've encountered in other classes (although they can be much trickier than they are here).

This project was a big confidence builder in my coding abilities.
Obviously, I learned a lot about how Compilers function behind the scenes.
Surprisingly, I did not learn anything new about C, since I was not actually writing the compiler in C.

What's Going On

Lexer

By far the most simple part of the compiler, the Lexer generates tokens from input text.

The Tokenizer was basically a long switch statement with checks: whitespace had to be ignored, and comment strings recognized.

The Token object contains additional information, like position and character type (char, int).

A simple Scanner was created, feeding data into the Tokenizer, which then created Token objects.

Parser

Verify the syntax of the EBNF Grammar representing a subset of C - "Mini C"

# ::=
        # () grouping
        # [] optional
        # {} zero or more
        # + one or more
        # | alternative
        
        program   ::= (include)* (structdecl)* (vardecl)* (funcdecl)* EOF
        
        include   ::= "#include" STRING_LITERAL
        
        structdecl ::= structtype "{" (vardecl)* "}" ";"   # structure declaration
        
        vardecl   ::= type IDENT ";"
                     | type IDENT "[" INT_LITERAL "]" ";"  # array declaration, e.g. int a[2];
        
        funcdecl  ::= type IDENT "(" params ")" block      # function declaration
        
        type      ::= ("int" | "char" | "void" | structtype) ["*"]
        structtype ::= "struct" IDENT
        
        params    ::= [ type IDENT ("," type IDENT)* ]
        
        stmt      ::= block
                     | "while" "(" exp ")" stmt            # while loop
                     | "if" "(" exp ")" stmt ["else" stmt] # if then else
                     | "return" [exp] ";"
                     | exp "=" exp ";"
                     | exp ";"
        
        block     ::= "{" (vardecl)* (stmt)* "}"
        
        exp       ::= "(" exp ")"
                     | (IDENT | INT_LITERAL)
                     | ("-" | "!") exp
                     | CHARLITERAL
                     | STRING_LITERAL
                     | exp ("*" | "/" | "%" | "+" | "-" | "<" | ">" | "<=" | ">=" | "==" | "!=" | "&&" | "||") exp  # binary operators
                     | arrayaccess | fieldaccess | valueat | addressof | funcall | sizeof | typecast
        
        funcall   ::= IDENT "(" [ exp ("," exp)* ] ")"     # function call
        arrayaccess ::= exp "[" exp "]"                    # array access
        fieldaccess ::= exp "." IDENT                      # structure field member access
        valueat   ::= "*" exp                              # Value at operator (pointer indirection)
        addressof ::= "&" exp                              # Address-of operator
        sizeof    ::= "sizeof" "(" type ")"                # size of type
        typecast  ::= "(" type ")" exp                     # type casting
        
| Precedence | Operator | Description                         | Associativity   |
        |------------|----------|-------------------------------------|-----------------|
        | 1          | ()       | Function call                       | Left-to-right   |
        | 1          | []       | Array subscripting                  | Left-to-right   |
        | 1          | .        | Structure member access             | Left-to-right   |
        | 2          | +        | Unary plus                          | Right-to-left   |
        | 2          | -        | Unary minus                         | Right-to-left   |
        | 2          | (type)   | Type cast                           | Right-to-left   |
        | 2          | *        | Indirection                         | Right-to-left   |
        | 2          | &        | Address of                          | Right-to-left   |
        | 3          | * / %    | Multiplication, division, remainder | Left-to-right   |
        | 4          | + -      | Addition, subtraction               | Left-to-right   |
        | 5          | < <= > >=| Relational operators                | Left-to-right   |
        | 6          | == !=    | Relational operators                | Left-to-right   |
        | 7          | &&       | Logical AND                         | Left-to-right   |
        | 8          | \|\|     | Logical OR                          | Left-to-right   |
        

A recursive descent parser was used...

private Program parseProgram() {
        parseIncludes();
    
        // THESE ARE ALL LOOPS
        List stds = parseStructDecls();
        List vds = parseVarDeclsRep();
        List fds = parseFunDecls();
    
        expect(TokenClass.EOF);
        return new Program(stds, vds, fds);
    }
    

The parser checks syntax and stores parsed info in an AST...

AST

// The program AST node (a list of struct type declarations, list of variable declarations, list of FunDecl definitions)
        Program ::= StructTypeDecl* VarDecl* FunDecl*
        
        // Types
        Type ::= BaseType | PointerType | StructType | ArrayType
        
        BaseType ::= INT | CHAR | VOID
        PointerType ::= Type "*"             // use to represent pointers to other types
        StructType ::= String                // represent a struct type (the String is the name of the declared struct type)
        ArrayType ::= Type int                // Type represents the element type, int represents the number of elements (number of elements)
        
        // Struct declaration
        StructTypeDecl ::= StructType VarDecl*
        
        // Variable declaration
        VarDecl ::= Type String
        
        // FunDecl definition (the String is the name of the FunDecl)
        FunDecl ::= Type String VarDecl* Block
        
        // Expressions
        Expr ::= IntLiteral | StrLiteral | CharLiteral | VarExpr | FunCallExpr 
               | BinOp | ArrayAccessExpr | FieldAccessExpr | ValueAtExpr 
               | AddressOfExpr | SizeOfExpr | TypecastExpr
        
        // Literals
        IntLiteral ::= int                 // int stores the value of the integer
        StrLiteral ::= String              // String stores the value of the String
        CharLiteral ::= char               // char stores the value of the character
        
        // Variable (the String is the name of the variable)
        VarExpr ::= String
        
        // Function call (the String corresponds to the name of the function to call and the Expr* is the list of arguments)
        FunCallExpr ::= String Expr*
        
        // Binary operations
        BinOp ::= Expr Op Expr
        Op ::= ADD | SUB | MUL | DIV | MOD | LT | GT | LE | GE | NE | EQ | OR | AND
        
        // Array access expression : Expr[Expr] (e.g. a[10])
        ArrayAccessExpr ::= Expr Expr       // the first Expr is the array, the second one the index
        
        // Field access expression : Expr.String (e.g. a.b)
        FieldAccessExpr ::= Expr String     // the Expr represents the structure, the String represents the name of the field
        
        // Value at expression : *Expr (e.g. *p)
        ValueAtExpr ::= Expr
        
        // Address of an expression : &Expr (e.g. &a[1])
        AddressOfExpr ::= Expr
        
        // Sizeof expression : sizeof(Type) (e.g. sizeof(int*))
        SizeOfExpr ::= Type
        
        // Typecast expression : (Type)Expr (e.g. (int*)malloc(4))
        TypecastExpr ::= Type Expr
        
        
        // Statements
        Stmt ::= Block | While | If | Assign | Return | ExprStmt
        
        // An expression statement (e.g. x=2;)
        ExprStmt ::= Expr
        
        // While loop statement : while (Expr) Stmt;
        While ::= Expr Stmt
        
        // If statement: if (Expr) Stmt1 else Stmt2; (if the second Stmt is null, this means there is no else part)
        If ::= Expr Stmt [Stmt]
        
        // Assignment statement : Expr = Expr; (e.g. x[5] = 2;)
        Assign ::= Expr Expr
        
        // Return statement : (the Expr is optional)
        Return ::= [Expr]
        
        // Block statement (starts with { and ends with } in the source code)
        Block ::= VarDecl* Stmt*
        

All AST nodes are represented with Java classes...

ast
        ├── AddressOfExpr.java
        ├── ArrayAccessExpr.java
        ├── ArrayType.java
        ├── Assign.java
        ├── ASTNode.java
        ├── ASTPrinter.java
        ├── ASTVisitor.java
        ├── BaseType.java
        ├── BinOp.java
        ├── Block.java
        ├── ChrLiteral.java
        ├── Expr.java
        ├── ExprStmt.java
        ├── FieldAccessExpr.java
        ├── FunCallExpr.java
        ├── FunDecl.java
        ├── If.java
        ├── IntLiteral.java
        ├── Op.java
        ├── PointerType.java
        ├── Program.java
        ├── Return.java
        ├── SizeOfExpr.java
        ├── Stmt.java
        ├── StringLiteral.java
        ├── StructType.java
        ├── StructTypeDecl.java
        ├── Type.java
        ├── TypecastExpr.java
        ├── ValueAtExpr.java
        ├── VarDecl.java
        ├── VarExpr.java
        └── While.java
        
The following analysis was done on the AST: Name Analysis: Ensure that the scoping and visibility rules of the language are respected.

Global and Local Scopes: Identifiers declared in the global scope can be accessed anywhere in the program, A variable declared in a block is accessible in the block and all inner blocks of that block, but not accessible outside the block. In both cases, it is illegal to declare twice the same identifiers in the same current block.

Shadowing: An identifier declared within a given scope has the same name as an identifier declared in an outer scope.

Built-in functions: void print_s(char* s); void print_i(int i); void print_c(char c); char read_c(); int read_i(); void* mcmalloc(int size);

Type Analysis: Using formal typing rules for Mini C to perform static type-checking.

L-value checking: lvalues (left-values) are the only expressions that can appear on the left-hand side of an assignment or as an argument to the address-of operator (&).

Code Generation

After verifying the semantics, code needs to be generated to execute the program. The program is compiled into MIPS in order to use MARS, a cross platform MIPS simulator used for educational purposes.

We will now produce temporary MIPS code, using virtual registers. Virtual registers are representations of real registers, but there is no limit to how many can be used. They can be naively interacted with by saving/loading their contents onto the stack every time they are used, however this is very slow.

Global variables are statically stored in memory using the MIPS .data directive.
Local variables are stored in memory on the stack, using $sp manipulation.

Data padding is an important consideration, as sizeof(char) == 1, sizeof(int) == 4, sizeof(int*) == 4, but all fields must align at a 4 byte (32-bit) boundary.

Branching is done by converting conditionals into a combination of j, b, and bne instructions.

Struct/Array access and assignment is done using lw and sw instructions and considering an offset.

Function calls are done using a j label instruction, and pushing $s and $rp onto the stack so they are not overwritten.

Stdlib functions are performed using MIPS syscall.

Local, static variables (that never have an & operation called on them), are represented as virtual registers to allow them to be optimized in register allocation. Only the variables that need to be in memory are ever in memory, the ones that can be handled through registers are, and the stack is avoided when possible.

Register Allocation

Now that we have a complete MIPS program using virtual registers, we need to allocate real registers to our program. MIPS provides 18 variable registers $s0–s9, $t0–t7, where s-registers are guaranteed to be saved after a function call. MIPS also includes system registers $sp, $ip, $mul, etc.

In order to keep track of what is being executed when, a Control-Flow Graph (CFG) is created.

Liveness Analysis is performed on the CFG, with each node of the CFG keeping a liveIn and liveOut set. This information is then used to create an Interference Graph, where edges represent a point in the program execution where both virtual registers are alive at the same time.

Note that the CFG nodes are instructions, the Interference Graph nodes are virtual registers. Big difference!

Finally I implemented Chaitin's Algorithm for Graph Colouring, which includes a spill function, where if we were unable to properly colour the interference graph, spill is saved onto the stack.