Compiler internals

Original article.
In this blog post I'm going to dig deeper into the compiler architecture an internals of the current version. This article can be helpful for people interested in programming languages and compiler internals but also for potential BL compiler contributors.

Build pipeline

Compilation process is consist of various compilation stages organized in one big pipeline. Every stage basically do some job and pass the result to next stage. Whole process is driven by builder. The builder creates assembly and then unit for every compiled file. Every unit is then passed to compilation process.

We can use simple hello world program as an example:

    main :: fn () s32 {
        print("Hello world!\n");
        return 0;
    }

1. File loader stage

This one is very first stage which loads passed source file into current compiled unit. There is nothing special about this stage.

2. Lexer stage

The lexer scans every single character in source code and creates tokens. You can use -lex-dump parameter to see the lexer output. Every token is recognized pattern in source code with some meaning (number, identifier, operator, text...). Another important job of the lexer is store information about token position in source code file, this information can be eventually used later when compiler needs to report some error and give a hit to programmer what is wrong and where.

Output of lexer:

    Tokens:
    3: ['identifier' 3:1], [':' 3:6], [':' 3:7], ['fn' 3:9], ['(' 3:12], [')' 3:13], 
       ['identifier' 3:15], ['{' 3:19],
    4: ['identifier' 4:5], ['(' 4:10], ['string' 4:12], [')' 4:27], [';' 4:28],
    5: ['return' 5:5], ['number' 5:12], [';' 5:13],
    6: ['}' 6:1],
    7: ['end' 7:1],

3.Parser stage

In this stage we take created sequence of tokens and organize them into the AST. Use -ast-dump parameter to see output of the parser. The parser defines language syntax and can produce various set of errors when processed source is incorrect. AST code representation can be passed to the next stage only when everything was parsed correctly.

Output of parser:

    UBlock <IMPLICIT> 0x7fc824814a50 test.bl
      DeclEntity <3:1> 0x7fc824814ae0 'main' 'immutable'
        ExprLitFn <3:9> 0x7fc824814b20
          Block <3:19> 0x7fc824814c40
            ExprCall <4:10> 0x7fc824814d20
              ExprRef <4:5> 0x7fc824814cd0 'print'
              ExprLitString <4:12> 0x7fc824814d60 Hello world!
            StmtReturn <5:5> 0x7fc824814db0
              ExprLitInt <5:12> 0x7fc824814df0 0

4. MIR generation stage

This stage takes care of MIR generation from AST. The MIR(Middle Intermediate Representation) is an internal bytecode language similar to LLVM IR. MIR code is generated here, main goal of this stage is to make output code more flat and remove all syntax sugar. Basically every single language feature is split into quite small MIR instruction set. One important thing about MIR language is data type processing, we don't have any relations between types and values yet (this is done in analyze stage later) but we can use types in the same way as values here.

Let's say we are going to declare variable:

1 2	my_variable: s32;

Then my_variable is some declaration of type named s32, we know that s32 should be a type, but we don't have any other information in this stage. MIR generator produce simple sequence of instructions for this declaration:

    /* syntax: %<id> <return type> <instruction name> [args] */
    ...
    %16           <unknown> declref s32            /* find declaration named 's32' */
    %17                void decl my_variable : %16 /* declare new symbol 'my_variable' of type %16*/
    ...

Now we have skeleton of this declaration, in next stage we can execute instruction declref to get reference to s32 and create new variable.

MIR code can be exported into the file by -emit-mir compiler parameter, our non-analyzed hello world looks like this:

    @main : <unknown> : {
    /* This is entry bacic block.*/
    %entry_7: 
        /* Declaration of temporary for return value. */
        %9                 void decl .ret.0 : <unknown>
        
        /* Hello world text as constant string. */
        %12              string const {13, "Hello world!"} 
        
        /* We want to find 'print' function.*/
        %13           <unknown> declref print 
        
        /* Call the print. */
        %14           <unknown> call %13(%12)
        
        /* 
         * This is bolerplate generated by 'return 0;'.
         * We store constant 0 into the ret.0 temporary and continue to 'exit_8' block.
         */
        %15                 s32 const 0 
        %16           <unknown> declref %9 
        %17                void store %15 -> %16
        %18                void br %exit_8 

    %exit_8:
        %10           <unknown> declref %9
        %11                void ret %10
    }

    /* 
     * This is type resolver function produced for 'fn () s32' type of the 'main' function. 
     * It's never going to be exported to the final executable, but it's executed in next
     * stage. Return type of this function is Type and value is actual function type of
     * the 'main'. It looks like an over-engineering to have separate function for this, but
     * keep in mind that function types can be more complex...
     */
    @.type : fn() type : {
    %entry_2:
        %3            <unknown> declref s32
        %4                 type const fn () %3
        %5                 void ret %4
    }

5. Analyze stage

This is the most important and most complicated stage of the compiler. Main goal of analyze stage is resolve all data types, variables and functions and connect everything together... Analyzer operates on MIR code from previous stage. It iterates over instructions one by one and call analyze_instr function. There are more possible results of this function call defined in following enum:

    typedef enum {
	    /* Analyze pass failed. */
	    ANALYZE_FAILED = 0,

	    /* Analyze pass passed. */
	    ANALYZE_PASSED = 1,

	    /* Analyze pass cannot be done because some of sub-parts has not been
	    analyzed yet and probably needs to be executed during analyze pass. In
	    such case we push analyzed instruction at the end of analyze queue. */
	    ANALYZE_POSTPONE = 2,

	    /* In this case AnalyzeResult will contain hash of desired symbol which be satisfied later,
	    instruction is pushed into waiting table. */
	    ANALYZE_WAITING = 3,
    } AnalyzeState;

All analyzed instructions are organized into analyze queue and analyzer choose what to do based on analyze result. When result is ANALYZE_PASSED we can simply remove instruction from queue, but when analyze finishes with ANALYZE_POSTPONE for example, we put current instruction at the end of the queue. With the last state ANALYZE_WAITING analyzer must get also hash of the desired symbol which is needed to complete analyze pass of current instruction. In such case we don't put current instruction back to the queue but store it into hash table of waiting queues assigned to the missing symbol. Later when missing symbol appears as completed every waiting instruction is going to be analyzed again.

Our hello world after analyze stage passed looks like this:

    /* analyzed */
    @main : fn() s32 : {
    %entry_7:
        %9                 void decl .ret.0 : s32
        %8882            ...Any vargs Any {}
        %14                 s32 call @print({13, "Hello world!"}, %8882)
        %16                *s32 declref %9 /* direct */
        %17                void store 0 -> %16
        %18                void br %exit_8

    %exit_8:
        %10                *s32 declref %9 /* direct */
        %8883               s32 load %10
        %11                void ret %8883
    } /* comptime */

As you can see all <unknown> types are replaced by actual type. There are also bunch of automatically generated instructions like vargs and load, those are generated during analyze pass. Function print needs format string and vargs as arguments so the input was slightly modified. Also load instruction was generated for ret.

Another interesting thing is our "hello world" string representation here. In previous stage this string was represented as separate string const {13, "Hello world!"} instruction. Analyzer took this instruction and mark it as comptime. Comptime instructions can be directly executed and replaced by execution result constant during analyze since they are directly known and has no side-effects in runtime.

6. VM execution stage

This stage is optional and invoked only when we are going to execute main in compile-time. To do so use -run compiler flag. When MIR bytecode is fully analyzed it can be executed by BL interpreter in this stage even before we produce any native binary or LLVM IR. Calls to external libraries are invoked via dyncall library.

The interpreter use preallocated stack to store all variables and temporary data passed between instructions. For stack data manipulation we use PUSH and POP operations. Execution of our hello world program is not so trivial since we use print function but here you can see begin of main function execution:

      Executing 'main' in compile time...
         -             Terminal  PUSH RA
         -                       PUSH    (4B, 0x1083d1b18) s32
         -                       PUSH    (16B, 0x1083d1b28) ...Any
      8882           InstrVArgs  PUSH    (16B, 0x1083d1b48) ...Any
        14            InstrCall  PUSH RA
        14            InstrCall  PUSH    (4B, 0x1083d1b88) s32
        14            InstrCall  PUSH    (16B, 0x1083d1b98) ...Any
        14            InstrCall  PUSH    (16B, 0x1083d1bb8) string
	...

7. LLVM IR generation stage

This stage produce LLVM IR from MIR code. Our MIR code is designed to be simply translated into LLVM's one, so this stage only iterates over all instructions and generate LLVM equivalents via C API. We're not going to describe LLVM syntax here, but as you can see in example code, the LLVM's IR looks quite similar as MIR. We have main function, two blocks, store, load, etc.

    define i32 @main() {
    entry:
        %0 = alloca i32, align 4
        %1 = alloca { i64, %Any* }, align 8
        %2 = getelementptr inbounds { i64, %Any* }, { i64, %Any* }* %1, i32 0, i32 0
        store i64 0, i64* %2
        %3 = getelementptr inbounds { i64, %Any* }, { i64, %Any* }* %1, i32 0, i32 1
        store %Any* null, %Any** %3
        %4 = alloca %string
        %5 = load { i64, %Any* }, { i64, %Any* }* %1
        store %string 
            { i64 13, i8* getelementptr inbounds ([14 x i8], [14 x i8]* @.str.16, i32 0, i32 0) }, %string* %4
        %6 = bitcast %string* %4 to { i64, i64 }*
        %7 = getelementptr inbounds { i64, i64 }, { i64, i64 }* %6, i32 0, i32 0
        %8 = load i64, i64* %7
        %9 = getelementptr inbounds { i64, i64 }, { i64, i64 }* %6, i32 0, i32 1
        %10 = alloca { i64, %Any* }
        %11 = load i64, i64* %9
        store { i64, %Any* } %5, { i64, %Any* }* %10
        %12 = bitcast { i64, %Any* }* %10 to { i64, i64 }*
        %13 = getelementptr inbounds { i64, i64 }, { i64, i64 }* %12, i32 0, i32 0
        %14 = load i64, i64* %13
        %15 = getelementptr inbounds { i64, i64 }, { i64, i64 }* %12, i32 0, i32 1
        %16 = load i64, i64* %15
        %17 = call i32 @print(i64 %8, i64 %11, i64 %14, i64 %16)
        store i32 0, i32* %0, align 4
        br label %exit

    exit:                                             ; preds = %entry
        %18 = load i32, i32* %0, align 4
        ret i32 %18
    }

8. Write object file stage

In this stage we take generated LLVM IR and produce object file with use of LLVM C API. This file can be later passed to native linker.

9. Liker stage

This is the last stage, here we link everything together. Platform specific linker is called with our generated object file and all its dependencies to produce executable.

Conclusion

Compilers are hard, but when we divide complex problems into smaller ones it's not so hard as it seems to be. However it's still lot of programming.

Useful links

- MIR documentation
- Compiler source