| title | description |
|---|---|
| Compilation Pipeline | Overview of the compilation stages and optimizations |
Overview
The compilation pipeline transforms source code through several stages, each adding information or lowering the representation toward execution. All backends share the same path through mcode and streamline.
Source → Tokenize → Parse → Fold → Mcode → Streamline → Machine
The final machine stage has two targets:
- Mach VM: a register-based bytecode interpreter that directly executes the mcode instruction set, encoded as compact 32-bit binary instructions
- Native code: lowers mcode to QBE or LLVM intermediate language, then compiles it to machine code for the target CPU architecture
Stages
Tokenize (tokenize.cm)
Splits source text into tokens. Handles string interpolation by re-tokenizing template literal contents. Produces a token array with position information (line, column).
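As a rough illustration of position tracking, the following Python sketch produces tokens carrying line and column information. It is not the code in tokenize.cm: the token kinds, the regular expression, and the `Token` type are assumptions made for this example, and string interpolation is not handled.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    kind: str
    text: str
    line: int    # 1-based line number
    column: int  # 1-based column number


# Hypothetical token grammar, for illustration only.
TOKEN_RE = re.compile(r"""
    (?P<number>\d+(?:\.\d+)?)
  | (?P<name>[A-Za-z_]\w*)
  | (?P<text>"[^"\n]*")
  | (?P<newline>\n)
  | (?P<space>[ \t]+)
  | (?P<punct>[^\s\w])
""", re.VERBOSE)


def tokenize(source):
    """Split source into tokens, recording where each one starts."""
    tokens = []
    line, line_start = 1, 0
    for m in TOKEN_RE.finditer(source):
        kind = m.lastgroup
        if kind == "newline":
            # Columns restart after every newline.
            line += 1
            line_start = m.end()
        elif kind != "space":
            column = m.start() - line_start + 1
            tokens.append(Token(kind, m.group(), line, column))
    return tokens
```

Calling `tokenize('def x = 5\nvar y = "hi"')` yields eight tokens; the last one is the text literal at line 2, column 9.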
Parse (parse.cm)
Converts tokens into an AST. Also performs semantic analysis:
- Scope records: For each scope (global, function), builds a record mapping variable names to their metadata: `make` (var/def/function/input), `function_nr`, `nr_uses`, `closure` flag, and `level`.
- Type tags: When the right-hand side of a `def` is a syntactically obvious type, stamps `type_tag` on the scope record entry. Derivable types: `"integer"`, `"number"`, `"text"`, `"array"`, `"record"`, `"function"`, `"logical"`. For `def` variables, type tags are also inferred from usage patterns: push (`x[] = v`) implies array, property access (`x.foo = v`) implies record, integer key implies array, text key implies record.
- Type error detection: For `def` variables with known type tags, provably wrong operations are reported as compile errors: property access on arrays, push on non-arrays, text keys on arrays, integer keys on records. Only `def` variables are checked because `var` can be reassigned.
- Intrinsic resolution: Names used but not locally bound are recorded in `ast.intrinsics`. Name nodes referencing intrinsics get `intrinsic: true`.
- Access kind: Subscript (`[`) nodes get `access_kind`: `"index"` for numeric subscripts, `"field"` for string subscripts, omitted otherwise.
- Tail position: Return statements where the expression is a call get `tail: true`.
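The usage-based inference and the type error rules above can be sketched in Python as follows. The usage names (`"push"`, `"property"`, `"integer_key"`, `"text_key"`) and both function names are hypothetical labels chosen for this example; the real parser works on AST nodes and scope records rather than on lists of strings.

```python
# Usage pattern -> implied type tag, following the rules listed above.
USAGE_IMPLIES = {
    "push": "array",         # x[] = v
    "property": "record",    # x.foo = v
    "integer_key": "array",
    "text_key": "record",
}


def infer_type_tag(usages):
    """Return the tag implied by the first recognized usage, or None."""
    for usage in usages:
        if usage in USAGE_IMPLIES:
            return USAGE_IMPLIES[usage]
    return None


def find_type_errors(type_tag, usages):
    """Report provably wrong operations on a def variable with a known tag."""
    if type_tag is None:
        return []  # unknown type: nothing can be proven wrong
    errors = []
    for usage in usages:
        if usage == "push" and type_tag != "array":
            errors.append("push on non-array")
        elif usage == "property" and type_tag == "array":
            errors.append("property access on array")
        elif usage == "text_key" and type_tag == "array":
            errors.append("text key on array")
        elif usage == "integer_key" and type_tag == "record":
            errors.append("integer key on record")
    return errors
```

The sketch applies only to `def` bindings; a `var` could be reassigned to a value of another type, so no conclusion would be safe.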
Fold (fold.cm)
Operates on the AST. Performs constant folding and type analysis:
- Constant folding: Evaluates arithmetic on known constants at compile time (e.g., `5 + 10` becomes `15`).
- Constant propagation: Tracks `def` bindings whose values are known constants.
- Type propagation: Extends `type_tag` through operations. When both operands of an arithmetic op have known types, the result type is known. Propagates type tags to reference sites.
- Intrinsic specialization: When an intrinsic call's argument types are known, stamps a `hint` on the call node. For example, `length(x)` where x is a known array gets `hint: "array_length"`. Type checks like `is_array(known_array)` are folded to `true`.
- Purity analysis: Expressions with no side effects are marked pure (literals, name references, arithmetic on pure operands, calls to pure intrinsics). The pure intrinsic set contains only `is_*` sensory functions, because they are the only intrinsics guaranteed to never disrupt regardless of argument types. Other intrinsics like `text`, `number`, and `length` can disrupt on wrong argument types and are excluded.
- Dead code elimination: Removes unreachable branches when conditions are known constants. Removes unused `var`/`def` declarations with pure initializers. Removes standalone calls to pure intrinsics where the result is discarded.
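A minimal Python sketch of constant folding combined with constant-condition branch removal is given here. The tuple-based AST shape (`(op, left, right)` and `("if", cond, then, else)`) is an assumption for illustration and does not reflect the node format used by fold.cm.

```python
import operator

# Arithmetic operators the sketch knows how to fold.
ARITH = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def is_number(value):
    # Exclude bool, which Python treats as a subclass of int.
    return type(value) in (int, float)


def fold(node):
    """Fold constants bottom-up; names (strings) and literals pass through."""
    if not isinstance(node, tuple):
        return node
    if node[0] == "if":
        _, cond, then_branch, else_branch = node
        cond = fold(cond)
        # A known-constant condition makes one branch unreachable.
        if cond is True:
            return fold(then_branch)
        if cond is False:
            return fold(else_branch)
        return ("if", cond, fold(then_branch), fold(else_branch))
    op, left, right = node
    left, right = fold(left), fold(right)
    if op in ARITH and is_number(left) and is_number(right):
        return ARITH[op](left, right)
    return (op, left, right)
```

For example, `fold(("+", 5, 10))` returns `15`, and `fold(("*", ("+", 1, 2), "x"))` returns `("*", 3, "x")` because `x` is not a known constant.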
Mcode (mcode.cm)
Lowers the AST to a JSON-based intermediate representation with explicit operations. Key design principle: every type check is an explicit instruction so downstream optimizers can see and eliminate them.
- Typed load/store: Emits `load_index` (array by integer), `load_field` (record by string), or `load_dynamic` (unknown) based on type information from fold.
- Decomposed calls: Function calls are split into `frame` (create call frame) + `setarg` (set arguments) + `invoke` (execute call).
- Intrinsic access: Intrinsic functions are loaded via `access` with an intrinsic marker rather than global lookup.
- Intrinsic inlining: Type-check intrinsics (`is_array`, `is_text`, `is_number`, `is_integer`, `is_logical`, `is_null`, `is_function`, `is_object`, `is_stone`), `length`, and `push` are emitted as direct opcodes instead of frame/setarg/invoke call sequences.
- Disruption handler labels: When a function has a disruption handler, a label is emitted before the handler code. This allows the streamline optimizer's unreachable code elimination to safely nop dead code after `return` without accidentally eliminating the handler.
- Tail call marking: When a return statement's expression is a call and the function has no disruption handler, the final `invoke` is renamed to `tail_invoke`. This marks the call site for future tail call optimization. Functions with disruption handlers cannot use TCO because the handler frame must remain on the stack.
See Mcode IR for the instruction format and complete instruction reference.
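To make the decomposed call sequence and tail call marking concrete, here is a hedged Python sketch. The opcode names `frame`, `setarg`, `invoke`, and `tail_invoke` come from the description above; the operand layout of each tuple and the `lower_call` helper are assumptions, not the actual Mcode IR format.

```python
def lower_call(callee, args, is_tail=False, has_disruption_handler=False):
    """Lower one call into a frame / setarg / invoke instruction sequence."""
    # Hypothetical operand layout: (opcode, operands...).
    code = [("frame", callee, len(args))]
    for index, arg in enumerate(args):
        code.append(("setarg", index, arg))
    # A call in tail position is only marked when no disruption handler
    # needs the current frame to stay on the stack.
    if is_tail and not has_disruption_handler:
        code.append(("tail_invoke",))
    else:
        code.append(("invoke",))
    return code
```

With these assumptions, `lower_call("f", ["a", "b"], is_tail=True)` ends in `("tail_invoke",)`, while the same call inside a function with a disruption handler ends in a plain `("invoke",)`.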
Streamline (streamline.cm)
Optimizes the Mcode IR through a series of independent passes. Operates per-function:
- Backward type inference: Infers parameter types from how they are used in typed operators (`add_int`, `store_index`, `load_field`, `push`, `pop`, etc.). Immutable `def` parameters keep their inferred type across label join points.
- Type-check elimination: When a slot's type is known, eliminates `is_<type>` + conditional jump pairs. Narrows `load_dynamic`/`store_dynamic` to typed variants.
- Algebraic simplification: Rewrites identity operations (add 0, multiply 1, divide 1) and folds same-slot comparisons.
- Boolean simplification: Fuses `not` + conditional jump into a single jump with inverted condition.
- Move elimination: Removes self-moves (`move a, a`).
- Unreachable elimination: Nops dead code after `return` until the next label.
- Dead jump elimination: Removes jumps to the immediately following label.
See Streamline Optimizer for detailed pass descriptions.
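Three of the simpler passes (move elimination, unreachable elimination, and dead jump elimination) can be sketched in Python as independent functions over a flat instruction list. The tuple encoding of instructions and the `("nop",)` placeholder are assumptions for this example; streamline.cm operates on the real Mcode IR.

```python
NOP = ("nop",)


def eliminate_self_moves(code):
    """Replace `move a, a` with a nop."""
    return [NOP if ins[0] == "move" and ins[1] == ins[2] else ins
            for ins in code]


def eliminate_unreachable(code):
    """Nop everything after a return until the next label."""
    out, dead = [], False
    for ins in code:
        if ins[0] == "label":
            dead = False  # a label may be a jump target, so code is live again
        out.append(NOP if dead else ins)
        if ins[0] == "return":
            dead = True
    return out


def eliminate_dead_jumps(code):
    """Nop a jump whose target is the label directly after it."""
    out = list(code)
    for i in range(len(out) - 1):
        ins, following = out[i], out[i + 1]
        if ins[0] == "jump" and following[0] == "label" and ins[1] == following[1]:
            out[i] = NOP
    return out


def streamline(code):
    """Run the three passes in sequence."""
    for run_pass in (eliminate_self_moves, eliminate_unreachable,
                     eliminate_dead_jumps):
        code = run_pass(code)
    return code
```

Because unreachable elimination stops at the next label, a label emitted in front of a disruption handler keeps the handler code from being nopped.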
Machine
The streamlined mcode is lowered to a machine target for execution.
Mach VM (default)
The Mach VM is a register-based virtual machine that directly interprets the mcode instruction set as 32-bit binary bytecode. The Mach serializer (mach.c) converts streamlined mcode JSON into compact 32-bit instructions with a constant pool. Since the mach bytecode is a direct encoding of the mcode, the Mcode IR reference serves as the authoritative instruction set documentation.
pit script.ce
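The idea of packing one instruction into a 32-bit word can be illustrated with a short Python sketch. The actual field layout used by the Mach serializer is not described here, so the split into an 8-bit opcode and three 8-bit operands is purely an assumption for illustration.

```python
def encode(opcode, a, b, c):
    """Pack four 8-bit fields into one 32-bit word (hypothetical layout)."""
    for field in (opcode, a, b, c):
        if not 0 <= field <= 0xFF:
            raise ValueError("field does not fit in 8 bits")
    return opcode | (a << 8) | (b << 16) | (c << 24)


def decode(word):
    """Unpack a 32-bit word into (opcode, a, b, c)."""
    return (word & 0xFF,
            (word >> 8) & 0xFF,
            (word >> 16) & 0xFF,
            (word >> 24) & 0xFF)
```

Values too large for an operand field would live in the constant pool and be referenced by index; how the real encoding does this is likewise not specified in this section.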
Native Code (QBE / LLVM)
Lowers the streamlined mcode to QBE or LLVM intermediate language for compilation to native machine code. Each mcode function becomes a native function that calls into the ƿit runtime (cell_rt_* functions) for operations that require the runtime (allocation, intrinsic dispatch, etc.).
String constants are interned in a data section. Integer constants are encoded inline.
pit --emit-qbe script.ce > output.ssa
Boot Seeds
The boot/ directory contains pre-compiled mcode IR (JSON) seed files for the pipeline modules:
boot/tokenize.cm.mcode
boot/parse.cm.mcode
boot/fold.cm.mcode
boot/mcode.cm.mcode
boot/streamline.cm.mcode
boot/bootstrap.cm.mcode
Seeds are used during cold start (empty cache) to compile the pipeline modules from source. The engine's load_pipeline_module() hashes the source file content — if the source changes, the hash changes, the cache misses, and the module is recompiled from source using the boot seeds. This means:
- Editing a pipeline module (e.g. `tokenize.cm`) takes effect on the next run automatically
- Seeds only need regenerating if the pipeline changes in such a way that the existing seeds can't compile the new source, or before distribution
- Use `pit seed` to regenerate all seeds, and `pit seed --clean` to also clear the build cache
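The content-hash cache behavior described above can be sketched in Python as follows. This is not the code of `load_pipeline_module()`: the use of SHA-256, the in-memory dictionary cache, and the `compile_fn` callback are assumptions made to keep the example self-contained.

```python
import hashlib


def cache_key(source_text):
    """Derive the cache key from the source content (hash choice is assumed)."""
    return hashlib.sha256(source_text.encode("utf-8")).hexdigest()


def load_module(source_text, cache, compile_fn):
    """Return the compiled module, recompiling only when the source changed."""
    key = cache_key(source_text)
    if key not in cache:
        # Cache miss: the source is new or was edited, so compile it again.
        cache[key] = compile_fn(source_text)
    return cache[key]
```

Keying on content rather than on a timestamp means any edit produces a different key, which is why a changed pipeline module is picked up on the next run without manual invalidation.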
Files
| File | Role |
|---|---|
| `tokenize.cm` | Lexer |
| `parse.cm` | Parser + semantic analysis |
| `fold.cm` | Constant folding + type analysis |
| `mcode.cm` | AST → Mcode IR lowering |
| `streamline.cm` | Mcode IR optimizer |
| `qbe_emit.cm` | Mcode IR → QBE IL emitter |
| `qbe.cm` | QBE IL operation templates |
| `internal/bootstrap.cm` | Cache seeder (cold start only) |
| `internal/engine.cm` | Self-sufficient pipeline loader and orchestrator |
Debug Tools
| File | Purpose |
|---|---|
| `dump_mcode.cm` | Print raw Mcode IR before streamlining |
| `dump_stream.cm` | Print IR after streamlining with before/after stats |
| `dump_types.cm` | Print streamlined IR with type annotations |
Test Files
| File | Tests |
|---|---|
| `parse_test.ce` | Type tags, access_kind, intrinsic resolution |
| `fold_test.ce` | Type propagation, purity, intrinsic hints |
| `mcode_test.ce` | Typed load/store, decomposed calls |
| `streamline_test.ce` | Optimization counts, IR before/after |
| `qbe_test.ce` | End-to-end QBE IL generation |
| `test_intrinsics.cm` | Inlined intrinsic opcodes (is_array, length, push, etc.) |
| `test_backward.cm` | Backward type propagation for parameters |