---
title: "Compilation Pipeline"
description: "Overview of the compilation stages and optimizations"
---

## Overview

The compilation pipeline transforms source code through several stages, each adding information or lowering the representation toward execution. All backends share the same path through mcode and streamline.

```
Source → Tokenize → Parse → Fold → Mcode → Streamline → Machine
```

The final **machine** stage has two targets:

- **Mach VM**: a register-based bytecode interpreter that directly executes the mcode instruction set as compact 32-bit binary
- **Native code**: lowers mcode to QBE or LLVM intermediate language, then compiles to machine code for the target CPU architecture

## Stages

### Tokenize (`tokenize.cm`)

Splits source text into tokens. Handles string interpolation by re-tokenizing template literal contents. Produces a token array with position information (line, column).

### Parse (`parse.cm`)

Converts tokens into an AST. Also performs semantic analysis:

- **Scope records**: For each scope (global, function), builds a record mapping variable names to their metadata: `make` (var/def/function/input), `function_nr`, `nr_uses`, `closure` flag, and `level`.
- **Type tags**: When the right-hand side of a `def` is a syntactically obvious type, stamps `type_tag` on the scope record entry. Derivable types: `"integer"`, `"number"`, `"text"`, `"array"`, `"record"`, `"function"`, `"logical"`. For `def` variables, type tags are also inferred from usage patterns: push (`x[] = v`) implies array, property access (`x.foo = v`) implies record, integer key implies array, text key implies record.
- **Type error detection**: For `def` variables with known type tags, provably wrong operations are reported as compile errors: property access on arrays, push on non-arrays, text keys on arrays, integer keys on records. Only `def` variables are checked because `var` can be reassigned.
- **Intrinsic resolution**: Names used but not locally bound are recorded in `ast.intrinsics`. Name nodes referencing intrinsics get `intrinsic: true`.
- **Access kind**: Subscript (`[`) nodes get `access_kind`: `"index"` for numeric subscripts, `"field"` for string subscripts, omitted otherwise.
- **Tail position**: Return statements where the expression is a call get `tail: true`.

### Fold (`fold.cm`)

Operates on the AST. Performs constant folding and type analysis:

- **Constant folding**: Evaluates arithmetic on known constants at compile time (e.g., `5 + 10` becomes `15`).
- **Constant propagation**: Tracks `def` bindings whose values are known constants.
- **Type propagation**: Extends `type_tag` through operations. When both operands of an arithmetic op have known types, the result type is known. Propagates type tags to reference sites.
- **Intrinsic specialization**: When an intrinsic call's argument types are known, stamps a `hint` on the call node. For example, `length(x)` where x is a known array gets `hint: "array_length"`. Type checks like `is_array(known_array)` are folded to `true`.
- **Purity analysis**: Expressions with no side effects are marked pure (literals, name references, arithmetic on pure operands, calls to pure intrinsics). The pure intrinsic set contains only `is_*` sensory functions; they are the only intrinsics guaranteed to never disrupt regardless of argument types. Other intrinsics like `text`, `number`, and `length` can disrupt on wrong argument types and are excluded.
- **Dead code elimination**: Removes unreachable branches when conditions are known constants. Removes unused `var`/`def` declarations with pure initializers. Removes standalone calls to pure intrinsics where the result is discarded.

### Mcode (`mcode.cm`)

Lowers the AST to a JSON-based intermediate representation with explicit operations. Key design principle: **every type check is an explicit instruction** so downstream optimizers can see and eliminate them.

- **Typed load/store**: Emits `load_index` (array by integer), `load_field` (record by string), or `load_dynamic` (unknown) based on type information from fold.
- **Decomposed calls**: Function calls are split into `frame` (create call frame) + `setarg` (set arguments) + `invoke` (execute call).
- **Intrinsic access**: Intrinsic functions are loaded via `access` with an intrinsic marker rather than global lookup.
- **Intrinsic inlining**: Type-check intrinsics (`is_array`, `is_text`, `is_number`, `is_integer`, `is_logical`, `is_null`, `is_function`, `is_object`, `is_stone`), `length`, and `push` are emitted as direct opcodes instead of frame/setarg/invoke call sequences.
- **Disruption handler labels**: When a function has a disruption handler, a label is emitted before the handler code. This allows the streamline optimizer's unreachable code elimination to safely nop dead code after `return` without accidentally eliminating the handler.
- **Tail call marking**: When a return statement's expression is a call and the function has no disruption handler, the final `invoke` is renamed to `tail_invoke`. This marks the call site for future tail call optimization. Functions with disruption handlers cannot use TCO because the handler frame must remain on the stack.

See [Mcode IR](mcode.md) for the instruction format and complete instruction reference.

### Streamline (`streamline.cm`)

Optimizes the Mcode IR through a series of independent passes. Operates per-function:

1. **Backward type inference**: Infers parameter types from how they are used in typed operators (`add_int`, `store_index`, `load_field`, `push`, `pop`, etc.). Immutable `def` parameters keep their inferred type across label join points.
2. **Write-type invariance**: Determines which local slots have a consistent write type across all instructions. Slots written by child closures (via `put`) are excluded (forced to unknown).
3. **Type-check elimination**: When a slot's type is known, eliminates `is_` + conditional jump pairs. Narrows `load_dynamic`/`store_dynamic` to typed variants.
4. **Algebraic simplification**: Rewrites identity operations (add 0, multiply 1, divide 1) and folds same-slot comparisons.
5. **Boolean simplification**: Fuses `not` + conditional jump into a single jump with inverted condition.
6. **Move elimination**: Removes self-moves (`move a, a`).
7. **Unreachable elimination**: Nops dead code after `return` until the next label.
8. **Dead jump elimination**: Removes jumps to the immediately following label.

See [Streamline Optimizer](streamline.md) for detailed pass descriptions.

### Machine

The streamlined mcode is lowered to a machine target for execution.

#### Mach VM (default)

The Mach VM is a register-based virtual machine that directly interprets the mcode instruction set as 32-bit binary bytecode. The Mach serializer (`mach.c`) converts streamlined mcode JSON into compact 32-bit instructions with a constant pool. Since the mach bytecode is a direct encoding of the mcode, the [Mcode IR](mcode.md) reference serves as the authoritative instruction set documentation.

```
pit script.ce
```

#### Native Code (QBE / LLVM)

Lowers the streamlined mcode to QBE or LLVM intermediate language for compilation to native machine code. Each mcode function becomes a native function that calls into the ƿit runtime (`cell_rt_*` functions) for operations that require the runtime (allocation, intrinsic dispatch, etc.). String constants are interned in a data section. Integer constants are encoded inline.

```
pit --emit-qbe script.ce > output.ssa
```

## Boot Seeds

The `boot/` directory contains pre-compiled mcode IR (JSON) seed files for the pipeline modules:

```
boot/tokenize.cm.mcode
boot/parse.cm.mcode
boot/fold.cm.mcode
boot/mcode.cm.mcode
boot/streamline.cm.mcode
boot/bootstrap.cm.mcode
```

Seeds are used during cold start (empty cache) to compile the pipeline modules from source. The engine's `load_pipeline_module()` hashes the **source file** content: if the source changes, the hash changes, the cache misses, and the module is recompiled from source using the boot seeds. This means:

- Editing a pipeline module (e.g. `tokenize.cm`) takes effect on the next run automatically
- Seeds only need regenerating when the pipeline changes in such a way that the existing seeds can't compile the new source, or before distribution
- Use `pit seed` to regenerate all seeds, and `pit seed --clean` to also clear the build cache

## Files

| File | Role |
|------|------|
| `tokenize.cm` | Lexer |
| `parse.cm` | Parser + semantic analysis |
| `fold.cm` | Constant folding + type analysis |
| `mcode.cm` | AST → Mcode IR lowering |
| `streamline.cm` | Mcode IR optimizer |
| `qbe_emit.cm` | Mcode IR → QBE IL emitter |
| `qbe.cm` | QBE IL operation templates |
| `internal/bootstrap.cm` | Cache seeder (cold start only) |
| `internal/engine.cm` | Self-sufficient pipeline loader and orchestrator |

## Debug Tools

| File | Purpose |
|------|---------|
| `mcode.ce --pretty` | Print raw Mcode IR before streamlining |
| `streamline.ce --types` | Print streamlined IR with type annotations |
| `streamline.ce --stats` | Print IR after streamlining with before/after stats |

## Test Files

| File | Tests |
|------|-------|
| `parse_test.ce` | Type tags, access_kind, intrinsic resolution |
| `fold_test.ce` | Type propagation, purity, intrinsic hints |
| `mcode_test.ce` | Typed load/store, decomposed calls |
| `streamline_test.ce` | Optimization counts, IR before/after |
| `qbe_test.ce` | End-to-end QBE IL generation |
| `test_intrinsics.cm` | Inlined intrinsic opcodes (is_array, length, push, etc.) |
| `test_backward.cm` | Backward type propagation for parameters |
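
Three of the Streamline passes described earlier (move elimination, unreachable elimination, and dead jump elimination) are simple enough to sketch in a few lines. The Python below is an illustration only, not the code in `streamline.cm`: it assumes instructions are lists such as `["move", dst, src]` and labels are bare strings, which is a made-up shape rather than the real Mcode IR format, and all function names are hypothetical.

```python
# Assumed, simplified IR: an instruction is a list like ["move", dst, src];
# a label is a plain string. This is NOT the real Mcode IR encoding.

def is_label(instr):
    return isinstance(instr, str)

def eliminate_self_moves(code):
    """Move elimination: drop `move a, a`."""
    return [i for i in code
            if not (not is_label(i) and i[0] == "move" and i[1] == i[2])]

def nop_unreachable(code):
    """Unreachable elimination: nop everything after `return` until the next label."""
    out = []
    dead = False
    for instr in code:
        if is_label(instr):
            dead = False          # a label may be a jump target, so code is live again
            out.append(instr)
        elif dead:
            out.append(["nop"])   # keep the slot, replace the instruction
        else:
            out.append(instr)
            if instr[0] == "return":
                dead = True
    return out

def eliminate_dead_jumps(code):
    """Dead jump elimination: drop a jump whose target is the very next label."""
    out = []
    for idx, instr in enumerate(code):
        nxt = code[idx + 1] if idx + 1 < len(code) else None
        if (not is_label(instr) and instr[0] == "jump"
                and is_label(nxt) and instr[1] == nxt):
            continue
        out.append(instr)
    return out

def streamline_sketch(code):
    """Run the three passes in sequence on one function body."""
    for run_pass in (eliminate_self_moves, nop_unreachable, eliminate_dead_jumps):
        code = run_pass(code)
    return code

example = [
    ["move", "a", "a"],   # self-move, removed
    ["jump", "done"],     # jumps to the next label, removed
    "done",
    ["return", "a"],
    ["move", "b", "a"],   # dead after return, becomes nop
    "handler",            # label keeps the code below alive
    ["return", "b"],
]

for instr in streamline_sketch(example):
    print(instr)
```

The `handler` label in the example mirrors the reason Mcode emits a label before a disruption handler: in this sketch, as in the pass description, a label ends the dead region, so the instructions after it are left untouched.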