---
title: "Compilation Pipeline"
description: "Overview of the compilation stages and optimizations"
---

## Overview

The compilation pipeline transforms source code through several stages, each adding information or lowering the representation toward execution. There are three execution backends: the Mach register VM (default), the Mcode interpreter (debug), and native code via QBE (experimental).

```
Source → Tokenize → Parse → Fold → Mach VM (default)
                                 → Mcode → Streamline → Mcode Interpreter
                                                      → QBE → Native
```

## Stages

### Tokenize (`tokenize.cm`)

Splits source text into tokens. Handles string interpolation by re-tokenizing template literal contents. Produces a token array with position information (line, column).

### Parse (`parse.cm`)

Converts tokens into an AST. Also performs semantic analysis:

- **Scope records**: For each scope (global, function), builds a record mapping variable names to their metadata: `make` (var/def/function/input), `function_nr`, `nr_uses`, `closure` flag, and `level`.
- **Type tags**: When the right-hand side of a `def` is a syntactically obvious type, stamps `type_tag` on the scope record entry. Derivable types: `"integer"`, `"number"`, `"text"`, `"array"`, `"record"`, `"function"`, `"logical"`, `"null"`.
- **Intrinsic resolution**: Names used but not locally bound are recorded in `ast.intrinsics`. Name nodes referencing intrinsics get `intrinsic: true`.
- **Access kind**: Subscript (`[`) nodes get `access_kind`: `"index"` for numeric subscripts, `"field"` for string subscripts, omitted otherwise.
- **Tail position**: Return statements where the expression is a call get `tail: true`.

### Fold (`fold.cm`)

Operates on the AST. Performs constant folding and type analysis:

- **Constant folding**: Evaluates arithmetic on known constants at compile time (e.g., `5 + 10` becomes `15`).
- **Constant propagation**: Tracks `def` bindings whose values are known constants.
- **Type propagation**: Extends `type_tag` through operations.
  When both operands of an arithmetic op have known types, the result type is known. Propagates type tags to reference sites.
- **Intrinsic specialization**: When an intrinsic call's argument types are known, stamps a `hint` on the call node. For example, `length(x)` where `x` is a known array gets `hint: "array_length"`. Type checks like `is_array(known_array)` are folded to `true`.
- **Purity marking**: Stamps `pure: true` on expressions with no side effects (literals, name references, arithmetic on pure operands).
- **Dead code elimination**: Removes unreachable branches when conditions are known constants.

### Mcode (`mcode.cm`)

Lowers the AST to a JSON-based intermediate representation with explicit operations. Key design principle: **every type check is an explicit instruction** so downstream optimizers can see and eliminate them.

- **Typed load/store**: Emits `load_index` (array by integer), `load_field` (record by string), or `load_dynamic` (unknown) based on type information from fold.
- **Decomposed calls**: Function calls are split into `frame` (create call frame) + `setarg` (set arguments) + `invoke` (execute call).
- **Intrinsic access**: Intrinsic functions are loaded via `access` with an intrinsic marker rather than global lookup.

See [Mcode IR](mcode.md) for instruction format details.

### Streamline (`streamline.cm`)

Optimizes the Mcode IR. Operates per-function:

- **Redundant instruction elimination**: Removes no-op patterns and redundant moves.
- **Dead code removal**: Eliminates instructions whose results are never used.
- **Type-based narrowing**: When type information is available, narrows `load_dynamic`/`store_dynamic` to typed variants.

### QBE Emit (`qbe_emit.cm`)

Lowers optimized Mcode IR to QBE intermediate language for native code compilation. Each Mcode function becomes a QBE function that calls into the cell runtime (`cell_rt_*` functions) for operations that require the runtime (allocation, intrinsic dispatch, etc.).
String constants are interned in a data section. Integer constants are NaN-boxed inline.

### QBE Macros (`qbe.cm`)

Provides operation implementations as QBE IL templates. Each arithmetic, comparison, and type operation is defined as a function that emits the corresponding QBE instructions, handling type dispatch (integer, float, text paths) with proper guard checks.

## Execution Backends

### Mach VM (default)

Binary 32-bit register VM. Used for production execution and bootstrapping.

```
./cell script.ce
```

### Mcode Interpreter

JSON-based interpreter. Used for debugging the compilation pipeline.

```
./cell --mcode script.ce
```

### QBE Native (experimental)

Generates QBE IL that can be compiled to native code.

```
./cell --emit-qbe script.ce > output.ssa
```

## Files

| File | Role |
|------|------|
| `tokenize.cm` | Lexer |
| `parse.cm` | Parser + semantic analysis |
| `fold.cm` | Constant folding + type analysis |
| `mcode.cm` | AST → Mcode IR lowering |
| `streamline.cm` | Mcode IR optimizer |
| `qbe_emit.cm` | Mcode IR → QBE IL emitter |
| `qbe.cm` | QBE IL operation templates |
| `internal/bootstrap.cm` | Pipeline orchestrator |

## Test Files

| File | Tests |
|------|-------|
| `parse_test.ce` | Type tags, access_kind, intrinsic resolution |
| `fold_test.ce` | Type propagation, purity, intrinsic hints |
| `mcode_test.ce` | Typed load/store, decomposed calls |
| `streamline_test.ce` | Optimization counts, IR before/after |
| `qbe_test.ce` | End-to-end QBE IL generation |
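
To make the Fold stage more concrete, the following Python sketch shows roughly how constant folding, constant propagation, and purity marking could interact on a tiny expression tree. It is not taken from `fold.cm`: the node layout (`kind`, `op`, `left`, `right`, `value`, `name`), the `fold` function, and the `consts` table are hypothetical names chosen for illustration, and substituting a known `def` value at its reference site is an assumption about how propagation might be applied.

```python
# Illustrative sketch only: this node layout is hypothetical and is not the
# AST that parse.cm produces.

ARITH = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def fold(node, consts):
    """Return a folded copy of node; consts maps def names to known constants."""
    kind = node["kind"]
    if kind == "number":
        # Literals have no side effects.
        return {**node, "pure": True}
    if kind == "name":
        # Constant propagation (assumed form): substitute a known def value.
        if node["name"] in consts:
            return {"kind": "number", "value": consts[node["name"]], "pure": True}
        return {**node, "pure": True}
    if kind == "binary":
        left = fold(node["left"], consts)
        right = fold(node["right"], consts)
        both_const = left["kind"] == "number" and right["kind"] == "number"
        if both_const and node["op"] in ARITH:
            # Constant folding: evaluate the operation at compile time.
            value = ARITH[node["op"]](left["value"], right["value"])
            return {"kind": "number", "value": value, "pure": True}
        # Arithmetic is pure only when both operands are pure.
        return {**node, "left": left, "right": right,
                "pure": left["pure"] and right["pure"]}
    # Anything else (calls, assignments, ...) is conservatively marked impure.
    return {**node, "pure": False}


expr = {
    "kind": "binary",
    "op": "+",
    "left": {"kind": "number", "value": 5},
    "right": {"kind": "number", "value": 10},
}
print(fold(expr, {})["value"])  # prints 15
```

Returning new nodes instead of mutating the input keeps the sketch easy to test. A real pass may well annotate nodes in place, as the `pure: true` stamping described above suggests.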