compile optimization

This commit is contained in:
2026-02-10 16:37:11 -06:00
parent 3d71f4a363
commit d75ce916d7
18 changed files with 1528 additions and 113 deletions

View File

@@ -10,12 +10,11 @@ Mcode is a JSON-based intermediate representation that can be interpreted direct
## Pipeline
```
Source → Tokenize → Parse (AST) → Mcode (JSON) → Interpret
→ Compile to Mach (planned)
→ Compile to native (planned)
Source → Tokenize → Parse (AST) → Fold → Mcode (JSON) → Streamline → Interpret
→ QBE → Native
```
Mcode is produced by the `JS_Mcode` compiler pass, which emits a cJSON tree. The mcode interpreter walks this tree directly, dispatching on instruction name strings.
Mcode is produced by `mcode.cm`, which lowers the folded AST to JSON instruction arrays. The streamline optimizer (`streamline.cm`) then eliminates redundant operations. The result can be interpreted by `mcode.c`, or lowered to QBE IL by `qbe_emit.cm` for native compilation. See [Compilation Pipeline](pipeline.md) for the full overview.
## JSMCode Structure
@@ -44,16 +43,37 @@ struct JSMCode {
## Instruction Format
Each instruction is a JSON array. The first element is the instruction name (string), followed by operands:
Each instruction is a JSON array. The first element is the instruction name (string), followed by operands (typically `[op, dest, ...args, line, col]`):
```json
["LOADK", 0, 42]
["ADD", 2, 0, 1]
["JMPFALSE", 3, "else_label"]
["CALL", 0, 2, 1]
["access", 3, 5, 1, 9]
["load_index", 10, 4, 9, 5, 11]
["store_dynamic", 4, 11, 12, 6, 3]
["frame", 15, 14, 1, 7, 7]
["setarg", 15, 0, 16, 7, 7]
["invoke", 15, 13, 7, 7]
```
The instruction set mirrors the Mach VM opcodes — same operations, same register semantics, but with string dispatch instead of numeric opcodes.
### Typed Load/Store
Memory operations come in typed variants for optimization:
- `load_index dest, obj, idx` — array element by integer index
- `load_field dest, obj, key` — record property by string key
- `load_dynamic dest, obj, key` — unknown; dispatches at runtime
- `store_index obj, val, idx` — array element store
- `store_field obj, val, key` — record property store
- `store_dynamic obj, val, key` — unknown; dispatches at runtime
The compiler selects the appropriate variant based on `type_tag` and `access_kind` annotations from parse and fold.
### Decomposed Calls
Function calls are split into separate instructions:
- `frame dest, fn, argc` — allocate call frame
- `setarg frame, idx, val` — set argument
- `invoke frame, result` — execute the call
## Labels

118
docs/spec/pipeline.md Normal file
View File

@@ -0,0 +1,118 @@
---
title: "Compilation Pipeline"
description: "Overview of the compilation stages and optimizations"
---
## Overview
The compilation pipeline transforms source code through several stages, each adding information or lowering the representation toward execution. There are three execution backends: the Mach register VM (default), the Mcode interpreter (debug), and native code via QBE (experimental).
```
Source → Tokenize → Parse → Fold → Mach VM (default)
→ Mcode → Streamline → Mcode Interpreter
→ QBE → Native
```
## Stages
### Tokenize (`tokenize.cm`)
Splits source text into tokens. Handles string interpolation by re-tokenizing template literal contents. Produces a token array with position information (line, column).
### Parse (`parse.cm`)
Converts tokens into an AST. Also performs semantic analysis:
- **Scope records**: For each scope (global, function), builds a record mapping variable names to their metadata: `make` (var/def/function/input), `function_nr`, `nr_uses`, `closure` flag, and `level`.
- **Type tags**: When the right-hand side of a `def` is a syntactically obvious type, stamps `type_tag` on the scope record entry. Derivable types: `"integer"`, `"number"`, `"text"`, `"array"`, `"record"`, `"function"`, `"logical"`, `"null"`.
- **Intrinsic resolution**: Names used but not locally bound are recorded in `ast.intrinsics`. Name nodes referencing intrinsics get `intrinsic: true`.
- **Access kind**: Subscript (`[`) nodes get `access_kind`: `"index"` for numeric subscripts, `"field"` for string subscripts, omitted otherwise.
- **Tail position**: Return statements where the expression is a call get `tail: true`.
### Fold (`fold.cm`)
Operates on the AST. Performs constant folding and type analysis:
- **Constant folding**: Evaluates arithmetic on known constants at compile time (e.g., `5 + 10` becomes `15`).
- **Constant propagation**: Tracks `def` bindings whose values are known constants.
- **Type propagation**: Extends `type_tag` through operations. When both operands of an arithmetic op have known types, the result type is known. Propagates type tags to reference sites.
- **Intrinsic specialization**: When an intrinsic call's argument types are known, stamps a `hint` on the call node. For example, `length(x)` where x is a known array gets `hint: "array_length"`. Type checks like `is_array(known_array)` are folded to `true`.
- **Purity marking**: Stamps `pure: true` on expressions with no side effects (literals, name references, arithmetic on pure operands).
- **Dead code elimination**: Removes unreachable branches when conditions are known constants.
### Mcode (`mcode.cm`)
Lowers the AST to a JSON-based intermediate representation with explicit operations. Key design principle: **every type check is an explicit instruction** so downstream optimizers can see and eliminate them.
- **Typed load/store**: Emits `load_index` (array by integer), `load_field` (record by string), or `load_dynamic` (unknown) based on type information from fold.
- **Decomposed calls**: Function calls are split into `frame` (create call frame) + `setarg` (set arguments) + `invoke` (execute call).
- **Intrinsic access**: Intrinsic functions are loaded via `access` with an intrinsic marker rather than global lookup.
See [Mcode IR](mcode.md) for instruction format details.
### Streamline (`streamline.cm`)
Optimizes the Mcode IR. Operates per-function:
- **Redundant instruction elimination**: Removes no-op patterns and redundant moves.
- **Dead code removal**: Eliminates instructions whose results are never used.
- **Type-based narrowing**: When type information is available, narrows `load_dynamic`/`store_dynamic` to typed variants.
### QBE Emit (`qbe_emit.cm`)
Lowers optimized Mcode IR to QBE intermediate language for native code compilation. Each Mcode function becomes a QBE function that calls into the cell runtime (`cell_rt_*` functions) for operations that require the runtime (allocation, intrinsic dispatch, etc.).
String constants are interned in a data section. Integer constants are NaN-boxed inline.
### QBE Macros (`qbe.cm`)
Provides operation implementations as QBE IL templates. Each arithmetic, comparison, and type operation is defined as a function that emits the corresponding QBE instructions, handling type dispatch (integer, float, text paths) with proper guard checks.
## Execution Backends
### Mach VM (default)
Binary 32-bit register VM. Used for production execution and bootstrapping.
```
./cell script.ce
```
### Mcode Interpreter
JSON-based interpreter. Used for debugging the compilation pipeline.
```
./cell --mcode script.ce
```
### QBE Native (experimental)
Generates QBE IL that can be compiled to native code.
```
./cell --emit-qbe script.ce > output.ssa
```
## Files
| File | Role |
|------|------|
| `tokenize.cm` | Lexer |
| `parse.cm` | Parser + semantic analysis |
| `fold.cm` | Constant folding + type analysis |
| `mcode.cm` | AST → Mcode IR lowering |
| `streamline.cm` | Mcode IR optimizer |
| `qbe_emit.cm` | Mcode IR → QBE IL emitter |
| `qbe.cm` | QBE IL operation templates |
| `internal/bootstrap.cm` | Pipeline orchestrator |
## Test Files
| File | Tests |
|------|-------|
| `parse_test.ce` | Type tags, access_kind, intrinsic resolution |
| `fold_test.ce` | Type propagation, purity, intrinsic hints |
| `mcode_test.ce` | Typed load/store, decomposed calls |
| `streamline_test.ce` | Optimization counts, IR before/after |
| `qbe_test.ce` | End-to-end QBE IL generation |