Merge branch 'optimize_mcode'

2026-02-21 19:42:19 -06:00
parent eadad194be 017b63ba80
commit 2d4645da9c
34 changed files with 53477 additions and 121900 deletions
--- a/docs/compiler-tools.md
+++ b/docs/compiler-tools.md
@@ -30,6 +30,10 @@ Each stage has a corresponding CLI tool that lets you see its output.
 | streamline  | `streamline.ce --ir`      | Human-readable canonical IR            |
 | disasm      | `disasm.ce`               | Source-interleaved disassembly          |
 | disasm      | `disasm.ce --optimized`   | Optimized source-interleaved disassembly |
+| diff        | `diff_ir.ce`              | Mcode vs streamline instruction diff   |
+| xref        | `xref.ce`                 | Cross-reference / call creation graph  |
+| cfg         | `cfg.ce`                  | Control flow graph (basic blocks)      |
+| slots       | `slots.ce`                | Slot data flow / use-def chains        |
 | all         | `ir_report.ce`            | Structured optimizer flight recorder   |

 All tools take a source file as input and run the pipeline up to the relevant stage.
@@ -141,6 +145,160 @@ Function creation instructions include a cross-reference annotation showing the
  3     function       5, 12                                  :235  ; -> [12] helper_fn
 ```

+## diff_ir.ce
+
+Compares mcode IR (before optimization) with streamline IR (after optimization), showing what the optimizer changed. Useful for understanding which instructions were eliminated, specialized, or rewritten.
+
+```bash
+cell diff_ir <file>                  # diff all functions
+cell diff_ir --fn <N|name> <file>    # diff only one function
+cell diff_ir --summary <file>        # counts only
+```
+
+| Flag | Description |
+|------|-------------|
+| (none) | Show all diffs with source interleaving |
+| `--fn <N\|name>` | Filter to specific function by index or name |
+| `--summary` | Show only eliminated/rewritten counts per function |
+
+### Output Format
+
+Changed instructions are shown in diff style with `-` (before) and `+` (after) lines:
+
+```
+=== [0] <anonymous> (args=1, slots=40) ===
+  17 eliminated, 51 rewritten
+
+  --- line 4: if (n <= 1) { ---
+  - 1     is_int         4, 1                          :4
+  + 1     is_int         3, 1                          :4  (specialized)
+  - 3     is_int         5, 2                          :4
+  + 3     _nop_tc_1                                         (eliminated)
+```
+
+Summary mode gives a quick overview:
+
+```
+  [0] <anonymous>:                       17 eliminated, 51 rewritten
+  [1] <anonymous>:                       65 eliminated, 181 rewritten
+  total: 86 eliminated, 250 rewritten across 4 functions
+```
+
+## xref.ce
+
+Cross-reference / call graph tool. Shows which functions create other functions (via `function` instructions), building a creation tree.
+
+```bash
+cell xref <file>                     # full creation tree
+cell xref --callers <N> <file>       # who creates function [N]?
+cell xref --callees <N> <file>       # what does [N] create/call?
+cell xref --dot <file>               # DOT graph for graphviz
+cell xref --optimized <file>         # use optimized IR
+```
+
+| Flag | Description |
+|------|-------------|
+| (none) | Indented creation tree from main |
+| `--callers <N>` | Show which functions create function [N] |
+| `--callees <N>` | Show what function [N] creates (use -1 for main) |
+| `--dot` | Output DOT format for graphviz |
+| `--optimized` | Use optimized IR instead of raw mcode |
+
+### Output Format
+
+Default tree view:
+
+```
+demo_disasm.cm
+  [0] <anonymous>
+  [1] <anonymous>
+  [2] <anonymous>
+```
+
+Caller/callee query:
+
+```
+Callers of [0] <anonymous>:
+  demo_disasm.cm at line 3
+```
+
+DOT output can be piped to graphviz: `cell xref --dot file.cm | dot -Tpng -o xref.png`
+
+## cfg.ce
+
+Control flow graph tool. Identifies basic blocks from labels and jumps, computes edges, and detects loop back-edges.
+
+```bash
+cell cfg --fn <N|name> <file>        # text CFG for function
+cell cfg --dot --fn <N|name> <file>  # DOT output for graphviz
+cell cfg <file>                      # text CFG for all functions
+cell cfg --optimized <file>          # use optimized IR
+```
+
+| Flag | Description |
+|------|-------------|
+| `--fn <N\|name>` | Filter to specific function by index or name |
+| `--dot` | Output DOT format for graphviz |
+| `--optimized` | Use optimized IR instead of raw mcode |
+
+### Output Format
+
+```
+=== [0] <anonymous> ===
+  B0 [pc 0-2, line 4]:
+    0     access         2, 1
+    1     is_int         4, 1
+    2     jump_false     4, "rel_ni_2"
+    -> B3 "rel_ni_2" (jump)
+    -> B1 (fallthrough)
+
+  B1 [pc 3-4, line 4]:
+    3     is_int         5, 2
+    4     jump_false     5, "rel_ni_2"
+    -> B3 "rel_ni_2" (jump)
+    -> B2 (fallthrough)
+```
+
+Each block shows its ID, PC range, source lines, instructions, and outgoing edges. Loop back-edges (target PC <= source PC) are annotated.
+
+## slots.ce
+
+Slot data flow analysis. Builds use-def chains for every slot in a function, showing where each slot is defined and used. Optionally captures type information from streamline.
+
+```bash
+cell slots --fn <N|name> <file>              # slot summary for function
+cell slots --slot <N> --fn <N|name> <file>   # trace slot N
+cell slots <file>                            # slot summary for all functions
+```
+
+| Flag | Description |
+|------|-------------|
+| `--fn <N\|name>` | Filter to specific function by index or name |
+| `--slot <N>` | Show chronological DEF/USE trace for a specific slot |
+
+### Output Format
+
+Summary shows each slot with its def count, use count, inferred type, and first definition. Dead slots (defined but never used) are flagged:
+
+```
+=== [0] <anonymous> (args=1, slots=40) ===
+  slot    defs    uses    type        first-def
+  s0      0       0       -           (this)
+  s1      0       10      -           (arg 0)
+  s2      1       6       -           pc 0: access
+  s10     1       0       -           pc 29: invoke  <- dead
+```
+
+Slot trace (`--slot N`) shows every DEF and USE in program order:
+
+```
+=== slot 3 in [0] <anonymous> ===
+  DEF  pc 5:     le_int         3, 1, 2                       :4
+  DEF  pc 11:    le_float       3, 1, 2                       :4
+  DEF  pc 17:    le_text        3, 1, 2                       :4
+  USE  pc 31:    jump_false     3, "if_else_0"                :4
+```
+
 ## seed.ce

 Regenerates the boot seed files in `boot/`. These are pre-compiled mcode IR (JSON) files that bootstrap the compilation pipeline on cold start.
--- a/docs/spec/mach.md
+++ b/docs/spec/mach.md
@@ -93,3 +93,13 @@ Arithmetic ops (ADD, SUB, MUL, DIV, MOD, POW) are executed inline without callin
 DIV and MOD check for zero divisor (→ null). POW uses `pow()` with non-finite handling for finite inputs.

 Comparison ops (EQ through GE) and bitwise ops still use `reg_vm_binop()` for their slow paths, as they handle a wider range of type combinations (string comparisons, null equality, etc.).
+
+## String Concatenation
+
+CONCAT uses a three-tier dispatch. The first two tiers handle self-assign patterns (`concat R(A), R(A), R(C)`, where dest equals the left operand); the third handles all other concats:
+
+1. **In-place append**: If `R(A)` is a mutable heap text (S bit clear) with `length + rhs_length <= cap56`, characters are appended directly. Zero allocation, zero GC.
+2. **Growth allocation** (`JS_ConcatStringGrow`): Used when tier 1 does not apply. Allocates a new text with capacity `max(new_length * 2, 16)` and does **not** stone the result, leaving it mutable for subsequent appends.
+3. **Exact-fit stoned** (`JS_ConcatString`): Used when dest differs from the left operand (normal non-self-assign concat).
+
+The `stone_text` instruction (iABC, B=0, C=0) sets the S bit on a mutable heap text in `R(A)`. For non-pointer values or already-stoned text, it is a no-op. This instruction is emitted by the streamline optimizer at escape points; see [Streamline — insert_stone_text](streamline.md#7-insert_stone_text-mutable-text-escape-analysis) and [Stone Memory — Mutable Text](stone.md#mutable-text-concatenation).
--- a/docs/spec/mcode.md
+++ b/docs/spec/mcode.md
@@ -101,6 +101,11 @@ Operands are register slot numbers (integers), constant values (strings, numbers
 | Instruction | Operands | Description |
 |-------------|----------|-------------|
 | `concat` | `dest, a, b` | `dest = a ~ b` (text concatenation) |
+| `stone_text` | `slot` | Stone a mutable text value (see below) |
+
+The `stone_text` instruction is emitted by the streamline optimizer's escape analysis pass (`insert_stone_text`). It freezes a mutable text value before it escapes its defining slot — for example, before a `move`, `setarg`, `store_field`, `push`, or `put`. The instruction is only inserted when the slot is provably `T_TEXT`; non-text values never need stoning. See [Streamline Optimizer — insert_stone_text](streamline.md#7-insert_stone_text-mutable-text-escape-analysis) for details.
+
+At the VM level, `stone_text` is a single-operand instruction (iABC with B=0, C=0). If the slot holds a heap text without the S bit set, it sets the S bit. For all other values (integers, booleans, already-stoned text, etc.), it is a no-op.

 ### Comparison — Integer

--- a/docs/spec/stone.md
+++ b/docs/spec/stone.md
@@ -77,6 +77,30 @@ Messages between actors are stoned before delivery, ensuring actors never share

 Literal objects and arrays that can be determined at compile time may be allocated directly in stone memory.

+## Mutable Text Concatenation
+
+String concatenation in a loop (`s = s + "x"`) is optimized from O(n²) to amortized O(n) total copying by leaving concat results **unstoned** with over-allocated capacity. On the next concatenation, if the destination text is mutable (S bit clear) and has enough room, the VM appends in place with zero allocation.
+
+### How It Works
+
+When the VM executes `concat dest, dest, src` (same destination and left operand — a self-assign pattern):
+
+1. **Inline fast path**: If `dest` holds a heap text, is not stoned, and `length + src_length <= capacity` — append characters in place, update length, done. No allocation, no GC possible.
+
+2. **Growth path** (`JS_ConcatStringGrow`): Allocate a new text with `capacity = max(new_length * 2, 16)`, copy both operands, and return the result **without stoning** it. The 2x growth factor means a loop of N concatenations does O(log N) allocations totaling O(N) character copies.
+
+3. **Exact-fit path** (`JS_ConcatString`): When `dest != left` (not self-assign), the existing exact-fit stoned path is used. This is the normal case for expressions like `var c = a + b`.
+
+### Safety Invariant
+
+**An unstoned heap text is uniquely referenced by exactly one slot.** This is enforced by the `stone_text` mcode instruction, which the [streamline optimizer](streamline.md#7-insert_stone_text-mutable-text-escape-analysis) inserts before any instruction that would create a second reference to the value (move, store, push, setarg, put). Two VM-level guards cover cases where the compiler cannot prove the type: `get` (closure reads) and `return` (inter-frame returns).
+
+### Why Over-Allocation Is GC-Safe
+
+- The copying collector copies based on `cap56` (the object header's capacity field), not `length`. Over-allocated capacity survives GC.
+- `js_alloc_string` zero-fills the packed data region, so padding beyond `length` is always clean.
+- String comparisons, hashing, and interning all use `length`, not `cap56`. Extra capacity is invisible to string operations.
+
 ## Relationship to GC

 The Cheney copying collector only operates on the mutable heap. During collection, when the collector encounters a pointer to stone memory (S bit set), it skips it — stone objects are roots that never move. This means stone memory acts as a permanent root set with zero GC overhead.
--- a/docs/spec/streamline.md
+++ b/docs/spec/streamline.md
@@ -164,7 +164,44 @@ Removes `move a, a` instructions where the source and destination are the same s

 **Nop prefix:** `_nop_mv_`

-### 7. eliminate_unreachable (dead code after return)
+### 7. insert_stone_text (mutable text escape analysis)
+
+Inserts `stone_text` instructions before mutable text values escape their defining slot. This pass supports the mutable text concatenation optimization (see [Stone Memory — Mutable Text](stone.md#mutable-text-concatenation)), which leaves `concat` results unstoned with excess capacity so that subsequent `s = s + x` can append in-place.
+
+The invariant is: **an unstoned heap text is uniquely referenced by exactly one slot.** This pass ensures that whenever a text value is copied or shared (via move, store, push, function argument, closure write, etc.), it is stoned first.
+
+**Algorithm:**
+
+1. **Compute liveness.** Build `first_ref[slot]` and `last_ref[slot]` arrays by scanning all instructions. Extend live ranges for backward jumps (loops): if a backward jump targets label L at position `lpos`, every slot referenced between `lpos` and the jump has its `last_ref` extended to the jump position.
+
+2. **Forward walk with type tracking.** Walk instructions using `track_types` to maintain per-slot types. At each escape point, if the escaping slot is provably `T_TEXT`, insert `stone_text slot` before the instruction.
+
+3. **Move special case.** For `move dest, src` at instruction index `i`: only insert `stone_text src` if the source is `T_TEXT` **and** `last_ref[src] > i` (the source slot is still live after the move, meaning both slots alias the same text). If the source is dead after the move, the value transfers uniquely, so no stoning is needed.
+
+**Escape points and the slot that gets stoned:**
+
+| Instruction | Stoned slot | Why it escapes |
+|---|---|---|
+| `move` | source (if still live) | Two slots alias the same value |
+| `store_field` | value | Stored to object property |
+| `store_index` | value | Stored to array element |
+| `store_dynamic` | value | Dynamic property store |
+| `push` | value | Pushed to array |
+| `setarg` | value | Passed as function argument |
+| `put` | source | Written to outer closure frame |
+
+**Not handled by this pass** (handled by VM guards instead):
+
+| Instruction | Reason |
+|---|---|
+| `get` (closure read) | Value arrives from outer frame; type may be T_UNKNOWN at compile time |
+| `return` | Return value's type may be T_UNKNOWN; VM stones at inter-frame boundary |
+
+These two cases use runtime `stone_mutable_text` guards in the VM because the streamline pass cannot always prove the slot type across frame boundaries.
+
+**Nop prefix:** none (inserts instructions, does not create nops)
+
+### 8. eliminate_unreachable (dead code after return)

 Nops instructions after `return` until the next real label. Only `return` is treated as a terminal instruction; `disrupt` is not, because the disruption handler code immediately follows `disrupt` and must remain reachable.

@@ -172,13 +209,13 @@ The mcode compiler emits a label at disruption handler entry points (see `emit_l

 **Nop prefix:** `_nop_ur_`

-### 8. eliminate_dead_jumps (jump-to-next-label elimination)
+### 9. eliminate_dead_jumps (jump-to-next-label elimination)

 Removes `jump L` instructions where `L` is the immediately following label (skipping over any intervening nop strings). These are common after other passes eliminate conditional branches, leaving behind jumps that fall through naturally.

 **Nop prefix:** `_nop_dj_`

-### 9. diagnose_function (compile-time diagnostics)
+### 10. diagnose_function (compile-time diagnostics)

 Optional pass that runs when `_warn` is set on the mcode input. Performs a forward type-tracking scan and emits diagnostics for provably wrong operations. Diagnostics are collected in `ir._diagnostics` as `{severity, file, line, col, message}` records.

@@ -219,6 +256,7 @@ eliminate_type_checks    → uses param_types + write_types
 simplify_algebra
 simplify_booleans
 eliminate_moves
+insert_stone_text        → escape analysis for mutable text
 eliminate_unreachable
 eliminate_dead_jumps
 diagnose_function        → optional, when _warn is set
@@ -286,7 +324,9 @@ move 2, 7           // i = temp
 subtract 2, 2, 6    // i = i - 1 (direct)
 ```

-The `+` operator is excluded from target slot propagation when it would use the full text+num dispatch (i.e., when neither operand is a known number), because writing both `concat` and `add` to the variable's slot would pollute its write type. When the known-number shortcut applies, `+` uses `emit_numeric_binop` and would be safe for target propagation, but this is not currently implemented — the exclusion is by operator kind, not by dispatch path.
+The `+` operator uses target slot propagation when the target slot equals the left operand (`target == left_slot`), i.e. for self-assign patterns like `s = s + x`. In this case both `concat` and `add` write to the same slot that already holds the left operand, so write-type pollution is acceptable — the value is being updated in place. For other cases (target differs from left operand), `+` still allocates a temp to avoid polluting the target slot's write type with both T_TEXT and T_NUM.
+
+This enables the VM's in-place append fast path for string concatenation: when `concat dest, dest, src` has the same destination and left operand, the VM can append directly to a mutable text's excess capacity without allocating.

 ## Debugging Tools

@@ -375,7 +415,7 @@ This was implemented and tested but causes a bootstrap failure during self-hosti

 ### Target Slot Propagation for Add with Known Numbers

-When the known-number add shortcut applies (one operand is a literal number), the generated code uses `emit_numeric_binop` which has a single write path. Target slot propagation should be safe in this case, but is currently blocked by the blanket `kind != "+"` exclusion. Refining the exclusion to check whether the shortcut will apply (by testing `is_known_number` on either operand) would enable direct writes for patterns like `i = i + 1`.
+When the known-number add shortcut applies (one operand is a literal number), the generated code uses `emit_numeric_binop`, which has a single write path. Target slot propagation is already enabled for the self-assign case (`i = i + 1`), but when the target differs from the left operand a temp is still used, even if one operand is a known number. Refining the exclusion to check `is_known_number` on either operand would enable direct writes for the remaining non-self-assign cases like `j = i + 1`.

 ### Forward Type Narrowing from Typed Operations