---
title: "Streamline Optimizer"
description: "Mcode IR optimization passes"
---

## Overview

The streamline optimizer (`streamline.cm`) runs a series of independent passes over the Mcode IR to eliminate redundant operations. Each pass is a standalone function that can be enabled, disabled, or reordered. Passes communicate only through the instruction array they mutate in place, replacing eliminated instructions with nop strings (e.g., `_nop_tc_1`).

The optimizer runs after `mcode.cm` generates the IR and before the result is lowered to the Mach VM or emitted as QBE IL.

```
Fold (AST) → Mcode (JSON IR) → Streamline → Mach VM / QBE
```

## Type Lattice

The optimizer tracks a type for each slot in the register file:

| Type | Meaning |
|------|---------|
| `unknown` | No type information |
| `int` | Integer |
| `float` | Floating-point |
| `num` | Number (subsumes int and float) |
| `text` | String |
| `bool` | Logical (true/false) |
| `null` | Null value |
| `array` | Array |
| `record` | Record (object) |
| `function` | Function |
| `blob` | Binary blob |

Subsumption: `int` and `float` both satisfy a `num` check.

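As a rough model of the subsumption rule (a Python sketch, not the actual `streamline.cm` code; the `satisfies` helper and `T_*` names are assumptions made for illustration):

```python
# Hypothetical model of the lattice's subsumption rule.
T_UNKNOWN, T_INT, T_FLOAT, T_NUM, T_TEXT = "unknown", "int", "float", "num", "text"

def satisfies(slot_type, checked_type):
    """True if a slot of slot_type always passes an is_<checked_type> check."""
    if slot_type == checked_type:
        return True
    # int and float are subtypes of num
    return checked_type == T_NUM and slot_type in (T_INT, T_FLOAT)

assert satisfies(T_INT, T_NUM)      # int passes is_num
assert not satisfies(T_NUM, T_INT)  # a num might be float, so the check stays
```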
## Passes

### 1. infer_param_types (backward type inference)

Scans typed operators and generic arithmetic to determine what types their operands must be. For example, `subtract dest, a, b` implies both `a` and `b` are numbers.

When a parameter slot (1..nr_args) is consistently inferred as a single type, that type is recorded. Since parameters are immutable (`def`), the inferred type holds for the entire function and persists across label join points (loop headers, branch targets).

Backward inference rules:

| Operator class | Operand type inferred |
|---|---|
| `add`, `subtract`, `multiply`, `divide`, `modulo`, `pow`, `negate` | T_NUM |
| bitwise ops (`bitand`, `bitor`, `bitxor`, `shl`, `shr`, `ushr`, `bitnot`) | T_INT |
| `concat` | T_TEXT |
| `not`, `and`, `or` | T_BOOL |
| `store_index` (object operand) | T_ARRAY |
| `store_index` (index operand) | T_INT |
| `store_field` (object operand) | T_RECORD |
| `push` (array operand) | T_ARRAY |
| `load_index` (object operand) | T_ARRAY |
| `load_index` (index operand) | T_INT |
| `load_field` (object operand) | T_RECORD |
| `pop` (array operand) | T_ARRAY |

Typed comparison operators (`eq_int`, `lt_float`, `lt_text`, etc.) and typed boolean comparisons (`eq_bool`, `ne_bool`) are excluded from backward inference. These ops always appear inside guard dispatch patterns (`is_type` + `jump_false` + typed op), where mutually exclusive branches use the same slot with different types. Including them would merge conflicting types (e.g., T_INT from `lt_int` + T_FLOAT from `lt_float` + T_TEXT from `lt_text`) into T_UNKNOWN, losing all type information. Only unconditionally executed ops contribute to backward inference.

Note: `add` infers T_NUM even though it is polymorphic (numeric addition or text concatenation). When `add` appears in the IR, both operands have already passed an `is_num` guard, so they are guaranteed to be numeric. The text concatenation path uses `concat` instead.

When a slot appears with conflicting type inferences, the merge widens: INT + FLOAT → NUM, INT + NUM → NUM, FLOAT + NUM → NUM. Incompatible types (e.g., NUM + TEXT) produce `unknown`.

**Nop prefix:** none (analysis only, does not modify instructions)

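The widening merge can be sketched as follows (Python model with hypothetical names; the real logic lives in `streamline.cm`):

```python
# Hypothetical widening merge for conflicting type inferences on one slot.
T_UNKNOWN, T_INT, T_FLOAT, T_NUM, T_TEXT = "unknown", "int", "float", "num", "text"

def merge(a, b):
    """Merge two inferred types for the same slot, widening when they differ."""
    if a == b:
        return a
    numeric = {T_INT, T_FLOAT, T_NUM}
    if a in numeric and b in numeric:
        return T_NUM        # INT + FLOAT -> NUM, INT + NUM -> NUM, ...
    return T_UNKNOWN        # incompatible types (e.g., NUM + TEXT)

assert merge(T_INT, T_FLOAT) == T_NUM
assert merge(T_INT, T_NUM) == T_NUM
assert merge(T_NUM, T_TEXT) == T_UNKNOWN
```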
### 2. infer_slot_write_types (slot write-type invariance)

Scans all instructions to determine which non-parameter slots have a consistent write type. If every instruction that writes to a given slot produces the same type, that type is globally invariant and can safely persist across label join points.

This analysis is sound because:

- `alloc_slot()` in mcode.cm is monotonically increasing — temp slots are never reused
- All local variable declarations must be at function body level and initialized — slots are written before any backward jumps to loop headers
- `move` is conservatively treated as T_UNKNOWN, avoiding unsound transitive assumptions

Write type mapping:

| Instruction class | Write type |
|---|---|
| `int` | T_INT |
| `true`, `false` | T_BOOL |
| `null` | T_NULL |
| `access` | type of literal value |
| `array` | T_ARRAY |
| `record` | T_RECORD |
| `function` | T_FUNCTION |
| `length` | T_INT |
| bitwise ops | T_INT |
| `concat` | T_TEXT |
| `negate` | T_NUM |
| `add`, `subtract`, `multiply`, `divide`, `modulo`, `pow` | T_NUM |
| bool ops, comparisons, `in` | T_BOOL |
| `move`, `load_field`, `load_index`, `load_dynamic`, `pop`, `get` | T_UNKNOWN |
| `invoke`, `tail_invoke` | T_UNKNOWN |

The result is a map of slot→type for slots where all writes agree on a single known type. Parameter slots (1..nr_args) and slot 0 are excluded.

Common patterns this enables:

- **Length variables** (`var len = length(arr)`): written by `length` (T_INT) only → invariant T_INT
- **Boolean flags** (`var found = false; ... found = true`): written by `false` and `true` → invariant T_BOOL
- **Locally-created containers** (`var arr = []`): written by `array` only → invariant T_ARRAY
- **Numeric accumulators** (`var sum = 0; sum = sum - x`): written by `access 0` (T_INT) and `subtract` (T_NUM) → merges to T_NUM

Note: Loop counters using `+` (`var i = 0; i = i + 1`) may not achieve write-type invariance because the `+` operator emits a guard dispatch with both `concat` (T_TEXT) and `add` (T_NUM) paths writing to the same temp slot, producing T_UNKNOWN. However, when one operand is a known number literal, `mcode.cm` emits a numeric-only path (see "Known-Number Add Shortcut" below), avoiding the text dispatch. Other arithmetic ops (`-`, `*`, `/`, `%`, `**`) always emit a single numeric write path and work cleanly with write-type analysis.

**Nop prefix:** none (analysis only, does not modify instructions)

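A minimal model of the scan (Python; the instruction tuples, `WRITE_TYPE` table, and slot layout are simplifying assumptions, not the actual IR encoding):

```python
# Hypothetical sketch of slot write-type invariance. Each instruction is
# modeled as (opcode, dest_slot, ...); unknown opcodes write "unknown".
WRITE_TYPE = {"length": "int", "true": "bool", "false": "bool",
              "concat": "text", "array": "array", "move": "unknown"}

def invariant_write_types(instructions, nr_args):
    slot_types = {}
    for op, dest, *rest in instructions:
        if dest <= nr_args:            # skip slot 0 and parameter slots
            continue
        t = WRITE_TYPE.get(op, "unknown")
        prev = slot_types.get(dest)
        slot_types[dest] = t if prev in (None, t) else "unknown"
    # keep only slots whose every write agrees on a single known type
    return {s: t for s, t in slot_types.items() if t != "unknown"}

# var found = false; ... found = true  -> invariant bool; a move poisons slot 5
code = [("false", 3), ("true", 3), ("length", 4, 2), ("move", 5, 3)]
assert invariant_write_types(code, 2) == {3: "bool", 4: "int"}
```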
### 3. eliminate_type_checks (type-check + jump elimination)

Forward pass that tracks the known type of each slot. When a type check (`is_int`, `is_text`, `is_num`, etc.) is followed by a conditional jump, and the slot's type is already known, the check and jump can be eliminated or converted to an unconditional jump.

Five cases:

- **Known match** (e.g., `is_int` on a slot known to be `int`): both the check and the conditional jump are eliminated (nop'd).
- **Subsumption match** (e.g., `is_num` on a slot known to be `int` or `float`): since `int` and `float` are subtypes of `num`, both the check and jump are eliminated.
- **Subsumption partial** (e.g., `is_int` on a slot known to be `num`): the `num` type could be `int` or `float`, so the check must remain. On fallthrough, the slot narrows to the checked subtype (`int`). This is NOT a mismatch — `num` values can pass an `is_int` check.
- **Known mismatch** (e.g., `is_text` on a slot known to be `int`): the check is nop'd and the conditional jump is rewritten to an unconditional `jump`.
- **Unknown**: the check remains, but on fallthrough, the slot's type is narrowed to the checked type (enabling downstream eliminations).

This pass also reduces `load_dynamic`/`store_dynamic` to `load_field`/`store_field` or `load_index`/`store_index` when the key slot's type is known.

At label join points, all type information is reset except for parameter types from backward inference and write-invariant types from slot write-type analysis.

**Nop prefix:** `_nop_tc_`

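The five cases can be modeled as a small decision function (Python sketch; the `decide` helper and action names are hypothetical, not the actual pass structure):

```python
# Hypothetical decision table for is_<checked> on a slot of known type.
def decide(known, checked):
    """Return the action taken for an is_<checked> check + conditional jump."""
    subtypes = {"num": {"int", "float"}}
    if known == checked or known in subtypes.get(checked, set()):
        return "eliminate"          # known match / subsumption match
    if checked in subtypes.get(known, set()):
        return "keep_and_narrow"    # subsumption partial: num may be int
    if known != "unknown":
        return "jump_always"        # known mismatch: rewrite to plain jump
    return "keep_and_narrow"        # unknown: keep check, narrow on fallthrough

assert decide("int", "int") == "eliminate"        # known match
assert decide("int", "num") == "eliminate"        # subsumption match
assert decide("num", "int") == "keep_and_narrow"  # subsumption partial
assert decide("int", "text") == "jump_always"     # known mismatch
```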
### 4. simplify_algebra (same-slot comparison folding)

Tracks known constant values. Folds same-slot comparisons:

| Pattern | Rewrite |
|---------|---------|
| `eq_* dest, x, x` | `true dest` |
| `le_* dest, x, x` | `true dest` |
| `ge_* dest, x, x` | `true dest` |
| `is_identical dest, x, x` | `true dest` |
| `ne_* dest, x, x` | `false dest` |
| `lt_* dest, x, x` | `false dest` |
| `gt_* dest, x, x` | `false dest` |

**Nop prefix:** none (rewrites in place, does not create nops)

### 5. simplify_booleans (not + jump fusion)

Peephole pass that eliminates unnecessary `not` instructions:

| Pattern | Rewrite |
|---------|---------|
| `not d, x; jump_false d, L` | nop; `jump_true x, L` |
| `not d, x; jump_true d, L` | nop; `jump_false x, L` |
| `not d1, x; not d2, d1` | nop; `move d2, x` |

This is particularly effective on `if (!cond)` patterns, which the compiler generates as `not; jump_false`. After this pass, they become a single `jump_true`.

**Nop prefix:** `_nop_bl_`

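The not + jump fusion can be sketched as a peephole over instruction pairs (Python model; the tuple encoding and nop counter are assumptions, and the `not`/`not` case is omitted for brevity):

```python
# Hypothetical peephole: fuse `not d, x; jump_false/true d, L`.
FLIP = {"jump_false": "jump_true", "jump_true": "jump_false"}

def fuse_not_jumps(code):
    out, n = list(code), 0
    for i in range(len(out) - 1):
        a, b = out[i], out[i + 1]
        if (isinstance(a, tuple) and a[0] == "not"
                and isinstance(b, tuple) and b[0] in FLIP and b[1] == a[1]):
            n += 1
            out[i] = "_nop_bl_" + str(n)            # nop the not
            out[i + 1] = (FLIP[b[0]], a[2], b[2])   # jump on the raw operand
    return out

# if (!cond): not d, x; jump_false d, L  =>  jump_true x, L
code = [("not", 5, 2), ("jump_false", 5, "L1")]
assert fuse_not_jumps(code) == ["_nop_bl_1", ("jump_true", 2, "L1")]
```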
### 6. eliminate_moves (self-move elimination)

Removes `move a, a` instructions where the source and destination are the same slot. These can arise from earlier passes rewriting binary operations into moves.

**Nop prefix:** `_nop_mv_`

### 7. eliminate_unreachable (dead code after return)

Nops instructions after `return` until the next real label. Only `return` is treated as a terminal instruction; `disrupt` is not, because the disruption handler code immediately follows `disrupt` and must remain reachable.

The mcode compiler emits a label at disruption handler entry points (see `emit_label(gen_label("disruption"))` in mcode.cm), which provides the label boundary that stops this pass from eliminating handler code.

**Nop prefix:** `_nop_ur_`

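The scan can be sketched as follows (Python model; the instruction tuples and `("label", name)` encoding are assumptions, not the actual `streamline.cm` representation):

```python
# Hypothetical sketch: nop everything between a return and the next label.
def eliminate_unreachable(code):
    out, dead, n = [], False, 0
    for ins in code:
        if isinstance(ins, str):        # existing nop strings pass through
            out.append(ins)
            continue
        if ins[0] == "label":
            dead = False                # handler entry points stay reachable
        if dead:
            n += 1
            out.append("_nop_ur_" + str(n))
        else:
            out.append(ins)
        if ins[0] == "return":
            dead = True
    return out

code = [("return", 0), ("move", 1, 2), ("label", "L1"), ("move", 3, 4)]
assert eliminate_unreachable(code) == [
    ("return", 0), "_nop_ur_1", ("label", "L1"), ("move", 3, 4)]
```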
### 8. eliminate_dead_jumps (jump-to-next-label elimination)

Removes `jump L` instructions where `L` is the immediately following label (skipping over any intervening nop strings). These are common after other passes eliminate conditional branches, leaving behind jumps that fall through naturally.

**Nop prefix:** `_nop_dj_`

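A sketch of the scan (Python model under the same assumed tuple encoding; nop strings are skipped when looking for the next label):

```python
# Hypothetical sketch: a jump to the label that immediately follows it
# (across nop strings) falls through naturally and can be nop'd.
def eliminate_dead_jumps(code):
    out, n = list(code), 0
    for i, ins in enumerate(out):
        if isinstance(ins, tuple) and ins[0] == "jump":
            j = i + 1
            while j < len(out) and isinstance(out[j], str):  # skip nops
                j += 1
            if j < len(out) and out[j] == ("label", ins[1]):
                n += 1
                out[i] = "_nop_dj_" + str(n)
    return out

code = [("jump", "L2"), "_nop_tc_1", ("label", "L2"), ("return", 0)]
assert eliminate_dead_jumps(code) == [
    "_nop_dj_1", "_nop_tc_1", ("label", "L2"), ("return", 0)]
```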
## Pass Composition

All passes run in sequence in `optimize_function`:

```
infer_param_types       → returns param_types map
infer_slot_write_types  → returns write_types map
eliminate_type_checks   → uses param_types + write_types
simplify_algebra
simplify_booleans
eliminate_moves
eliminate_unreachable
eliminate_dead_jumps
```

Each pass is independent and can be commented out for testing or benchmarking.

## Intrinsic Inlining

Before streamlining, `mcode.cm` recognizes calls to built-in intrinsic functions and emits direct opcodes instead of the generic frame/setarg/invoke call sequence. This reduces a 6-instruction call pattern to a single instruction:

| Call | Emitted opcode |
|------|---------------|
| `is_array(x)` | `is_array dest, src` |
| `is_function(x)` | `is_func dest, src` |
| `is_object(x)` | `is_record dest, src` |
| `is_stone(x)` | `is_stone dest, src` |
| `is_integer(x)` | `is_int dest, src` |
| `is_text(x)` | `is_text dest, src` |
| `is_number(x)` | `is_num dest, src` |
| `is_logical(x)` | `is_bool dest, src` |
| `is_null(x)` | `is_null dest, src` |
| `length(x)` | `length dest, src` |
| `push(arr, val)` | `push arr, val` |

These inlined opcodes have corresponding Mach VM implementations in `mach.c`.

## Unified Arithmetic

Arithmetic operations use generic opcodes: `add`, `subtract`, `multiply`, `divide`, `modulo`, `pow`, `negate`. There are no type-dispatched variants (e.g., no `add_int`/`add_float`).

The Mach VM handles arithmetic inline with a two-tier fast path. Since mcode's type guard dispatch guarantees both operands are numbers by the time arithmetic executes, the VM does not need polymorphic dispatch:

1. **Int-int fast path**: `JS_VALUE_IS_BOTH_INT` → native integer arithmetic with overflow check. If the result fits int32, returns int32; otherwise promotes to float64.
2. **Float fallback**: `JS_ToFloat64` on both operands → native floating-point arithmetic. Non-finite results (infinity, NaN) produce null.

Division and modulo additionally check for a zero divisor (→ null). Power uses `pow()` with non-finite handling.

The legacy `reg_vm_binop()` function remains available for comparison operators and any non-mcode bytecode paths, but arithmetic ops no longer call it.

Bitwise operations (`shl`, `shr`, `ushr`, `bitand`, `bitor`, `bitxor`, `bitnot`) remain integer-only and disrupt if operands are not integers.

The QBE/native backend maps generic arithmetic to helper calls (`qbe.add`, `qbe.sub`, etc.). The vision for the native path is that with sufficient type inference, the backend can unbox proven-numeric values to raw registers, operate directly, and only rebox at boundaries (returns, calls, stores).

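The two tiers can be modeled in Python (the real implementation is C in `mach.c`; here Python ints and floats stand in for tagged JSValues, and `None` stands in for null):

```python
# Hypothetical model of the VM's two-tier arithmetic fast path for add.
import math

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def vm_add(a, b):
    if isinstance(a, int) and isinstance(b, int):   # int-int fast path
        r = a + b
        # overflow check: promote to float64 if the result exceeds int32
        return r if INT32_MIN <= r <= INT32_MAX else float(r)
    r = float(a) + float(b)                         # float fallback
    return r if math.isfinite(r) else None          # non-finite -> null

assert vm_add(1, 2) == 3
assert vm_add(INT32_MAX, 1) == float(2**31)         # overflow promotes
assert vm_add(1.5, 2) == 3.5
assert vm_add(float("inf"), 1) is None              # non-finite -> null
```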
## Known-Number Add Shortcut

The `+` operator is the only arithmetic op that is polymorphic at the mcode level — `emit_add_decomposed` in `mcode.cm` emits a guard dispatch that checks for text (→ `concat`) before numeric (→ `add`). This dual dispatch means the temp slot is written by both `concat` (T_TEXT) and `add` (T_NUM), producing T_UNKNOWN in write-type analysis.

When either operand is a known number literal (e.g., `i + 1`, `x + 0.5`), `emit_add_decomposed` skips the text dispatch entirely and emits `emit_numeric_binop("add")` — a single `is_num` guard + `add` with no `concat` path. This is safe because text concatenation requires both operands to be text; a known number can never participate in concat.

This optimization eliminates 6-8 instructions from the add block (two `is_text` checks, two conditional jumps, `concat`, `jump`) and produces a clean single-type write path that works with write-type analysis.

Other arithmetic ops (`subtract`, `multiply`, etc.) always use `emit_numeric_binop` and never have this problem.

## Target Slot Propagation

For simple local variable assignments (`i = expr`), the mcode compiler passes the variable's register slot as a `target` to the expression compiler. Binary operations that use `emit_numeric_binop` (subtract, multiply, divide, modulo, pow) can write directly to the target slot instead of allocating a temp and emitting a `move`:

```
// Before: i = i - 1
subtract 7, 2, 6    // temp = i - 1
move 2, 7           // i = temp

// After: i = i - 1
subtract 2, 2, 6    // i = i - 1 (direct)
```

The `+` operator is excluded from target slot propagation when it would use the full text+num dispatch (i.e., when neither operand is a known number), because writing both `concat` and `add` to the variable's slot would pollute its write type. When the known-number shortcut applies, `+` uses `emit_numeric_binop` and would be safe for target propagation, but this is not currently implemented — the exclusion is by operator kind, not by dispatch path.

## Debugging Tools

CLI tools inspect the IR at different stages:

- **`cell mcode --pretty`** — prints the raw Mcode IR after `mcode.cm`, before streamlining
- **`cell streamline --stats`** — prints the IR after streamlining, with before/after instruction counts
- **`cell streamline --types`** — prints the streamlined IR with type annotations on each instruction

Usage:

```
cell mcode --pretty <file.ce|file.cm>
cell streamline --stats <file.ce|file.cm>
cell streamline --types <file.ce|file.cm>
```

## Tail Call Marking

When a function's return expression is a call (`stmt.tail == true` from the parser) and the function has no disruption handler, mcode.cm renames the final `invoke` instruction to `tail_invoke`. This is semantically identical to `invoke` in the current Mach VM, but marks the call site for future tail call optimization.

The disruption handler restriction exists because TCO would discard the current frame, but the handler must remain on the stack to catch disruptions from the callee.

`tail_invoke` is handled by the same passes as `invoke` in streamline (type tracking, algebraic simplification) and executes identically in the VM.

## Type Propagation Architecture

Type information flows through three compilation stages, each building on the previous:

### Stage 1: Parse-time type tags (parse.cm)

The parser assigns `type_tag` strings to scope variable entries when the type is syntactically obvious:

- **From initializers**: `def a = []` → `type_tag: "array"`, `def n = 42` → `type_tag: "integer"`, `def r = {}` → `type_tag: "record"`
- **From usage patterns** (def only): `def x = null; x[] = v` infers `type_tag: "array"` from the push. `def x = null; x.foo = v` infers `type_tag: "record"` from property access.
- **Type error detection** (def only): When a `def` variable has a known type_tag, provably wrong operations are compile errors:
  - Property access (`.`) on an array
  - Push (`[]`) on a non-array
  - Text key on an array
  - Integer key on a record

Only `def` (constant) variables participate in type inference and error detection. `var` variables can be reassigned, making their initializer type unreliable.

### Stage 2: Fold-time type propagation (fold.cm)

The fold pass extends type information through the AST:

- **Intrinsic folding**: `is_array(known_array)` folds to `true`. `length(known_array)` gets `hint: "array_length"`.
- **Purity analysis**: Expressions involving only `is_*` intrinsic calls with pure arguments are considered pure. This enables dead code elimination for unused `var`/`def` bindings with pure initializers, and elimination of standalone pure call statements.
- **Dead code**: Unused pure `var`/`def` declarations are removed. Standalone calls to pure intrinsics (where the result is discarded) are removed. Unreachable branches with constant conditions are removed.

The `pure_intrinsics` set currently contains only the `is_*` sensory functions (`is_array`, `is_text`, `is_number`, `is_integer`, `is_function`, `is_logical`, `is_null`, `is_object`, `is_stone`). Other intrinsics like `text`, `number`, and `length` can disrupt on wrong argument types, so they are excluded — removing a call that would disrupt changes observable behavior.

### Stage 3: Streamline-time type tracking (streamline.cm)

The streamline optimizer uses a numeric type lattice (`T_INT`, `T_FLOAT`, `T_TEXT`, etc.) for fine-grained per-instruction tracking:

- **Backward inference** (pass 1): Scans typed operators to infer parameter types. Since parameters are `def` (immutable), inferred types persist across label boundaries.
- **Write-type invariance** (pass 2): Scans all instructions to find local slots where every write produces the same type. These invariant types persist across label boundaries alongside parameter types.
- **Forward tracking** (pass 3): `track_types` follows instruction execution order, tracking the type of each slot. Known-type operations set their destination type (e.g., `concat` → T_TEXT, `length` → T_INT). Generic arithmetic produces T_UNKNOWN. Type checks on unknown slots narrow the type on fallthrough.
- **Type check elimination** (pass 3): When a slot's type is already known, `is_<type>` + conditional jump pairs are eliminated or converted to unconditional jumps.
- **Dynamic access narrowing** (pass 3): `load_dynamic`/`store_dynamic` are narrowed to `load_field`/`store_field` or `load_index`/`store_index` when the key type is known.

Type information resets at label join points (since control-flow merges could bring different types), except for parameter types from backward inference and write-invariant types from slot write-type analysis.

## Future Work

### Copy Propagation

A basic-block-local copy propagation pass would replace uses of a copied variable with its source, enabling further move elimination. An implementation was attempted but encountered an unsolved bug where 2-position instruction operand replacement produces incorrect code during self-hosting (the replacement logic for 3-position instructions works correctly). The root cause is not yet understood. See the project memory files for detailed notes.

### Expanded Purity Analysis

The current purity set is conservative (only `is_*`). It could be expanded by:

- **Argument-type-aware purity**: If all arguments to an intrinsic are known to be the correct types (via type_tag or slot_types), the call cannot disrupt and is safe to eliminate. For example, `length(known_array)` is pure but `length(unknown)` is not.
- **User function purity**: Analyze user-defined function bodies during pre_scan. A function is pure if its body contains only pure expressions and calls to known-pure functions. This requires fixpoint iteration for mutual recursion.
- **Callback-aware purity**: Intrinsics like `filter`, `find`, `reduce`, `some`, `every` are pure if their callback argument is pure.

### Move Type Resolution in Write-Type Analysis

Currently, `move` instructions produce T_UNKNOWN in write-type analysis. This prevents type propagation through moves — e.g., a slot written by `access 0` (T_INT) and `move` from an `add` result (T_NUM) merges to T_UNKNOWN instead of T_NUM.

A two-pass approach would fix this: first compute write types for all non-move instructions, then resolve moves by looking up the source slot's computed type. If the source has a known type, merge it into the destination; if unknown, skip the move (don't poison the destination with T_UNKNOWN).

This was implemented and tested but causes a bootstrap failure during self-hosting convergence. The root cause is not yet understood — the optimizer modifies its own bytecode, and the move resolution changes the type landscape enough to produce different code on each pass, preventing convergence. Further investigation is needed; the fix is correct in isolation but interacts badly with the self-hosting fixed-point iteration.

### Target Slot Propagation for Add with Known Numbers

When the known-number add shortcut applies (one operand is a literal number), the generated code uses `emit_numeric_binop`, which has a single write path. Target slot propagation should be safe in this case, but is currently blocked by the blanket `kind != "+"` exclusion. Refining the exclusion to check whether the shortcut will apply (by testing `is_known_number` on either operand) would enable direct writes for patterns like `i = i + 1`.

### Forward Type Narrowing from Typed Operations

With unified arithmetic (generic `add`/`subtract`/`multiply`/`divide`/`modulo`/`negate` instead of typed variants), this approach is no longer applicable. Typed comparisons (`eq_int`, `lt_float`, etc.) still exist and their operands have known types, but these are already handled by backward inference.

### Guard Hoisting for Parameters

When a type check on a parameter passes (falls through), the parameter's type could be promoted to `param_types` so it persists across label boundaries. This would allow the first type check on a parameter to prove its type for the entire function. However, this is unsound for polymorphic parameters — if a function is called with different argument types, the first check would wrongly eliminate checks for subsequent types.

A safe version would require proving that a parameter is monomorphic (called with only one type across all call sites), which requires interprocedural analysis.

**Note:** For local variables (non-parameters), the write-type invariance analysis (pass 2) achieves a similar effect safely — if every write to a slot produces the same type, that type persists across labels without needing to hoist any guard.

### Tail Call Optimization

`tail_invoke` instructions are currently marked but execute identically to `invoke`. Actual TCO would reuse the current call frame instead of creating a new one. This requires:

- Ensuring the argument count matches (or the frame can be resized)
- No live locals needed after the call (guaranteed by tail position)
- No disruption handler on the current function (already enforced by the marking)
- VM support in mach.c to rewrite the frame in place

### Interprocedural Type Inference

Currently all type inference is intraprocedural (within a single function). Cross-function analysis could:

- Infer return types from function bodies
- Propagate argument types from call sites to callees
- Specialize functions for known argument types (cloning)

### Strength Reduction

Common patterns that could be lowered to cheaper operations when operand types are known:

- `multiply x, 2` with proven-int operands → shift left
- `divide x, 2` with proven-int → arithmetic shift right
- `modulo x, power_of_2` with proven-int → bitwise and

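These rewrites would amount to a small pattern table. A Python sketch under the same assumed tuple encoding (hypothetical — this pass does not exist yet):

```python
# Hypothetical strength-reduction rewrites for proven-int operands with a
# known constant right-hand side.
def reduce_strength(op, dest, a, b_const):
    if op == "multiply" and b_const == 2:
        return ("shl", dest, a, 1)               # x * 2  -> x << 1
    if op == "divide" and b_const == 2:
        return ("shr", dest, a, 1)               # x / 2  -> x >> 1 (arithmetic)
    if op == "modulo" and b_const > 0 and b_const & (b_const - 1) == 0:
        return ("bitand", dest, a, b_const - 1)  # x % 2^k -> x & (2^k - 1)
    return (op, dest, a, b_const)                # no cheaper form known

assert reduce_strength("multiply", 3, 2, 2) == ("shl", 3, 2, 1)
assert reduce_strength("modulo", 3, 2, 8) == ("bitand", 3, 2, 7)
assert reduce_strength("add", 3, 2, 8) == ("add", 3, 2, 8)
```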
### Numeric Unboxing (QBE/native path)

With unified arithmetic and backward type inference, the native backend can identify regions where numeric values remain in registers without boxing/unboxing:

1. **Guard once**: When backward inference proves a parameter is T_NUM, emit a single type guard at function entry.
2. **Unbox**: Convert the tagged JSValue to a raw double register.
3. **Operate**: Use native FP/int instructions directly (no function calls, no tag checks).
4. **Rebox**: Convert back to a tagged JSValue only at rebox points (function returns, calls, stores to arrays/records).

This requires inserting `unbox`/`rebox` IR annotations (no-ops in the Mach VM, meaningful only to QBE).

### Loop-Invariant Code Motion

Type checks that are invariant across loop iterations (checking a variable that doesn't change in the loop body) could be hoisted above the loop. This would require identifying loop boundaries and proving invariance.

### Algebraic Identity Optimization

With unified arithmetic, algebraic identities (x+0→x, x*1→x, x*0→0, x/1→x) require knowing operand values at compile time. Since generic `add`/`multiply` operate on any numeric type, the constant-tracking logic in `simplify_algebra` could be extended to handle these for known-constant slots.

## Nop Convention

Eliminated instructions are replaced with strings matching `_nop_<prefix>_<counter>`. The prefix identifies which pass created the nop. Nop strings are:

- Skipped during interpretation (the VM ignores them)
- Skipped during QBE emission
- Not counted in instruction statistics
- Preserved in the instruction array to maintain positional stability for jump targets
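Consumers can recognize nops with a single predicate. A Python sketch (the actual checks live in the VM and the QBE emitter, not in Python):

```python
# Hypothetical nop predicate: eliminated instructions are plain strings
# of the form _nop_<prefix>_<counter>, while real instructions are not.
def is_nop(ins):
    return isinstance(ins, str) and ins.startswith("_nop_")

assert is_nop("_nop_tc_1")        # eliminated by eliminate_type_checks
assert is_nop("_nop_dj_3")        # eliminated by eliminate_dead_jumps
assert not is_nop(("move", 1, 2)) # real instructions pass through
```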