update

2026-02-08 08:25:48 -06:00
parent bae4e957e9
commit a4f3b025c5
27 changed files with 2044 additions and 8 deletions
--- a/docs/kim.md
+++ b/docs/kim.md
@@ -0,0 +1,94 @@
+---
+title: "Kim Encoding"
+description: "Compact character and count encoding"
+weight: 80
+type: "docs"
+---
+
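+Kim is a character and count encoding designed by Douglas Crockford. It encodes Unicode characters and variable-length integers using continuation bytes. Kim is simpler than UTF-8 and never needs more bytes per character: it is one byte shorter for U+0800 to U+3FFF and for everything above U+FFFF, and the same length elsewhere.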
+
+## Continuation Bytes
+
+The fundamental idea in Kim is the continuation byte:
+
+```
+C  D  D  D  D  D  D  D
+```
+
+- **C** — continue bit. If 1, read another byte. If 0, this is the last byte.
+- **D** (7 bits) — data bits.
+
+To decode: shift the accumulator left by 7 bits, add the 7 data bits. If the continue bit is 1, repeat with the next byte. If 0, the value is complete.
+
+To encode: take the value, emit 7 bits at a time from most significant to least significant, setting the continue bit on all bytes except the last.
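
The decode and encode steps above can be sketched in plain JavaScript as follows. This is a minimal illustration, not the ƿit implementation: the names `kimEncode` and `kimDecode` are hypothetical, and values are assumed to be non-negative safe integers.

```javascript
// Encode a non-negative integer as Kim continuation bytes,
// most significant 7-bit group first. (Hypothetical helper name.)
function kimEncode(value) {
  if (!Number.isSafeInteger(value) || value < 0) {
    throw new RangeError("expected a non-negative safe integer")
  }
  var bytes = [value % 128]              // last byte: continue bit clear
  value = Math.floor(value / 128)
  while (value > 0) {
    bytes.unshift(128 + (value % 128))   // earlier bytes: continue bit set
    value = Math.floor(value / 128)
  }
  return bytes
}

// Decode one value starting at `offset`.
// Returns the value and the offset of the byte after it.
function kimDecode(bytes, offset) {
  var value = 0
  var i = offset || 0
  while (true) {
    if (i >= bytes.length) {
      throw new RangeError("truncated Kim sequence")
    }
    var b = bytes[i]
    i += 1
    value = value * 128 + (b & 0x7F)     // shift left by 7, add the data bits
    if ((b & 0x80) === 0) {
      return { value: value, next: i }   // continue bit clear: value complete
    }
  }
}
```

Multiplication and division by 128 are used instead of bit shifts so the sketch also works for values above 32 bits. Returning the next offset lets a caller decode consecutive values from one buffer.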
+
+## Character Encoding
+
+Kim encodes Unicode codepoints directly as continuation byte sequences:
+
+| Range | Bytes | Characters |
+|-------|-------|------------|
+| U+0000 to U+007F | 1 | ASCII |
+| U+0080 to U+3FFF | 2 | First quarter of BMP |
+| U+4000 to U+10FFFF | 3 | All other Unicode |
+
+Unlike UTF-16, there is no need for surrogate pairs or escapement. Every Unicode character, including emoji and characters from the supplementary planes, is encoded in at most 3 bytes.
+
+### Examples
+
+```
+'A'       (U+0041)  →  41
+'é'       (U+00E9)  →  81 69
+'💩'      (U+1F4A9)  →  87 E9 29
+```
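
The byte sequences in these examples can be reproduced with a short JavaScript sketch. The helper names `kimEncodeText` and `hex` are hypothetical and only illustrate the table above; `for...of` is used because it iterates a string by code point rather than by UTF-16 code unit.

```javascript
// Encode a string as Kim bytes, one code point at a time.
// (Hypothetical helper; not the ƿit implementation.)
function kimEncodeText(text) {
  var bytes = []
  for (var ch of text) {
    var cp = ch.codePointAt(0)
    if (cp > 0x3FFF) {
      bytes.push(0x80 | (cp >> 14))          // top 7 bits, continue bit set
    }
    if (cp > 0x7F) {
      bytes.push(0x80 | ((cp >> 7) & 0x7F))  // middle 7 bits, continue bit set
    }
    bytes.push(cp & 0x7F)                    // low 7 bits, continue bit clear
  }
  return bytes
}

// Format bytes as space-separated uppercase hex pairs.
function hex(bytes) {
  return bytes.map(function (b) {
    return b.toString(16).toUpperCase().padStart(2, "0")
  }).join(" ")
}

console.log(hex(kimEncodeText("A")))          // 41
console.log(hex(kimEncodeText("\u00E9")))     // 81 69
console.log(hex(kimEncodeText("\u{1F4A9}")))  // 87 E9 29
```

The non-ASCII characters are written as escapes so the result does not depend on how the source file is normalized.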
+
+## Count Encoding
+
+Kim is also used for encoding counts (lengths, sizes). The same continuation byte format represents non-negative integers of arbitrary size, with each additional byte extending the range by 7 bits:
+
+| Range | Bytes |
+|-------|-------|
+| 0 to 127 | 1 |
+| 128 to 16383 | 2 |
+| 16384 to 2097151 | 3 |
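
The ranges in this table follow from the 7 data bits carried by each byte. A small sketch that computes the encoded size of a count, using the hypothetical name `kimByteLength`:

```javascript
// Number of Kim bytes needed for a non-negative integer count.
// Each byte carries 7 data bits. (Hypothetical helper name.)
function kimByteLength(count) {
  var length = 1
  while (count >= 128) {
    count = Math.floor(count / 128)
    length += 1
  }
  return length
}

console.log(kimByteLength(127))      // 1
console.log(kimByteLength(128))      // 2
console.log(kimByteLength(16384))    // 3
console.log(kimByteLength(2097152))  // 4
```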
+
+## Comparison with UTF-8
+
+| Property | Kim | UTF-8 |
+|----------|-----|-------|
+| ASCII | 1 byte | 1 byte |
+| U+0080 to U+3FFF | 2 bytes | 2-3 bytes |
+| U+4000 to U+10FFFF | 3 bytes | 3-4 bytes |
+| Self-synchronizing | No | Yes |
+| Sortable | No | Yes |
+| Simpler to implement | Yes | No |
+| Integer (count) encoding | Yes, 7 bits per byte | Not applicable |
+
+Kim trades self-synchronization (the ability to find character boundaries from any position) for simplicity and compactness. In practice, Kim text is accessed sequentially, so self-synchronization is not needed.
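
The size difference can be checked with a rough sketch that compares encoded lengths. The function names are hypothetical, and `TextEncoder` is assumed to be available as a global, as it is in current browsers and Node.js.

```javascript
// Kim length of a string: 1, 2, or 3 bytes per code point.
function kimLength(text) {
  var total = 0
  for (var ch of text) {
    var cp = ch.codePointAt(0)
    total += cp <= 0x7F ? 1 : cp <= 0x3FFF ? 2 : 3
  }
  return total
}

// UTF-8 length of a string, via the standard TextEncoder.
function utf8Length(text) {
  return new TextEncoder().encode(text).length
}

var samples = ["hello", "こんにちは", "日本語", "\u{1F4A9}"]
samples.forEach(function (s) {
  console.log(JSON.stringify(s), "Kim:", kimLength(s), "UTF-8:", utf8Length(s))
})
```

For these samples, ASCII text is the same size in both encodings (5 bytes), the hiragana string takes 10 bytes in Kim against 15 in UTF-8, the kanji string takes 9 bytes in both because its characters lie above U+3FFF, and the emoji takes 3 bytes against 4.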
+
+## Usage in ƿit
+
+Kim is used internally by blobs and by the Nota message format.
+
+### In Blobs
+
+The `blob.write_text` and `blob.read_text` functions use Kim to encode text into binary data:
+
+```javascript
+var blob = use('blob')
+var b = blob.make()
+blob.write_text(b, "hello")  // Kim-encoded length + characters
+stone(b)
+var text = blob.read_text(b, 0)  // "hello"
+```
+
+### In Nota
+
+Nota uses Kim for two purposes:
+
+1. **Counts** — array lengths, text lengths, blob sizes, record pair counts
+2. **Characters** — text content within Nota messages
+
+The preamble byte of each Nota value incorporates the first few bits of a Kim-encoded count, with the continue bit indicating whether more bytes follow.
+
+See [Nota Format](#nota) for the full specification.