| title | description | weight | type |
|---|---|---|---|
| Kim Encoding | Compact character and count encoding | 80 | docs |
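Kim is a character and count encoding designed by Douglas Crockford. It encodes Unicode characters and variable-length integers using continuation bytes. Kim is simpler than UTF-8 and never larger: it uses one byte fewer for characters from U+0800 to U+3FFF and for those above U+FFFF, and the same number of bytes everywhere else.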
Continuation Bytes
The fundamental idea in Kim is the continuation byte:
C D D D D D D D
- C — continue bit. If 1, read another byte. If 0, this is the last byte.
- D (7 bits) — data bits.
To decode: shift the accumulator left by 7 bits, add the 7 data bits. If the continue bit is 1, repeat with the next byte. If 0, the value is complete.
To encode: take the value, emit 7 bits at a time from most significant to least significant, setting the continue bit on all bytes except the last.
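The two procedures above can be sketched in a few lines. This is an illustrative Python sketch, not ƿit's implementation; the names `kim_encode` and `kim_decode` are hypothetical.

```python
def kim_encode(value: int) -> bytes:
    """Encode a non-negative integer as Kim continuation bytes."""
    if value < 0:
        raise ValueError("Kim encodes non-negative values only")
    # Collect 7-bit groups, least significant group first.
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    # Emit the most significant group first and set the continue bit
    # (0x80) on every byte except the last.
    groups.reverse()
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def kim_decode(data: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode one value starting at offset; return (value, next_offset)."""
    value = 0
    while True:
        byte = data[offset]
        offset += 1
        # Shift the accumulator left by 7 bits and add the 7 data bits.
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:  # continue bit clear: this was the last byte
            return value, offset


print(kim_encode(128).hex(" "))         # 81 00
print(kim_decode(bytes([0x81, 0x00])))  # (128, 2)
```

Returning the next offset from the decoder lets a caller read consecutive values from one buffer without tracking byte lengths separately.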
Character Encoding
Kim encodes Unicode codepoints directly as continuation byte sequences:
| Range | Bytes | Characters |
|---|---|---|
| U+0000 to U+007F | 1 | ASCII |
| U+0080 to U+3FFF | 2 | First quarter of BMP |
| U+4000 to U+10FFFF | 3 | All other Unicode |
Unlike UTF-16, Kim needs no surrogate pairs or escaping. Every Unicode character, including emoji and characters from the supplementary planes, is encoded in at most 3 bytes.
Examples
```
'A' (U+0041) → 41
'é' (U+00E9) → 81 69
'💩' (U+1F4A9) → 87 E9 29
```
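The byte sequences in these examples can be reproduced with a short Python sketch. The helper names `kim_encode` and `kim_encode_text` are hypothetical and used only for illustration.

```python
def kim_encode(value: int) -> bytes:
    """Encode a non-negative integer as Kim continuation bytes."""
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    groups.reverse()
    # Continue bit (0x80) on every byte except the last.
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def kim_encode_text(text: str) -> bytes:
    """Encode each code point of text as its own Kim sequence."""
    return b"".join(kim_encode(ord(ch)) for ch in text)


# 'A', 'é', and the U+1F4A9 emoji, written as escapes to avoid
# any ambiguity about the code points involved.
for ch in ("A", "\u00e9", "\U0001F4A9"):
    print(f"U+{ord(ch):04X}", kim_encode_text(ch).hex(" ").upper())
# U+0041 41
# U+00E9 81 69
# U+1F4A9 87 E9 29
```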
Count Encoding
Kim is also used for encoding counts (lengths, sizes). The same continuation byte format represents non-negative integers of arbitrary size:
| Range | Bytes |
|---|---|
| 0 to 127 | 1 |
| 128 to 16383 | 2 |
| 16384 to 2097151 | 3 |
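Each additional byte contributes 7 more bits, so the table continues in the same pattern beyond 3 bytes. A small Python sketch (the name `kim_length` is hypothetical) computes the byte count for any count:

```python
def kim_length(count: int) -> int:
    """Number of Kim bytes needed to hold a non-negative count."""
    if count < 0:
        raise ValueError("counts are non-negative")
    # Every byte carries 7 data bits; zero still occupies one byte.
    return max(1, (count.bit_length() + 6) // 7)


for count in (0, 127, 128, 16383, 16384, 2097151, 2097152):
    print(count, kim_length(count))
# 0 1
# 127 1
# 128 2
# 16383 2
# 16384 3
# 2097151 3
# 2097152 4
```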
Comparison with UTF-8
| Property | Kim | UTF-8 |
|---|---|---|
| ASCII (U+0000 to U+007F) | 1 byte | 1 byte |
| Rest of first BMP quarter (U+0080 to U+3FFF) | 2 bytes | 2-3 bytes |
| All other Unicode (U+4000 to U+10FFFF) | 3 bytes | 3-4 bytes |
| Self-synchronizing | No | Yes |
| Sortable | No | Yes |
| Simpler to implement | Yes | No |
| Byte count for counts | Variable (7 bits/byte) | Not applicable |
Kim trades self-synchronization (the ability to find character boundaries from any position) and byte-wise sortability for simplicity and compactness. Kim text is typically read sequentially from the start, so self-synchronization is rarely needed.
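The size and sortability rows can be checked with a short Python sketch. The helper names are hypothetical, and UTF-8 sizes come from Python's built-in `str.encode`.

```python
def kim_encode(value: int) -> bytes:
    """Encode a non-negative integer as Kim continuation bytes."""
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    groups.reverse()
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def kim_encode_text(text: str) -> bytes:
    return b"".join(kim_encode(ord(ch)) for ch in text)


samples = {
    "ASCII": "hello",
    "hiragana": "\u3053\u3093\u306b\u3061\u306f",  # five kana, all below U+4000
    "emoji": "\U0001F4A9",
}
for name, text in samples.items():
    print(name, len(kim_encode_text(text)), len(text.encode("utf-8")))
# ASCII 5 5
# hiragana 10 15
# emoji 3 4

# Byte-wise comparison does not follow code point order in Kim:
# U+3FFF encodes as FF 7F, which compares greater than U+4000 (81 80 00).
print(kim_encode(0x3FFF) > kim_encode(0x4000))                          # True
# UTF-8 preserves code point order under byte-wise comparison.
print(chr(0x3FFF).encode("utf-8") < chr(0x4000).encode("utf-8"))        # True
```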
Usage in ƿit
Kim is used internally by blobs and by the Nota message format.
In Blobs
The blob.write_text and blob.read_text functions use Kim to encode text into binary data:
```javascript
var blob = use('blob')
var b = blob.make()
blob.write_text(b, "hello") // Kim-encoded length + characters
stone(b)
var text = blob.read_text(b, 0) // "hello"
```
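To make the "Kim-encoded length + characters" layout concrete, here is a rough Python sketch. It is not the blob module's implementation: it assumes the length is the number of characters and that the data is byte-aligned, neither of which this page confirms. The names `write_text_sketch` and `read_text_sketch` are hypothetical.

```python
def kim_encode(value: int) -> bytes:
    """Encode a non-negative integer as Kim continuation bytes."""
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    groups.reverse()
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def kim_decode(data: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode one value starting at offset; return (value, next_offset)."""
    value = 0
    while True:
        byte = data[offset]
        offset += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return value, offset


def write_text_sketch(buffer: bytearray, text: str) -> None:
    # Assumption: the length prefix counts characters, not bytes.
    buffer.extend(kim_encode(len(text)))
    for ch in text:
        buffer.extend(kim_encode(ord(ch)))


def read_text_sketch(buffer: bytes, offset: int = 0) -> tuple[str, int]:
    count, offset = kim_decode(buffer, offset)
    chars = []
    for _ in range(count):
        code_point, offset = kim_decode(buffer, offset)
        chars.append(chr(code_point))
    return "".join(chars), offset


buf = bytearray()
write_text_sketch(buf, "hello")
print(buf.hex(" "))                # 05 68 65 6c 6c 6f
print(read_text_sketch(buf)[0])    # hello
```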
In Nota
Nota uses Kim for two purposes:
- Counts — array lengths, text lengths, blob sizes, record pair counts
- Characters — text content within Nota messages
The preamble byte of each Nota value incorporates the first few bits of a Kim-encoded count, with the continue bit indicating whether more bytes follow.
See Nota Format for the full specification.