cell/kim.md at 45556c344db2e4411cbc67d468a487a8535ea686

2026-02-08 08:25:48 -06:00

Kim Encoding

Compact character and count encoding

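Kim is a character and count encoding designed by Douglas Crockford. It encodes Unicode characters and variable-length integers using continuation bytes. Kim is simpler than UTF-8 and never takes more bytes per character: it matches UTF-8 for ASCII and is one byte shorter for U+0800 to U+3FFF and for everything above U+FFFF.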

Continuation Bytes

The fundamental idea in Kim is the continuation byte:

C  D  D  D  D  D  D  D

C — continue bit. If 1, read another byte. If 0, this is the last byte.
D (7 bits) — data bits.

To decode: start with an accumulator of zero. For each byte, shift the accumulator left by 7 bits and add the byte's 7 data bits. If the continue bit is 1, repeat with the next byte. If it is 0, the value is complete.

To encode: take the value, emit 7 bits at a time from most significant to least significant, setting the continue bit on all bytes except the last.
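The two procedures above can be sketched in Python as follows. This is a minimal illustration and not the project's implementation; the names `kim_encode` and `kim_decode` are chosen here for the example, and truncated input is left to surface as an `IndexError`.

```python
def kim_encode(value: int) -> bytes:
    """Encode a non-negative integer as 7-bit groups, most significant first."""
    if value < 0:
        raise ValueError("Kim encodes non-negative values only")
    groups = [value & 0x7F]              # last byte: continue bit clear
    value >>= 7
    while value:
        groups.append(0x80 | (value & 0x7F))  # earlier bytes: continue bit set
        value >>= 7
    return bytes(reversed(groups))       # groups were collected low-to-high


def kim_decode(data: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode one value starting at offset; return (value, next_offset)."""
    value = 0
    while True:
        byte = data[offset]
        offset += 1
        value = (value << 7) | (byte & 0x7F)  # shift in 7 data bits
        if not byte & 0x80:                   # continue bit clear: done
            return value, offset
```

Returning the next offset alongside the value lets a caller decode several values back to back from the same buffer.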

Character Encoding

Kim encodes Unicode codepoints directly as continuation byte sequences:

Range	Bytes	Characters
U+0000 to U+007F	1	ASCII
U+0080 to U+3FFF	2	First quarter of BMP
U+4000 to U+10FFFF	3	All other Unicode

Unlike UTF-16, there is no need for surrogate pairs or escapement. Every Unicode character, including emoji and characters from the supplementary planes, is encoded in at most 3 bytes.

Examples

'A'       (U+0041)  →  41
'é'       (U+00E9)  →  81 69
'💩'      (U+1F4A9)  →  87 E9 29
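The byte sequences in these examples can be reproduced with a short Python sketch that encodes a whole string one code point at a time. The function names `kim_encode_text` and `kim_decode_text` are hypothetical, and the decoder assumes well-formed input: a trailing incomplete sequence is silently ignored.

```python
def kim_encode_text(text: str) -> bytes:
    """Encode each code point of text as a Kim continuation byte sequence."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        # Find the bit position of the highest non-zero 7-bit group.
        shift = 0
        while cp >> (shift + 7):
            shift += 7
        # Emit leading groups with the continue bit set.
        while shift > 0:
            out.append(0x80 | ((cp >> shift) & 0x7F))
            shift -= 7
        out.append(cp & 0x7F)  # final group: continue bit clear
    return bytes(out)


def kim_decode_text(data: bytes) -> str:
    """Decode a run of Kim-encoded code points back into a string."""
    chars = []
    cp = 0
    for byte in data:
        cp = (cp << 7) | (byte & 0x7F)
        if not byte & 0x80:      # last byte of this character
            chars.append(chr(cp))
            cp = 0
    return "".join(chars)
```

With these definitions, `kim_encode_text("\u00e9").hex()` gives `8169`, matching the second example.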

Count Encoding

Kim is also used for encoding counts (lengths, sizes). The same continuation byte format represents non-negative integers of arbitrary size:

Range	Bytes
0 to 127	1
128 to 16383	2
16384 to 2097151	3
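The table continues in the same pattern: each extra byte adds 7 bits of range. As a small sketch, the byte count for any non-negative integer can be computed directly from its bit length (the helper name `kim_length` is an assumption made for this example):

```python
def kim_length(count: int) -> int:
    """Number of bytes Kim needs to encode a non-negative integer."""
    if count < 0:
        raise ValueError("Kim encodes non-negative values only")
    # Each byte carries 7 data bits; zero still occupies one byte.
    return max(1, (count.bit_length() + 6) // 7)
```

For instance, `kim_length(2097151)` is 3 and `kim_length(2097152)` is 4, since 2097152 needs 22 bits.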

Comparison with UTF-8

Property	Kim	UTF-8
ASCII	1 byte	1 byte
U+0080 to U+3FFF	2 bytes	2-3 bytes
U+4000 to U+10FFFF	3 bytes	3-4 bytes
Self-synchronizing	No	Yes
Sortable	No	Yes
Simpler to implement	Yes	No
Integer counts	Variable length (7 bits per byte)	Not applicable

Kim trades self-synchronization (the ability to find character boundaries from any position) for simplicity and compactness. In practice, Kim text is accessed sequentially, so self-synchronization is not needed.
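The size and sortability rows can be checked with a short Python sketch. The helper names here are illustrative only, and UTF-8 sizes come from Python's built-in codec, so surrogate code points (U+D800 to U+DFFF) are out of scope for `size_comparison`.

```python
def kim_bytes(cp: int) -> bytes:
    """Kim encoding of a single code point."""
    groups = [cp & 0x7F]
    cp >>= 7
    while cp:
        groups.append(0x80 | (cp & 0x7F))
        cp >>= 7
    return bytes(reversed(groups))


def size_comparison(cp: int) -> tuple[int, int]:
    """Return (Kim byte count, UTF-8 byte count) for one code point."""
    return len(kim_bytes(cp)), len(chr(cp).encode("utf-8"))


def bytewise_order_matches(a: int, b: int) -> bool:
    """True if comparing the encoded bytes orders a and b like their code points."""
    return (kim_bytes(a) < kim_bytes(b)) == (a < b)
```

U+3FFF encodes as `FF 7F` while U+4000 encodes as `81 80 00`, so a bytewise comparison puts the larger code point first. That is why `bytewise_order_matches(0x3FFF, 0x4000)` is false and Kim is listed as not sortable.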

Usage in ƿit

Kim is used internally by blobs and by the Nota message format.

In Blobs

The blob.write_text and blob.read_text functions use Kim to encode text into binary data:

var blob = use('blob')
var b = blob.make()
blob.write_text(b, "hello")  // Kim-encoded length + characters
stone(b)
var text = blob.read_text(b, 0)  // "hello"
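The comment above describes the stored text as a Kim-encoded length followed by the characters. The Python sketch below illustrates one such length-prefixed layout. It is not the blob module's actual format: it assumes the length is a count of characters rather than bytes, it works on whole bytes, and `pack_text` and `unpack_text` are hypothetical names.

```python
def _kim(value: int) -> bytes:
    """Kim encoding of one non-negative integer."""
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(0x80 | (value & 0x7F))
        value >>= 7
    return bytes(reversed(groups))


def pack_text(text: str) -> bytes:
    """Assumed layout: Kim character count, then each character in Kim."""
    return _kim(len(text)) + b"".join(_kim(ord(ch)) for ch in text)


def unpack_text(data: bytes, offset: int = 0) -> tuple[str, int]:
    """Read one length-prefixed text; return (text, next_offset)."""
    def read_value(pos: int) -> tuple[int, int]:
        value = 0
        while True:
            byte = data[pos]
            pos += 1
            value = (value << 7) | (byte & 0x7F)
            if not byte & 0x80:
                return value, pos

    count, offset = read_value(offset)      # assumed: number of characters
    chars = []
    for _ in range(count):
        cp, offset = read_value(offset)
        chars.append(chr(cp))
    return "".join(chars), offset
```

Under these assumptions, `pack_text("hello")` produces the byte `05` followed by the five ASCII bytes of the text.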

In Nota

Nota uses Kim for two purposes:

Counts — array lengths, text lengths, blob sizes, record pair counts
Characters — text content within Nota messages

The preamble byte of each Nota value incorporates the first few bits of a Kim-encoded count, with the continue bit indicating whether more bytes follow.

See Nota Format for the full specification.
