Wota decoding very slow for large objects #32

Open
opened 2025-06-08 16:35:39 +00:00 by john · 0 comments
Owner

Take recommendations from here into consideration:

Here’s what the profile is really telling us and the quickest wins you’re likely to feel on-CPU.

1. Stop allocating so much

a. ObjectRef push/pop
wota_stack_push/pop() does a malloc/free for every nested object. Even though each call is only ~1 µs, at deep recursion it adds up.

// One-time pool; grows by chunks of e.g. 256 nodes.
// Slot 0 of each chunk is reserved as the chunk-list link and is never
// handed out, so the free list and the chunk list never share a node
// (sharing would corrupt the free list once a second chunk is allocated).
typedef struct RefPool {
    ObjectRef *free_list;
    ObjectRef *chunk_list;
} RefPool;

static ObjectRef *pool_alloc(RefPool *p) {
    if (!p->free_list) {
        // allocate a new chunk; slots 1..255 become free nodes
        ObjectRef *chunk = malloc(sizeof(ObjectRef) * 256);
        for (int i = 1; i < 256; ++i) {
            chunk[i].next = p->free_list;
            p->free_list = &chunk[i];
        }
        // link via slot 0 so we can free the whole chunk once
        chunk[0].next = p->chunk_list;
        p->chunk_list = chunk;
    }
    ObjectRef *r = p->free_list;
    p->free_list = r->next;
    return r;
}

static void pool_free(RefPool *p, ObjectRef *r) {
    r->next = p->free_list;
    p->free_list = r;
}
Replace the malloc/free pair with pool_alloc/pool_free and release the chunks once at the very end (wota_stack_free). That alone removes ~1 ms in your trace.
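For completeness, here is a standalone sketch of the pool including the one-shot teardown. ObjectRef is reduced to just its intrusive `next` link (the real struct has more fields), and each chunk's slot 0 is kept off the free list and used purely as the chunk-list link, since a node sitting on both lists would corrupt the free list as soon as a second chunk is allocated.

```c
#include <stdlib.h>

/* Minimal stand-in: the real ObjectRef lives in the wota sources and
 * carries more fields; only the intrusive `next` link matters here. */
typedef struct ObjectRef { struct ObjectRef *next; } ObjectRef;

typedef struct RefPool {
    ObjectRef *free_list;   /* recycled nodes, LIFO */
    ObjectRef *chunk_list;  /* slot 0 of each chunk links the chunks */
} RefPool;

enum { POOL_CHUNK = 256 };

static ObjectRef *pool_alloc(RefPool *p) {
    if (!p->free_list) {
        ObjectRef *chunk = malloc(sizeof(ObjectRef) * POOL_CHUNK);
        if (!chunk) return NULL;
        /* slot 0 stays off the free list: it is the chunk-list link */
        for (int i = 1; i < POOL_CHUNK; ++i) {
            chunk[i].next = p->free_list;
            p->free_list = &chunk[i];
        }
        chunk[0].next = p->chunk_list;
        p->chunk_list = chunk;
    }
    ObjectRef *r = p->free_list;
    p->free_list = r->next;
    return r;
}

static void pool_free(RefPool *p, ObjectRef *r) {
    r->next = p->free_list;
    p->free_list = r;
}

/* The wota_stack_free step: one free() per chunk, not per node. */
static void pool_destroy(RefPool *p) {
    ObjectRef *chunk = p->chunk_list;
    while (chunk) {
        ObjectRef *next = chunk[0].next;
        free(chunk);
        chunk = next;
    }
    p->free_list = p->chunk_list = NULL;
}
```

Freed nodes are recycled LIFO, so hot push/pop cycles keep touching the same cache lines.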

b. wota_buffer_grow hotspots
Every write_* does an alloc check. You spend > 5 ms of the 89 ms in realloc.

Reserve once: estimate object_count * 16 bytes and call wota_buffer_reserve(&wb, guess) before encoding. If you don’t want heuristics, raise the growth factor (from ×2 to ×4 or ×8) so you hit the system realloc far less often.
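A minimal sketch of the reserve-plus-growth-factor scheme; the WotaBuffer layout and these signatures are assumptions inferred from the wota_buffer_* names in the trace, not the actual wota API:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout -- the real WotaBuffer fields may differ. */
typedef struct WotaBuffer {
    uint8_t *data;
    size_t len, cap;
} WotaBuffer;

/* Grow capacity to at least `need`, multiplying by 4 instead of 2
 * so the number of realloc calls drops from log2(n) to log4(n). */
static int wota_buffer_reserve(WotaBuffer *wb, size_t need) {
    if (wb->cap >= need) return 0;
    size_t cap = wb->cap ? wb->cap : 64;
    while (cap < need) cap *= 4;
    uint8_t *p = realloc(wb->data, cap);
    if (!p) return -1;
    wb->data = p;
    wb->cap = cap;
    return 0;
}

static int wota_buffer_write(WotaBuffer *wb, const void *src, size_t n) {
    if (wota_buffer_reserve(wb, wb->len + n) < 0) return -1;
    memcpy(wb->data + wb->len, src, n);
    wb->len += n;
    return 0;
}
```

With one up-front `wota_buffer_reserve(&wb, object_count * 16)`, a typical encode never re-enters realloc at all.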
2. Trim property enumeration cost

JS_GetOwnPropertyNames + JS_IsFunction + JS_AtomToCStringLen dominate the encode profile.

a. Avoid stringifying atoms unless you really need UTF-8
Inside encode_object_properties you immediately do

const char *prop_name = JS_AtomToCStringLen(ctx, &plen, atoms[i]);
wota_write_text_len(&enc->wb, prop_name, plen);
Converting every atom to UTF-8 allocates and scans the string. 7 ms of the 89 ms is just JS_AtomToCStringLen, another ~4 ms freeing those strings.

Instead:

Encode the atom id (uint32_t) directly.
Put a one-time atom-dictionary after the value section (exactly what QuickJS’s own binary serializer does).
If you must stay format-compatible, cache the result:
typedef struct {
    JSAtom atom;
    const char *utf8; // interned
    size_t len;
} AtomCacheEntry;

kh_map(JSAtom, AtomCacheEntry *) cache = ...; // use khash or stb_ds

// lookup-or-insert here; free cache once after encode
Even a simple LRU of 128 atoms removes ~10–15 % of encode time for property-heavy objects.
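As an illustration of the cache shape, here is a tiny direct-mapped variant (128 slots, no LRU) that stays standalone by taking the conversion function as a callback. `JSAtom` here, the slot count, and `demo_convert` are stand-ins; a real version would call JS_AtomToCStringLen on a miss and JS_FreeCString when evicting an entry.

```c
#include <stddef.h>
#include <stdint.h>

/* JSAtom is a small integer handle in QuickJS; the expensive
 * JS_AtomToCStringLen path is modelled by a callback here. */
typedef uint32_t JSAtom;

typedef struct AtomCacheEntry {
    JSAtom atom;       /* 0 = empty slot */
    const char *utf8;  /* cached conversion result */
    size_t len;
} AtomCacheEntry;

#define ATOM_CACHE_SIZE 128  /* power of two, direct-mapped */
static AtomCacheEntry atom_cache[ATOM_CACHE_SIZE];

/* On a miss, `convert` runs once and the slot is overwritten; on a hit
 * the cached pointer comes back for free. A real version must free the
 * evicted entry's string -- omitted here for brevity. */
static const char *atom_cache_get(JSAtom atom, size_t *plen,
                                  const char *(*convert)(JSAtom, size_t *)) {
    AtomCacheEntry *e = &atom_cache[atom & (ATOM_CACHE_SIZE - 1)];
    if (e->atom != atom) {
        e->utf8 = convert(atom, &e->len);
        e->atom = atom;
    }
    *plen = e->len;
    return e->utf8;
}

/* Demo converter standing in for JS_AtomToCStringLen. */
static int demo_convert_calls;
static const char *demo_convert(JSAtom atom, size_t *len) {
    (void)atom;
    demo_convert_calls++;
    *len = 3;
    return "foo";
}
```

Direct mapping trades hit rate for a branch-free lookup; since property names repeat heavily within one object graph, even this degenerate cache captures most conversions.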

b. Filter functions without JS_GetProperty
Calling JS_GetProperty just to check “is it a function?” is overkill. Use the faster JS_GetOwnPropertyInternal with JS_PROP_HAS_VALUE plus a quick tag check, or look at the property flags that JS_GetOwnPropertyNames already gave you (ptab[i].is_enumerable && !ptab[i].is_symbol etc.). That lets you avoid most calls to JS_GetProperty, JS_FreeValue, JS_IsFunction.

3. Decode: turn the quadratic property builder into linear

Almost the entire 250 ms is JS_SetProperty thrashing object shapes:

QuickJS creates a new shape every time you add a property to a plain object that still has the default “shape hash table”.
When you add them one-by-one the engine clones / hashes / reallocs repeatedly.
a. Pre-allocate the shape once
QuickJS exposes an internal helper for this (JS_DefinePropertyValueList in newer trees). Build an array of {atom, value} pairs and define them in one call:

JSPropertyEnum *ptab = js_malloc(ctx, num * sizeof(*ptab));
JSValue *vals = js_malloc(ctx, num * sizeof(*vals));
...
obj = JS_NewObject(ctx);
JS_DefinePropertyValueList(ctx, obj, ptab, vals, num, JS_PROP_C_W_E);
Performance difference: 10–20× faster object construction.

If you don’t want to touch internals, there is a simpler trick: build an array of key/value pairs first, then iterate and set the properties. That is two passes, but it avoids shape cloning because the object switches to “dictionary” mode once its size is ≥ 4. e.g.

JSValue obj = JS_NewObject(ctx);
if (count >= 4) {
    JS_SetPropertyUint32(ctx, obj, count, JS_UNDEFINED); // forces dictionary mode
    JS_DeletePropertyInt64(ctx, obj, count, 0);
}
for (...) JS_SetProperty(ctx, obj, key_atom, val);
b. Arrays: set length after filling
JS_SetLength after JS_NewArray triggers two property updates. Instead:

JSValue arr = JS_NewArray(ctx); // len = 0
for (uint32_t i = 0; i < count; i++)
    JS_SetPropertyUint32(ctx, arr, i, elems[i]);
/* quickjs bumps length automatically */
or, if you really need the length up front, call JS_NewArray(ctx) then immediately JS_ResizeArray (internal) or fake-resize as above with a temporary last element.

4. Reduce GC churn on decode

The profile shows GC triggering repeatedly (JS_RunGCInternal). That’s because you allocate many small strings/atoms then immediately free them again.

Atom cache (same trick as encode) lets you reuse atoms instead of creating new ones each time a property name repeats.
If objects are short-lived (you throw them away after use), raise the GC trigger with JS_SetGCThreshold(rt, ...) so GC doesn’t run mid-decode. (JS_SetMemoryLimit only caps total allocation; it doesn’t control when GC runs.)
5. Micro-optimisations (worth doing only after the above)

| Hotspot | What to change | Typical win |
| --- | --- | --- |
| JS_DupValue/FreeValue pairs around primitives | Stop dup/free when you already own the reference; most paths can use the original JSValue directly | 3–5 % |
| wota_stack_has linear scan | Replace with an open-address hash table keyed by pointer | 1–2 % on deep graphs |
| wota_buffer_grow memcpy | Use reallocf (macOS) or a custom realloc_inplace to avoid the memcpy when possible | minor |
| Recursion depth | Convert encode/decode to an iterative stack (array) for better cache behaviour | minor |
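The open-address table suggested above for wota_stack_has could be sketched like this. Names and sizing policy are illustrative; and since the encoder also pops objects off its stack, a real replacement additionally needs deletion (tombstones, or rebuilding the table on pop).

```c
#include <stdint.h>
#include <stdlib.h>

/* Open-addressing set of object pointers for cycle detection,
 * replacing a linear scan over the encoder stack. */
typedef struct PtrSet {
    const void **slots;
    size_t cap;   /* power of two */
    size_t count;
} PtrSet;

static size_t ptr_hash(const void *p) {
    uint64_t h = (uint64_t)(uintptr_t)p;
    h ^= h >> 16;                 /* cheap pointer mix */
    h *= 0x9E3779B97F4A7C15ull;   /* Fibonacci hashing */
    return (size_t)(h >> 32);
}

static int ptrset_add(PtrSet *s, const void *p); /* fwd decl for rehash */

static int ptrset_grow(PtrSet *s) {
    size_t old_cap = s->cap;
    const void **old = s->slots;
    s->cap = old_cap ? old_cap * 2 : 16;
    s->slots = calloc(s->cap, sizeof(*s->slots));
    if (!s->slots) { s->slots = old; s->cap = old_cap; return -1; }
    s->count = 0;
    for (size_t i = 0; i < old_cap; ++i)
        if (old[i]) ptrset_add(s, old[i]);   /* rehash survivors */
    free(old);
    return 0;
}

/* Returns 1 if p was already present (cycle!), 0 if inserted, -1 on OOM. */
static int ptrset_add(PtrSet *s, const void *p) {
    if (s->count * 4 >= s->cap * 3 && ptrset_grow(s) < 0) return -1;
    size_t i = ptr_hash(p) & (s->cap - 1);
    while (s->slots[i]) {
        if (s->slots[i] == p) return 1;
        i = (i + 1) & (s->cap - 1);          /* linear probe */
    }
    s->slots[i] = p;
    s->count++;
    return 0;
}
```

Lookup and insert are the same probe, so membership checks stay O(1) however deep the object graph nests.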
“Do I have to change the on-disk format?”
No. The big wins (memory-pool for refs, pre-reserve buffer, avoid per-property UTF-8 conversion, dictionary-mode objects on decode) do not break compatibility. If you want the last 10 %, adopting an “atom-table + data blob” layout (exactly what QuickJS’s built-in binary serializer does) is the cleanest way.

Quick order of attack

1. Pool ObjectRef + pre-reserve WotaBuffer – 1 hr → ~10 % speedup.
2. Skip UTF-8 conversion (atom cache) – 2 hrs → another 10–15 %.
3. Dictionary-mode object build on decode – 3 hrs → 2–3× faster decode.
4. Profile again; micro-opt where still hot.
Let me know which of these paths you want sample patches for, or if you’d like to discuss changing the wire format for even larger gains.

Reference: john/cell#32