Wota decoding very slow for large objects #32
Take recommendations from here into consideration:
Here’s what the profile is really telling us, and the quickest on-CPU wins.
1. Cut per-call allocation overhead
a. ObjectRef push/pop
wota_stack_push()/wota_stack_pop() do a malloc/free for every nested object. Each call is only ~1 µs, but at deep recursion it adds up.
// One-time pool; grows by chunks of e.g. 256 nodes.
#define POOL_CHUNK 256

typedef struct RefChunk {
    struct RefChunk *next;        // links chunks so they can all be freed at once
    ObjectRef nodes[POOL_CHUNK];
} RefChunk;

typedef struct RefPool {
    ObjectRef *free_list;         // singly linked through ObjectRef.next
    RefChunk  *chunk_list;        // every chunk ever allocated
} RefPool;

static ObjectRef *pool_alloc(RefPool *p) {
    if (!p->free_list) {
        // allocate a new chunk and push all of its nodes onto the free list
        RefChunk *chunk = malloc(sizeof(*chunk));
        if (!chunk)
            return NULL;
        for (int i = 0; i < POOL_CHUNK; ++i) {
            chunk->nodes[i].next = p->free_list;
            p->free_list = &chunk->nodes[i];
        }
        // link the chunk header itself so we can free the whole lot at the end
        // (a dedicated header; reusing a node's next pointer for this would
        // corrupt the free list)
        chunk->next = p->chunk_list;
        p->chunk_list = chunk;
    }
    ObjectRef *r = p->free_list;
    p->free_list = r->next;
    return r;
}

static void pool_free(RefPool *p, ObjectRef *r) {
    r->next = p->free_list;
    p->free_list = r;
}
Replace the malloc/free pair with pool_alloc/pool_free and release the chunks once at the very end (wota_stack_free). That alone removes ~1 ms in your trace.
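Teardown is then a single walk over the chunk list; a minimal sketch using the RefChunk layout above:
// Release every chunk in one pass, e.g. from wota_stack_free().
static void pool_destroy(RefPool *p) {
    RefChunk *c = p->chunk_list;
    while (c) {
        RefChunk *next = c->next;
        free(c);
        c = next;
    }
    p->free_list = NULL;
    p->chunk_list = NULL;
}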
b. wota_buffer_grow hotspots
Every write_* does an alloc check. You spend > 5 ms of the 89 ms in realloc.
Reserve once: estimate object_count * 16 bytes and call wota_buffer_reserve(&wb, guess) before encoding. If you don’t want heuristics, raise the growth factor (from ×2 to ×4 or ×8) so the underlying realloc is hit far less often.
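If wota_buffer_reserve doesn’t exist yet, here is a minimal sketch of both ideas, assuming a WotaBuffer with data/len/cap fields (field names are a guess):
#include <stdint.h>
#include <stdlib.h>

// Ensure at least `need` spare bytes, growing by ×4 to amortize realloc.
static int wota_buffer_reserve(WotaBuffer *wb, size_t need) {
    if (wb->cap - wb->len >= need)
        return 0;
    size_t new_cap = wb->cap ? wb->cap : 64;
    while (new_cap - wb->len < need)
        new_cap *= 4;                 // ×4 instead of ×2: far fewer realloc calls
    uint8_t *p = realloc(wb->data, new_cap);
    if (!p)
        return -1;
    wb->data = p;
    wb->cap  = new_cap;
    return 0;
}
Call it once up front with the object_count * 16 guess; the per-write alloc check then almost never takes the realloc path.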
2. Trim property enumeration cost
JS_GetOwnPropertyNames + JS_IsFunction + JS_AtomToCStringLen together dominate the encode profile.
a. Avoid stringifying atoms unless you really need UTF-8
Inside encode_object_properties you immediately do
const char *prop_name = JS_AtomToCStringLen(ctx, &plen, atoms[i]);
wota_write_text_len(&enc->wb, prop_name, plen);
Converting every atom to UTF-8 allocates and scans the string: 7 ms of the 89 ms goes to JS_AtomToCStringLen alone, and another ~4 ms to freeing those strings.
Instead:
Encode the atom id (uint32_t) directly.
Emit a one-time atom dictionary after the value section (exactly what QuickJS’s own binary serializer does); see the sketch below.
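A sketch of the dictionary-emit step. Both the intern table on the encoder (enc->atoms) and the wota_write_u32 helper are hypothetical; the WotaEncoder name is a guess:
// After the value section: emit the count, then one length-prefixed UTF-8
// string per distinct atom, in first-use order. Values reference atoms by
// their index in this table instead of carrying inline strings.
static void write_atom_table(WotaEncoder *enc, JSContext *ctx) {
    wota_write_u32(&enc->wb, enc->atoms.count);
    for (uint32_t k = 0; k < enc->atoms.count; k++) {
        size_t plen;
        const char *s = JS_AtomToCStringLen(ctx, &plen, enc->atoms.items[k]);
        wota_write_text_len(&enc->wb, s, plen);
        JS_FreeCString(ctx, s);
    }
}
The decoder then converts each table entry to a JSAtom exactly once and indexes into that array while rebuilding objects.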
If you must stay format-compatible, cache the result:
typedef struct {
    JSAtom      atom;
    const char *utf8;   // interned copy of the property name
    size_t      len;
} AtomCacheEntry;

// with khash (stb_ds works too): map JSAtom -> AtomCacheEntry
KHASH_MAP_INIT_INT(atomcache, AtomCacheEntry)
khash_t(atomcache) *cache = kh_init(atomcache);
// lookup-or-insert here; free the cached strings and kh_destroy() once after encode
Even a simple LRU of 128 atoms removes ~10–15 % of encode time for property-heavy objects.
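If a full hash map feels heavy, even a 128-slot direct-mapped cache captures most repeats. A sketch reusing the AtomCacheEntry above (size and eviction policy are illustrative):
#define ATOM_CACHE_SIZE 128            // power of two for cheap masking

typedef struct {
    AtomCacheEntry slots[ATOM_CACHE_SIZE];   // zero-initialize before use
} AtomCache;

// Return the UTF-8 form of `atom`, converting only on a cache miss.
static const char *atom_cache_get(AtomCache *c, JSContext *ctx,
                                  JSAtom atom, size_t *plen) {
    AtomCacheEntry *e = &c->slots[atom & (ATOM_CACHE_SIZE - 1)];
    if (e->atom != atom) {             // miss: evict the previous occupant
        if (e->utf8)
            JS_FreeCString(ctx, e->utf8);
        e->utf8 = JS_AtomToCStringLen(ctx, &e->len, atom);
        e->atom = atom;
    }
    *plen = e->len;
    return e->utf8;
}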
b. Filter functions without JS_GetProperty
Calling JS_GetProperty just to check “is it a function?” is overkill. Fetch the property descriptor instead (the internal JS_GetOwnPropertyInternal, or the public JS_GetOwnProperty) and do a quick tag check, or lean on what JS_GetOwnPropertyNames already gave you: ptab[i].is_enumerable, plus the JS_GPN_* filter flags so unwanted keys never show up at all. That avoids most of the JS_GetProperty, JS_FreeValue, and JS_IsFunction calls.
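A sketch of the descriptor route using the public API, assuming ptab/len came from JS_GetOwnPropertyNames called with JS_GPN_STRING_MASK | JS_GPN_ENUM_ONLY (so symbols and non-enumerables are pre-filtered):
for (uint32_t i = 0; i < len; i++) {
    JSPropertyDescriptor desc;
    // Unlike JS_GetProperty, this does not invoke getters.
    int res = JS_GetOwnProperty(ctx, &desc, obj, ptab[i].atom);
    if (res <= 0)
        continue;                      // error, or property vanished meanwhile
    if (!(desc.flags & JS_PROP_GETSET) && !JS_IsFunction(ctx, desc.value)) {
        // plain non-function data property: encode ptab[i].atom + desc.value
    }
    JS_FreeValue(ctx, desc.value);     // undefined for accessors; freeing is a no-op
    JS_FreeValue(ctx, desc.getter);
    JS_FreeValue(ctx, desc.setter);
}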
3. Decode: build objects without shape churn
Almost the entire 250 ms is JS_SetProperty thrashing object shapes:
QuickJS creates a new shape every time you add a property to a plain object that still uses the default shape hash table.
Adding properties one by one makes the engine clone / hash / realloc those shapes repeatedly.
a. Pre-allocate the shape once
QuickJS exposes an internal helper for this (JS_DefinePropertyValueList in newer trees). Build an array of {atom, value} pairs and define them all in one call:
JSPropertyEnum *ptab = js_malloc(ctx, num * sizeof(*ptab));
JSValue *vals = js_malloc(ctx, num * sizeof(*vals));
... // fill ptab[i].atom and vals[i] while decoding
JSValue obj = JS_NewObject(ctx);
// define everything at once so the final shape is built a single time
JS_DefinePropertyValueList(ctx, obj, ptab, vals, num, JS_PROP_C_W_E);
Performance difference: 10–20× faster object construction.
If you don’t want to touch internals, a simpler trick: build an array of key/value pairs first, then set them in a second pass. Two passes, but it avoids repeated shape cloning because the object can be flipped into “dictionary” mode once it holds ≥ 4 properties. E.g.:
JSValue obj = JS_NewObject(ctx);
if (count >= 4) {
    // touch-and-delete a dummy high index to force dictionary mode up front
    JS_SetPropertyUint32(ctx, obj, count, JS_UNDEFINED);
    JSAtom tmp = JS_NewAtomUInt32(ctx, count);
    JS_DeleteProperty(ctx, obj, tmp, 0);
    JS_FreeAtom(ctx, tmp);
}
for (...)  // second pass: set the real properties
    JS_SetProperty(ctx, obj, key_atom, val);
b. Arrays: set length after filling
JS_SetLength after JS_NewArray triggers two property updates. Instead:
JSValue arr = JS_NewArray(ctx);              /* length starts at 0 */
for (uint32_t i = 0; i < count; i++)
    JS_SetPropertyUint32(ctx, arr, i, elems[i]);
/* QuickJS bumps .length automatically on each indexed store */
Or, if you really need the length up front, call JS_NewArray(ctx) and then either resize immediately via JS_ResizeArray (internal, where your tree has it) or fake-resize with a temporary last element, as sketched below.
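The fake-resize variant (a sketch; it relies on QuickJS growing the array once when the highest index is written first):
JSValue arr = JS_NewArray(ctx);
if (count > 0)   /* writing the last index first sets length = count in one step */
    JS_SetPropertyUint32(ctx, arr, count - 1, JS_UNDEFINED);
for (uint32_t i = 0; i < count; i++)
    JS_SetPropertyUint32(ctx, arr, i, elems[i]);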
4. Reduce GC pressure during decode
The profile shows the GC triggering repeatedly (JS_RunGCInternal). That’s because you allocate many small strings/atoms and then immediately free them again.
An atom cache (the same trick as on the encode side) lets you reuse atoms instead of re-creating them each time a property name repeats.
If the decoded objects are short-lived (you throw them away right after use), raise the GC threshold with JS_SetGCThreshold(rt, ...) so the collector doesn’t run mid-decode. (JS_SetMemoryLimit only caps total allocation; it’s the GC threshold that decides when the collector fires.)
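A sketch of bracketing the decode this way (the restored threshold is illustrative, and wota_decode stands in for your actual entry point):
JS_SetGCThreshold(rt, (size_t)-1);       /* effectively: never trigger on alloc */
JSValue v = wota_decode(ctx, buf, len);  /* hypothetical decode entry point */
JS_SetGCThreshold(rt, 256 * 1024);       /* restore a sane threshold */
JS_RunGC(rt);                            /* one collection instead of many */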
5. Micro-optimisations (worth doing only after the above)
| Hotspot | What to change | Typical win |
| --- | --- | --- |
| JS_DupValue / JS_FreeValue pairs around primitives | Stop the dup/free when you already own the reference; most paths can use the original JSValue directly | 3–5 % |
| wota_stack_has linear scan | Replace with an open-addressing hash table keyed by pointer (sketch below) | 1–2 % on deep graphs |
| wota_buffer_grow memcpy | Use reallocf (macOS) or a custom realloc_inplace to skip the memcpy when possible | minor |
| Recursion depth | Convert encode/decode to an iterative explicit stack (array) for better cache behaviour | minor |
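For the wota_stack_has row, a minimal open-addressing set keyed by pointer could look like this (fixed capacity for brevity; names and sizes are illustrative):
#include <stddef.h>
#include <stdint.h>

#define SEEN_CAP 1024   /* power of two; grow + rehash when ~70 % full */

typedef struct {
    const void *slots[SEEN_CAP];   /* zero-initialize before use */
} SeenSet;

/* Returns 1 if ptr was already present, 0 if it was just inserted. */
static int seen_test_and_set(SeenSet *s, const void *ptr) {
    size_t i = (((uintptr_t)ptr >> 4) * 2654435761u) & (SEEN_CAP - 1);
    for (;;) {
        if (s->slots[i] == ptr)
            return 1;               /* shared reference / cycle detected */
        if (s->slots[i] == NULL) {
            s->slots[i] = ptr;      /* first visit: record it */
            return 0;
        }
        i = (i + 1) & (SEEN_CAP - 1);   /* linear probe */
    }
}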
“Do I have to change the on-disk format?”
No. The big wins (memory-pool for refs, pre-reserve buffer, avoid per-property UTF-8 conversion, dictionary-mode objects on decode) do not break compatibility. If you want the last 10 %, adopting an “atom-table + data blob” layout (exactly what QuickJS’s built-in binary serializer does) is the cleanest way.
Quick order of attack
Pool ObjectRef + pre-reserve WotaBuffer – 1 hr → ~10 % speedup.
Skip UTF-8 conversion (atom cache) – 2 hrs → another 10–15 %.
Dictionary-mode object build on decode – 3 hrs → 2–3× faster decode.
Profile again; micro-opt where still hot.
Let me know which of these paths you want sample patches for, or if you’d like to discuss changing the wire format for even larger gains.