
Under the hood

How the engine that makes your books actually works

Bibbly is a Bayesian preference engine wrapped in a children's book product. The marketing pages explain what arrives at your door. This page explains how the thing works underneath.

Sections

  1. Why pairwise, not filters
  2. Your family as a schema
  3. The art-style cascade
  4. Image generation as a DAG
  5. The dual-judge soft-gate
  6. Cost economics
  7. What we are still working on

01

Why pairwise, not filters

Most personalization products ask you to filter. Pick a hair color. Pick a skin tone. Pick a personality from a dropdown. Filtering scales linearly with the number of options and breaks down past about seven. Hick's law says decision time grows with the log of the choice count, but that assumes the choices are commensurable. They almost never are.

Pairwise comparison sidesteps the whole problem. Two options, pick one, repeat. Each vote carries at most one bit of signal, and a Bayesian model converts that stream of bits into a posterior over a d-dimensional feature vector. The Bradley-Terry model says:

P(prefer A over B) = sigmoid(w^T * (features_A - features_B))

The model can only learn along dimensions you give it. If two candidates differ in ways the feature vector doesn't capture, the model treats them as identical. So feature design is the thing that determines whether 50 votes is enough or whether you need 500.
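In code, that likelihood plus a single vote update looks something like this. A minimal sketch: the 3-d feature space, the candidate books, and the learning rate are illustrative stand-ins, not the production model.

```python
import math

def bt_prob(w, feat_a, feat_b):
    """Bradley-Terry: P(prefer A over B) = sigmoid(w . (feat_a - feat_b))."""
    z = sum(wi * (fa - fb) for wi, fa, fb in zip(w, feat_a, feat_b))
    return 1.0 / (1.0 + math.exp(-z))

def update(w, feat_a, feat_b, chose_a, lr=0.5):
    """One logistic-regression gradient step on a single vote."""
    p = bt_prob(w, feat_a, feat_b)
    err = (1.0 if chose_a else 0.0) - p
    return [wi + lr * err * (fa - fb) for wi, fa, fb in zip(w, feat_a, feat_b)]

# Illustrative 3-d feature space: [whimsy, rhyme, animal-protagonist]
w = [0.0, 0.0, 0.0]
book_a, book_b = [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]
for _ in range(20):                       # the family keeps picking A
    w = update(w, book_a, book_b, chose_a=True)
assert bt_prob(w, book_a, book_b) > 0.9   # posterior mean now favors A
```

Note that if `feat_a` and `feat_b` are identical along every axis, `z` is zero and the vote teaches the model nothing, which is the feature-design point above in executable form.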

We use D-optimal active learning (Thompson Sampling over the posterior) to pick the next pair, so each vote keeps delivering about 0.8 effective bits, where a randomly chosen pair delivers less and less as the posterior concentrates. For a d=20 feature space, the MAP estimate stabilizes around 3d = 60 votes. The 80% credible intervals tighten by 5d = 100 votes. Top-K candidates separate cleanly by 8d = 160.
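A Thompson-style pair pick can be sketched as: draw two weight vectors from the posterior and offer the pair of candidates they disagree about most. Everything here (the axis-aligned Gaussian posterior, the candidate features, the disagreement score) is an illustrative assumption, not Bibbly's actual sampler.

```python
import random

def sample_w(mean, std, rng):
    """Draw one weight vector from an axis-aligned Gaussian posterior."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]

def pick_pair(candidates, mean, std, rng):
    """Two posterior samples; show the pair they rank in opposite directions."""
    w1, w2 = sample_w(mean, std, rng), sample_w(mean, std, rng)
    def score(w, x):
        return sum(wi * xi for wi, xi in zip(w, x))
    best, best_gap = None, -float("inf")
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            d = [a - b for a, b in zip(candidates[i], candidates[j])]
            gap = -score(w1, d) * score(w2, d)  # positive when samples disagree
            if gap > best_gap:
                best_gap, best = gap, (i, j)
    return best

rng = random.Random(0)
mean, std = [0.5, -0.2, 0.1], [0.6, 0.6, 0.6]   # illustrative posterior
candidates = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
i, j = pick_pair(candidates, mean, std, rng)
```

Pairs the posterior already decides confidently score negative and never get shown, which is why the votes stay informative instead of confirming what the model already knows.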

That is the whole engine. Everything else (book generation, art-style consistency, page-level quality gates) is downstream of "we know what this family actually likes."

02

Your family as a schema

A character in Bibbly is not a name plus a face. It is a structured object with three layers.

Personality dipole vector. Five axes, each in [-1, 1]. The book engine uses these to write scenes that feel like the character.

{
  "gentle_rambunctious": 0.4,
  "shy_outgoing": -0.3,
  "silly_serious": -0.5,
  "patient_impatient": 0.0,
  "snuggly_independent": 0.6
}
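A minimal sketch of how a dipole vector might steer the writer: map each axis past a threshold to an adjective the prompt can use, and leave near-zero axes neutral. The threshold value and the mapping function are illustrative, not the engine's actual logic.

```python
AXES = {
    "gentle_rambunctious": ("gentle", "rambunctious"),
    "shy_outgoing": ("shy", "outgoing"),
    "silly_serious": ("silly", "serious"),
    "patient_impatient": ("patient", "impatient"),
    "snuggly_independent": ("snuggly", "independent"),
}

def dipole_to_traits(vec, threshold=0.25):
    """Map each axis in [-1, 1] to an adjective; near-zero axes stay neutral."""
    traits = []
    for axis, (neg, pos) in AXES.items():
        v = vec.get(axis, 0.0)
        if v >= threshold:
            traits.append(pos)
        elif v <= -threshold:
            traits.append(neg)
    return traits

vec = {"gentle_rambunctious": 0.4, "shy_outgoing": -0.3,
       "silly_serious": -0.5, "patient_impatient": 0.0,
       "snuggly_independent": 0.6}
assert dipole_to_traits(vec) == ["rambunctious", "shy", "silly", "independent"]
```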

Universe tonality spec. Voice rules that govern every book in the universe. Lock once, apply always.

{
  "voice_register": "silly-playful",
  "pov": "second-direct",
  "tense": "present",
  "vocabulary_age_target": "toddler-18-36mo",
  "sentence_rhythm": { "avg_words_per_sentence": 6, "max_words": 10 },
  "reference_passage": "The little fox tiptoed across the porch...",
  "dos_and_donts": "DO use second-person direct address. DONT use abstract emotion words."
}

Universe art-style rules. What makes the visuals look like your family's books and not someone else's.

{
  "medium": "crayon",
  "paper": { "surface": "cream-cover-stock", "hex": "#F5EFE0" },
  "palette": { "named": ["dusty-rose", "olive-cream", "soft-charcoal"] },
  "head_body_ratio": "1.4 (toddler-friendly chibi proportions)",
  "negative_constraints": ["no digital painting", "no photorealism", "no harsh black outlines"]
}

The schemas are why "we put your name in the book" and "we put your kid in the book" are different products. Names are strings. Characters are vectors. The vectors are what let the same family show up consistently across books, and they are what let the LLM that writes the next page know what would and would not be in character.

03

The art-style cascade

When you pick an art style during onboarding, the choice cascades. The cast composite sheet renders in that style. Every character's hero shot renders in that style. Every page background, every prop, every scene. When we add a new character to your universe six months later, the cascade re-applies and the new character renders in the same style as the original cast.

This is not a stylistic suggestion in a prompt. The art_style.rules schema above is concrete enough that two independent renders agree about what "your style" means. The negative constraints matter as much as the positive ones: "no digital painting" rules out a whole class of plausible-looking outputs that would feel wrong next to the rest of the cast.

The lock is intentional. Most personalized-content services rotate art every few months because it keeps the supply pipeline cheap. The cost is that the books feel like they belong to nobody. We pay the cost the other way: pick a style, keep it, generate every future image consistent with it.
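One way to read the cascade is as a strict merge: every render spec in the universe starts from the locked style block, and scene-level inputs can add fields but never override them. A sketch under that assumption (the helper function and the field subset are illustrative):

```python
ART_STYLE = {   # locked once at onboarding; subset of art_style.rules
    "medium": "crayon",
    "palette": ["dusty-rose", "olive-cream", "soft-charcoal"],
    "negative_constraints": ["no digital painting", "no photorealism"],
}

def render_spec(subject, **scene):
    """Every node's spec starts from the locked style; scenes add, never override."""
    spec = dict(ART_STYLE)
    spec["subject"] = subject
    for k, v in scene.items():
        if k in ART_STYLE:
            raise ValueError(f"style field '{k}' is locked by the universe")
        spec[k] = v
    return spec

hero = render_spec("Mara at 18mo, three-quarter pose")
page = render_spec("Mara on the porch", time_of_day="dusk")
assert hero["medium"] == page["medium"] == "crayon"   # cascade holds for every node
```

A character added six months later goes through the same `render_spec`, which is the whole mechanism: the style is an input every node shares, not a per-render choice.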

04

Image generation as a DAG

A finished page in a Bibbly book is the leaf of a directed acyclic graph. Each node is a content-addressed image (key = hash of inputs + spec), and each edge is a dependency on a parent node.

art_style.rules + cast composite (parents)
        |
        v
character hero (e.g., "Mara at 18mo, three-quarter pose")
        |
        v
prev page final + scene env prior + object composite (parents)
        |
        v
page final (the leaf)

Content addressing means two pages that needed the same hero image share the hash and skip the regeneration. It also means every image carries its provenance. Inspecting a finished book, you can walk back from any panel to the cast sheet that anchored the character, the style spec that fixed the medium, and the env prior that placed the scene. Not "we trust the pipeline." A literal hash chain.

This matters when something looks wrong on page 7. The repair flow finds the failing leaf, identifies the parent set, and decides which parent to regenerate. Because the rest of the DAG is content-addressed, regenerating one parent invalidates exactly the descendants that depend on it. The other 9 pages stay pinned.
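The two mechanisms, content-addressed keys and descendant invalidation, can be sketched with a toy parent map (node names and the hashing scheme are illustrative):

```python
import hashlib, json

def node_key(spec, parent_keys):
    """Content address: hash of the node's own spec plus its parents' hashes."""
    payload = json.dumps({"spec": spec, "parents": sorted(parent_keys)},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

assert node_key({"m": "crayon"}, []) == node_key({"m": "crayon"}, [])  # deterministic

# Toy DAG: node -> list of parent node names
DAG = {
    "style": [], "cast": ["style"],
    "hero_mara": ["style", "cast"],
    "page6": ["hero_mara"], "page7": ["hero_mara"], "page8": [],
}

def descendants(dag, node):
    """Everything downstream of `node`: exactly what a regeneration invalidates."""
    out, frontier = set(), [node]
    while frontier:
        n = frontier.pop()
        for child, parents in dag.items():
            if n in parents and child not in out:
                out.add(child)
                frontier.append(child)
    return out

assert descendants(DAG, "hero_mara") == {"page6", "page7"}   # page8 stays pinned
assert descendants(DAG, "style") == {"cast", "hero_mara", "page6", "page7"}
```

Because a node's key folds in its parents' keys, regenerating a parent changes every downstream key automatically; nothing outside `descendants(...)` can be affected, which is the "other pages stay pinned" guarantee.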

05

The dual-judge soft-gate

Every generated book passes through two independent LLM judges before it ships. Both run on gemini-3.1-pro-preview. Both return Zod-validated verdicts with per-axis scores.

Tonality judge reads the text against the universe's tonality_spec. Did this page hit the voice register? The vocabulary age target? The sentence rhythm? It returns pass | borderline | fail, per-page findings, and axis scores.

Visual judge reads the rendered images against art_style.rules. Did the character look like itself? Did the medium stay consistent? It returns the same shape, plus cross-image consistency and outlier flags.

The judges are decontextualized. They do not know what the prior version looked like, who generated it, or what is at stake. They see the artifact and the spec.

When a judge returns borderline or fail, a fix loop runs. The classifier picks the cheapest tier of repair that has a plausible chance of resolving the failure: regenerate the offending image, regenerate the page text, or in the heaviest case regenerate the page plot. We snapshot before each tier so a regression rolls back. A book that started borderline and got worse during the fix loop ships as the original, not as the worse version.
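The tier ladder with snapshot-and-rollback can be sketched like this; the judge and repair functions are stand-ins, and the tier names are illustrative:

```python
RANK = {"fail": 0, "borderline": 1, "pass": 2}
TIERS = ["regen_image", "regen_text", "regen_plot"]   # cheapest first

def fix_loop(book, judge, repair):
    """Escalate through repair tiers; a repair that judges worse than the
    snapshot is discarded, so the worst case ships the original book."""
    best = book
    for tier in TIERS:
        if judge(best) == "pass":
            break
        candidate = repair(best, tier)       # `best` is the snapshot, untouched
        if RANK[judge(candidate)] > RANK[judge(best)]:
            best = candidate                 # improvement: adopt it
    return best

# Stand-ins: only the plot regeneration helps; the image repair makes it worse.
def judge(b):
    return b["verdict"]

def repair(b, tier):
    return {"verdict": {"regen_image": "fail",
                        "regen_text": "borderline",
                        "regen_plot": "pass"}[tier]}

assert fix_loop({"verdict": "borderline"}, judge, repair)["verdict"] == "pass"
# If every repair regresses, the original borderline book ships unchanged:
assert fix_loop({"verdict": "borderline"}, judge,
                lambda b, t: {"verdict": "fail"})["verdict"] == "borderline"
```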

This is why we can ship books that don't need a human in the loop on every page, and why the failure mode is "ships the same book again" rather than "ships something off-style."

06

Cost economics

A finished Bibbly board book is about 10 pages. Each page goes through ~3 image renders during composition (cast composite, scene composite, final), and per-image cost on the providers we use sits in the $0.05 - $0.30 range depending on quality tier and model. The marginal compute cost of a 10-page book is roughly $1.50 - $9.00, with most books landing near $3.
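The range is just the multiplication written out:

```python
pages, renders_per_page = 10, 3
low_cents, high_cents = 5, 30           # $0.05 and $0.30 per image, as quoted
images = pages * renders_per_page       # 30 renders per book
low_usd = images * low_cents / 100
high_usd = images * high_cents / 100
assert (images, low_usd, high_usd) == (30, 1.5, 9.0)
```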

Re-renders are cheap because the DAG hashes the upstream parents. If only the page-7 final regenerates, you pay one image. The fix-loop tier ladder is calibrated around this:

Tier  What regenerates               Cost    When it triggers
1     Single panel                   ~$0.20  Visual judge flags one outlier
2     Page text or single composite  ~$0.50  Tonality borderline, single page
3     Full page plot                 ~$1.50  Both judges fail same page

Print and fulfillment is the dominant cost line in a Story Box, well above compute, which means we can run the fix loop generously without breaking the unit economics.

07

What we are still working on

The judges are calibrated against the universe specs, but the specs themselves are written by us. The next step is the calibration bake-off: a triple-blind design crossing 3 canons, 7 personas, and 3 critic models, built to surface which critics' opinions track real-reader taste across personas. The spec is written, the runner code is built, the database wiring is not yet shipped.

The image generation models keep moving. The image bake-off tool runs blinded pairwise comparisons between models, prompts, and quality tiers on a fixed visual canon. We use it to decide when to swap a generation provider or change a default quality setting. The results inform the cost ladder above.

Per-judge calibration is still mostly a human read. We have axis scores; we do not yet have a model that says "when judge A says 4.5 and judge B says 4.0, ship; when both say 4.0, retry." Building it is downstream of having enough labeled paired data, which the calibration bake-off produces.

Honest open question: at what point does the dipole vector run out of expressiveness? Five axes is a small space. We picked them by hand. The Bayesian engine would happily handle 25, but the UX of asking parents to rate themselves on 25 axes is bad. The right answer is probably 5 visible axes plus an embedding-augmented latent space behind them, but that is not yet built.

End of the technical tour

The product is the easy part to look at

The board book that arrives at your door is the leaf of every system above. If you want to see what comes out at the end, build a free first book in five minutes.

Build your free first book