[Gardner Analytics Office — December 2014, 6:14 AM]
The couch. The sag in the center. The particular smell of upholstery that had absorbed eleven months of coffee spills, pizza grease, and the specific chemical signature of engineers who'd worked sixteen-hour days without remembering to shower.
Ethan opened his eyes to the pressure.
Different from the Gen 2 unlock — that had been an expansion, a room growing larger while he stood inside it. This was denser. Heavier. Like a library shelf filling with books he hadn't ordered, each spine labeled with a title he could read but hadn't yet opened.
He closed his eyes again. Let the architectural awareness settle.
The Transformer: Phase 2. Stable. Published. The foundation that everything rested on, clear in his mind with the permanence of a building he'd not only visited but helped construct.
GPT-1: Phase 2, approaching Phase 3. The decoder-only architecture — implemented, trained, deployed, generating revenue through the documentation product. Two months of production use had deepened his understanding to the point where the architecture wasn't just visible but intuitive. He could mentally simulate small-scale forward passes through the decoder stack, predicting how the model would respond to specific prompts before running them. The Phase 3 transition was close — maybe weeks away, contingent on continued work with the architecture.
And now, beyond the GPT-1 horizon, four new structures materialized.
The first was massive. Not different in kind from GPT-1 — the same decoder-only design, the same causal attention, the same autoregressive generation. Different in scale. The blueprint showed dimensions that dwarfed the current model: forty-eight layers instead of twelve. Sixteen attention heads instead of twelve. An embedding dimension that had doubled and doubled again. The model was the same architecture, inflated to a size that would hold over a billion parameters.
GPT-2. Scale as innovation. The proposition that bigger models didn't just produce better text — they produced different text. Capabilities that didn't exist at smaller scales would emerge at parameter thresholds: multi-step reasoning, few-shot learning, the ability to adapt to new tasks through context alone. The architecture wasn't the innovation. The size was.
The second structure: an optimized bidirectional encoder. RoBERTa-type. The BERT concept he'd rejected in Generation 2, returned in evolved form — not the masking-based pre-training of the original, but a refined version with dynamic masking, larger batches, longer training, and the removal of the next-sentence-prediction objective that had been a drag on BERT's performance. Understanding, perfected.
Third: a permutation-based model. XLNet. Bidirectional context without masking, achieved through permutation-based training that considered all possible orderings of the input. Theoretically elegant. Computationally expensive. The kind of architecture that a research lab with unlimited compute might pursue.
Fourth: a parameter-efficient variant. ALBERT. Shared parameters across layers, reducing the model's memory footprint while preserving capability. A smaller, faster alternative for deployment-constrained environments.
Four paths. The same constraint. Choose one, lose the others until a future generation offered them again.
The choice was already made. Had been made, in a sense, since the moment he'd committed to generation over understanding in that whiteboard session. GPT-2 was the continuation: scale, capability, the road that led to GPT-3 and beyond, to the generative AI explosion that would reshape how humans and machines interacted.
Ethan sat up. The couch springs complained. Through the window, Folsom Street was still dark — the pre-dawn hour when San Francisco paused between its nightlife and its morning commute, the gap in the city's breathing that belonged to no one.
He crossed to the whiteboard. Erased the settlement strategy notes from the previous afternoon — Sarah would have photographed them already, because Sarah documented everything. Drew the GPT-2 architecture in blue marker, the same way he'd drawn every architecture since the first night in the dead man's apartment, translating the blueprint in his mind to lines on a surface.
Forty-eight decoder layers. Sixteen attention heads. Embedding dimension: 1600. Vocabulary: 100,000 tokens. Estimated parameter count: 1.5 billion. Estimated training time: 500+ GPU-hours on V100-equivalents. Estimated cost: $250,000 minimum.
A quarter of a million dollars for a single training run. Multiple runs would be needed for hyperparameter optimization. Total compute budget for GPT-2: $500,000 to $750,000.
Their bank account held $2.25 million. After the settlement legal costs, closer to $1.75 million. Monthly burn at $180,000 put the runway at approximately ten months without additional revenue or funding. GPT-2 training would consume thirty to forty percent of the remaining capital.
The math was tight. Not impossible — the documentation product was generating revenue, with two pilots signed and three more in negotiation. But revenue was growing linearly while costs were scaling exponentially. The gap between what the company earned and what the technology consumed was the gap that Series B funding needed to bridge.
Sarah arrived at seven-thirty. She stopped in the doorway, coffee in hand, and stared at the whiteboard the way she always stared at his architectural drawings — with the combination of professional admiration and personal suspicion that had defined their relationship since the first bug she'd spotted in his attention mechanism.
"You drew this before sunrise," she said. Not a question.
"The architecture resolved overnight."
"Forty-eight layers. 1.5 billion parameters." She walked to the whiteboard. Read the dimensional specifications. Her red marker appeared in her hand — reflexive, automatic, the way a doctor reaches for a stethoscope. "The training cost alone is—"
"A lot."
"A quarter million per run. And you're planning this with..." She did the mental math. "Ten months of runway."
"We need Series B before we train GPT-2 at full scale. But we can do preliminary runs at smaller scale — proof of concept, hyperparameter search — with the current budget."
"And you know the optimal dimensions because..."
"Scaling analysis from our GPT-1 training curves. The loss-per-parameter relationships we observed extrapolate to these configurations."
Sarah's red marker hovered over the dimensional annotations. She didn't draw. Didn't annotate. Stood still for five seconds, which was four seconds longer than Sarah typically stood still for anything.
"Lucky guesses," she said.
"Informed projections."
"Every time I ask how you know something, you have a different euphemism. Intuition. Projections. Analysis. Informed guesses." She capped the marker without using it. "Your vocabulary for 'I can't tell you' is impressive. I'll give you that."
The GPT-2 architecture blazed on the whiteboard. Blue lines, clean angles, the tower of decoder blocks reaching toward the ceiling of the office's available whiteboard space. The blueprint in Ethan's mind was sharper than the drawing — he could see the scaling curves, the parameter thresholds, the specific model sizes where emergent capabilities would appear. Information that wouldn't exist in any published form for years, that he carried as spatial knowledge the way other people carried the layout of their childhood homes.
"I'm choosing GPT-2," he said. "Scale. The same path we've been on, extended further. The model gets bigger. The output gets better. The capabilities that emerge at this scale will be unlike anything we've produced."
"The cost gets bigger too."
"That's what Series B is for."
Sarah picked up her coffee. Drank. Studied the whiteboard for another ten seconds. Then she set the coffee down, uncapped the red marker, and began annotating — dimensional specifications, memory requirements, estimated training throughput. The practical engineering of converting a decision into reality.
The Generation 3 unlock was complete. GPT-2 was chosen. The other three paths — RoBERTa, XLNet, ALBERT — faded from Ethan's architectural awareness over the course of the morning, retreating to the periphery where they'd remain inaccessible until a future generation offered similar options.
The team would need to grow. The compute budget would need to expand. The Series B would need to close. And underneath all of it, the fundamental tension — the gap between what Ethan knew and what he could explain — continued to widen with every generation unlock, every architectural choice that came from a source he couldn't reveal.
But the model was real. The architecture was real. The output was real. And the path from GPT-2 to GPT-3 to whatever came after was a path that, once walked, would prove its own validity through results that no explanation could improve upon.
Scale was the answer. The question was whether the money would last long enough to ask it.
