What worries me a bit is that the two guys start with a dinner and with how important the telephone calls were, not with "I had this great code".
Considering the Apache 2.0 license the most freedom-supporting license, a free-of-charge ticket is a bad thing once you face the consequences of corporate rip-off. We live in a world in which TV shows teach you what cocaine-addicted CIA officers do to journalists who publish their activities on the basis of their constitutional rights and duties, while on a mission forbidden by no less than the U.S. Congress.
That is a whole different level of conspiracy-theory potential, and I am glad about my military training and gang attitude now more than ever!
Die Hard Smith. My name is Die Hard Smith. You can call me DH.
The Original Blueprint: Google's Transformer
OpenAI did not invent the core math behind their models. The "technical base" for GPT (Generative Pre-trained Transformer) is the Transformer architecture, which was introduced by Google researchers in the 2017 paper “Attention Is All You Need”.
GPT-1 (2018): OpenAI’s first major model took Google's Transformer and applied it to generative pre-training: a decoder-only variant trained to predict the next word on large amounts of unlabeled text, then fine-tuned for specific tasks.
Code Base: Early versions likely utilized or were inspired by open-source libraries like TensorFlow; OpenAI's published GPT-1 fine-tuning code was written in TensorFlow.
Transition to Proprietary Code
As OpenAI transitioned from a non-profit to a "capped-profit" entity, they stopped sharing their full source code.
Closed-Source Pivot: Starting with GPT-2, OpenAI began withholding full model code and weights, initially citing concerns about misuse, and released the model only in stages.
The "Black Box": For GPT-3, GPT-4, and GPT-5, the underlying code—the specific training procedures, datasets, and model weights—remains proprietary.
Recent Open-Source Models: In August 2025, OpenAI released gpt-oss-120b, their first open-weight model since GPT-2, under the Apache 2.0 license.
💡 Key Insight: The "MoE" Evolution
Recent findings suggest that OpenAI's modern technical base (like GPT-4 and the Enterprise models) moved away from a single massive model to a Mixture of Experts (MoE) setup.
This approach uses a "router" to send queries to smaller, specialized sub-models.
This allows the model to be more efficient, activating only a fraction of its total parameters for any given query.
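The routing idea described above can be sketched in a few lines. This is a toy illustration only, not OpenAI's proprietary implementation; all sizes and names here are invented:

```python
import numpy as np

# Toy Mixture-of-Experts layer: a "router" scores all experts for a token,
# then only the top-k experts actually run, so just a fraction of the
# total parameters is active per query.

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

# Each "expert" is a tiny feed-forward network (a single weight matrix here).
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
router_w = rng.normal(size=(d_model, n_experts))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(token_vec):
    scores = softmax(token_vec @ router_w)   # router probabilities per expert
    chosen = np.argsort(scores)[-top_k:]     # indices of the top-k experts
    # Only the chosen experts compute anything; their outputs are blended
    # by the router's weights.
    out = sum(scores[i] * (token_vec @ experts[i]) for i in chosen)
    return out, chosen

token = rng.normal(size=d_model)
output, used = moe_forward(token)
print("experts used:", sorted(used.tolist()), "of", n_experts)
```

Real MoE models apply this per token inside each Transformer layer and add load-balancing tricks so no single expert dominates; the sketch keeps only the routing core.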
External Components
While the core LLM code is internal, OpenAI relies on external ecosystems for delivery:
Microsoft Azure: The "Enterprise" aspect of their code is heavily integrated with Microsoft’s Azure cloud infrastructure, which hosts both training and inference.
Public Datasets: The "intelligence" in the code comes from processing massive amounts of publicly available text, such as Common Crawl web data, books, and Wikipedia.
If you'd like, I can find more details on specific architectural differences between their early open models and the current GPT-5 series, or explore the legal history regarding their transition from non-profit to for-profit.
The authors were listed as "equal contributors" on the paper. Jakob Uszkoreit suggested the name "Transformer".
Ashish Vaswani: Designed and implemented the first models.
Noam Shazeer: Proposed multi-head attention and scaled dot-product attention.
Niki Parmar: Focused on model variants and tuning the codebase.
Jakob Uszkoreit: Proposed replacing RNNs with self-attention.
Llion Jones: Responsible for the initial codebase and inference.
Aidan N. Gomez: Co-implemented the tensor2tensor library.
Łukasz Kaiser: Co-implemented tensor2tensor and improved results.
Illia Polosukhin: Worked with Vaswani on the initial design and implementation.
Why It Was Revolutionary
Before the Transformer, AI models like LSTMs processed text sequentially, which was slow. The Transformer allowed for:
Parallel Processing: Reading entire sentences at once instead of one word at a time.
Self-Attention: Allowing each word in a sentence to "look" at every other word to judge which words matter most for its meaning.
Scalability: This efficiency made it possible to train the massive models, like those behind ChatGPT, that exist today.
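The self-attention step above can be sketched with the paper's scaled dot-product formula, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. The dimensions below are toy values chosen for illustration:

```python
import numpy as np

# Scaled dot-product self-attention from "Attention Is All You Need".
# Every word (row of X) attends to every other word, all in parallel.

rng = np.random.default_rng(1)
seq_len, d_k = 5, 16                      # 5 "words", 16-dim queries/keys

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(d_k)       # all pairwise word-word scores at once
    weights = softmax(scores, axis=-1)    # each row: how much a word "looks" at the others
    return weights @ V, weights

X = rng.normal(size=(seq_len, d_k))       # stand-in for embedded words
out, attn = self_attention(X, X, X)
print(attn.shape)                         # (5, 5): every word vs. every other word
```

Note that the whole `(5, 5)` score matrix is computed in one matrix multiply, which is exactly the parallelism an RNN cannot offer.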
Where Are They Now?
As of 2024, all eight authors have left Google. Most have founded or joined high-profile AI startups, including:
Noam Shazeer: Co-founded Character.ai (later rejoined Google).
Aidan Gomez: Co-founded Cohere.
Illia Polosukhin: Co-founded NEAR Protocol.
Ashish Vaswani & Niki Parmar: Co-founded Essential AI.
The paper positions the Transformer as a direct alternative to the then-dominant Recurrent (RNN) and Convolutional (CNN) models.
Sequence-to-Sequence (Seq2Seq): Cites the Sutskever et al. (2014) paper, which established the encoder-decoder approach of mapping a whole input sequence to a whole output sequence.
ConvS2S: Cites Gehring et al. (2017), which used CNNs for translation, proving that recurrence was not strictly necessary for strong results.
ByteNet: Cites Kalchbrenner et al. (2016) as another parallelizable alternative to RNNs.
🧠 Early Attention Mechanisms
OpenAI and Google's later success relies heavily on the "Attention" concept, which the authors cite from earlier work:
Additive Attention: Cites the landmark Bahdanau et al. (2014) paper, which first introduced the attention mechanism for neural machine translation.
Self-Attention origins: Mentions work by Parikh et al. (2016) and Lin et al. (2017) regarding attention applied within a single sequence (intra-attention).
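The contrast between these two attention flavors can be shown side by side. Additive attention scores a pair of vectors with a small feed-forward network, while the Transformer's scaled dot product is just a (cheaper) inner product; all matrices below are random stand-ins:

```python
import numpy as np

# Additive ("Bahdanau") attention score, as cited from Bahdanau et al. (2014):
#   score(s, h) = v^T tanh(W1 @ s + W2 @ h)
# versus the Transformer's scaled dot product:
#   score(q, k) = (q . k) / sqrt(d_k)
# Toy dimensions throughout; weights are random for illustration only.

rng = np.random.default_rng(3)
d = 8
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)

def additive_score(s, h):
    return v @ np.tanh(W1 @ s + W2 @ h)   # a learned one-hidden-layer network

def dot_product_score(q, k):
    return (q @ k) / np.sqrt(d)           # no extra parameters at all

s, h = rng.normal(size=d), rng.normal(size=d)
print(additive_score(s, h), dot_product_score(s, h))
```

The dot-product form is what makes the Transformer's attention a pure matrix multiply, and hence highly optimizable on GPUs.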
⚙️ Technical Optimization
The Transformer wouldn't work without specific "tricks" to keep training stable, which are also sourced:
Adam Optimizer: Cites Kingma & Ba (2014) for the optimization algorithm used to train the network's weights.
Layer Normalization: Cites Ba et al. (2016), a critical component for stabilizing deep networks during training.
Residual Connections: Cites He et al. (2016) (the ResNet paper), which allows the model to stack many layers without gradients vanishing.
Label Smoothing: Cites Szegedy et al. (2016) as a regularization technique used during training to prevent overconfident predictions.
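Two of these tricks combine in the Transformer's sub-layer pattern, LayerNorm(x + Sublayer(x)): a residual connection followed by layer normalization. A minimal sketch, with a random linear map standing in for the attention or feed-forward sub-layer:

```python
import numpy as np

# The Transformer wraps every sub-layer as LayerNorm(x + Sublayer(x)):
# a residual connection (He et al. 2016) plus layer normalization
# (Ba et al. 2016). Toy sizes; the sub-layer here is just a linear map.

rng = np.random.default_rng(2)
d_model = 8
W = rng.normal(size=(d_model, d_model))

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def sublayer(x):
    return x @ W                          # stand-in for attention / feed-forward

def transformer_sublayer(x):
    return layer_norm(x + sublayer(x))    # residual add, then normalize

x = rng.normal(size=d_model)
y = transformer_sublayer(x)
print(y.mean(), y.var())                  # mean ≈ 0, variance ≈ 1 after normalization
```

The residual path lets the original signal `x` skip past the sub-layer, and the normalization keeps activations on a stable scale, which is what lets dozens of such blocks be stacked.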
📊 Evaluation Metrics
BLEU Score: Cites Papineni et al. (2002), the standard metric the authors used to prove the Transformer outperformed earlier translation systems.
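The core idea of BLEU can be shown with a deliberately simplified, 1-gram-only version. Real BLEU (Papineni et al. 2002) averages clipped 1- to 4-gram precisions; this sketch keeps only clipped unigram precision and the brevity penalty:

```python
from collections import Counter
import math

# BLEU-flavored score, simplified to clipped 1-gram precision plus the
# brevity penalty. Illustration only; real BLEU also uses 2- to 4-grams.

def simple_bleu1(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    cand_counts, ref_counts = Counter(cand), Counter(ref)
    # Clipped counts: a candidate word only scores up to its count in the
    # reference, so repeating "the" ten times cannot inflate the score.
    overlap = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    precision = overlap / len(cand)
    # Brevity penalty punishes translations shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

print(simple_bleu1("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```

An exact match scores 1.0; degenerate outputs like "the the the" are capped by the clipping rule.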
To better understand the "technical base" you mentioned earlier, would you like to see how OpenAI's GPT-1 specifically modified this architecture, or would you prefer a list of the training datasets referenced in the original paper?
While the Transformer revolutionized how computers process this chain, the breakthrough that enabled human-comparable answers by predicting the next word in a sequence is found in these specific citations from the paper:
The most direct ancestor of your "word row logic" is:
Sutskever et al. (2014): "Sequence to Sequence Learning with Neural Networks"
The Idea: This paper proved that a neural network could map an entire input sequence (a sentence in one language) to an entire output sequence (its translation).
The Breakthrough: It moved computers away from translating word-for-word and toward encoding the meaning of the whole sequence first.
2. The "Context" Breakthrough
Before the Transformer, computers often "forgot" the beginning of a long chain. This citation solved how to keep the logic consistent over long texts:
Bahdanau et al. (2014): "Neural Machine Translation by Jointly Learning to Align and Translate"
The Idea: This introduced Attention.
The Breakthrough: It allowed the computer to "look back" at specific parts of the word row while producing each output word.
3. The Logic of Character/Word Relationships
Bengio et al. (2003): "A Neural Probabilistic Language Model"
Note: While not the primary focus of the 2017 paper, it is the fundamental root of LLMs. It introduced the idea of predicting the next word from the previous words using learned word representations and a neural network.
💡 The Wording Shift
In technical terms, your "word row logic" is referred to as:
Autoregression: The model predicts the next token based on all previous tokens in the "row."
Context Window: How many words in the "row" the model can "see" at once to maintain its coherence.
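Autoregression as defined above is just a loop: predict the most likely next token given the row so far, append it, repeat. The tiny bigram "model" below is invented purely for illustration; a real LLM conditions on the whole context window, not just the last word:

```python
# Toy autoregressive generation. The bigram table is a hand-made stand-in
# for a real language model's next-token probabilities.

bigram_probs = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, "ran": 0.1},
    "sat": {"down": 1.0},
    "dog": {"ran": 1.0},
}

def generate(prompt, max_new_tokens=3):
    row = prompt.split()
    for _ in range(max_new_tokens):
        last = row[-1]                  # context (here: just one token)
        if last not in bigram_probs:
            break
        nxt = max(bigram_probs[last], key=bigram_probs[last].get)
        row.append(nxt)                 # the "row" grows one token at a time
    return " ".join(row)

print(generate("the"))  # the cat sat down
```

Greedy picking of the argmax, as here, is the simplest decoding strategy; real systems often sample from the distribution instead to get varied output.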
The "Attention Is All You Need" paper took these existing ideas of "word chains" and made them parallel. Instead of reading the chain link-by-link (slow), the Transformer looks at the entire chain simultaneously to see how every word relates to every other word.
To see how this evolved into OpenAI's specific tech, would you like to know how GPT-1 specifically applied this "word row logic" to unsupervised learning (reading the whole internet), or are you interested in the mathematical formula for how it predicts the next word?
