Why Vision-Language Models Break: The Mechanistic Pathways Behind VLM Unreliability
A guide to the compositional gaps, attention bottlenecks, and hallucination horizons that determine how Vision-Language Models behave on out-of-distribution data.

Models trained to understand both images and text, often called Vision-Language Models (VLMs), are dazzling us with their ability to describe scenes, answer questions about visual content, and even generate captions that are remarkably nuanced. Yet, behind this impressive facade, a persistent problem lurks: unpredictable behavior when encountering data outside their training distribution. A VLM might flawlessly caption a familiar park scene but falter entirely when presented with a stylized, artistic rendering of the same park, or misinterpret a common object due to an unusual lighting condition. This isn’t just an academic curiosity; it’s a direct threat to deploying these systems in real-world applications where data variability is the norm, not the exception.
Recent research argues that this unreliability isn’t an accidental emergent property of complex neural networks. Instead, it stems from specific, identifiable mechanistic pathways within the VLM’s architecture. Understanding these pathways is paramount for building truly robust and trustworthy AI systems. This post unpacks these mechanisms, moving beyond surface-level observations to the underlying computational processes that govern VLM reliability.
A key vulnerability in VLMs lies in how they bridge the gap between visual perception and linguistic representation. We can conceptualize this as a “compositional gap.” Imagine asking a VLM to describe a “red car parked next to a blue bicycle.” A reliable VLM needs to:
1. Recognize each object independently (“car”, “bicycle”).
2. Bind the correct attribute to the correct object (“red” to the car, “blue” to the bicycle), rather than merely detecting that red and blue appear somewhere in the scene.
3. Capture the spatial relation between the objects (“parked next to”).
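To make this compositional target concrete, here is a minimal, purely illustrative sketch of the structured representation a reliable VLM would implicitly need to recover. The SceneObject and Relation names are hypothetical, invented for this sketch, and not part of any VLM API.

# Illustrative only: the structured scene a reliable VLM must implicitly recover.
# SceneObject and Relation are hypothetical names invented for this sketch.
from dataclasses import dataclass

@dataclass
class SceneObject:
    category: str  # e.g., "car"
    color: str     # attribute bound to this specific object, not to the scene

@dataclass
class Relation:
    predicate: str       # e.g., "parked next to"
    subject: SceneObject
    obj: SceneObject

car = SceneObject(category="car", color="red")
bicycle = SceneObject(category="bicycle", color="blue")
scene = Relation(predicate="parked next to", subject=car, obj=bicycle)

A bag-of-features model that merely detects {car, bicycle, red, blue} can produce the same caption on easy inputs while failing to maintain this binding when inputs shift, which is precisely the failure mode described next.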
The problem arises when the visual input deviates significantly from its training data. Consider a VLM trained on photorealistic images. If it encounters a cartoonish depiction of a red car, its visual encoder might struggle to extract the canonical features associated with “car.” Similarly, if the objects are rendered with unnatural colors or in improbable spatial arrangements (e.g., a bicycle floating above the car), the model’s internal representations might become unstable.
The research suggests that VLMs often develop specialized, brittle pathways for handling specific visual-linguistic correlations. For instance, a model might learn a strong association between a visual pattern strongly resembling a “dog” and the word “dog.” However, if the visual input is subtly altered – perhaps a drawing of a dog with unusually long ears, or a dog rendered in an abstract artistic style – the learned pathway might fail. The model doesn’t necessarily possess a generalizable concept of “dog” that can adapt to variations; rather, it has memorized a specific visual signature.
This is particularly problematic when dealing with out-of-distribution (OOD) data. If a VLM has exclusively seen cars in clear daylight, it might fail to identify a car partially obscured by fog, or one under the dim, yellowish glow of streetlights. The visual features it relies on for object detection and attribute recognition are no longer reliably present. The model then has to fall back on weaker, more general associations, or attempt to ‘reason’ with incomplete or misleading visual information. This can lead to outputs like: “A blurry red object near a two-wheeled vehicle” instead of “A red car next to a bicycle.”
This fragility is not confined to object recognition. It extends to abstract concepts and reasoning. If a VLM is trained on captions that describe actions in typical scenarios (e.g., “a person is eating an apple”), it might struggle when the visual scene depicts an unusual eating gesture or an unexpected object being consumed. The compositional logic that underpins its language generation breaks down because the visual grounding is unreliable.
The core of many VLMs involves cross-attention mechanisms. These allow the model to weigh the importance of different parts of the image when generating a specific word, and vice versa. Think of it as the model “looking” at the relevant part of the image for each word it outputs.
The problem is that these attention mechanisms can become “bottlenecks” themselves, particularly under OOD conditions. If the visual encoder produces noisy or ambiguous representations due to unusual input, the cross-attention layers may latch onto spurious correlations or fail to effectively retrieve relevant visual features.
Consider an example: a VLM is asked to describe an image containing a “person holding a book.”
# Hypothetical VLM attention mechanism snippet (conceptual)
import math
import torch

def calculate_attention(visual_features, text_embeddings):
    # Simplified scaled dot-product attention: each text token queries the visual features.
    d_model = visual_features.size(-1)
    attention_scores = torch.matmul(text_embeddings, visual_features.transpose(-1, -2)) / math.sqrt(d_model)
    attention_weights = torch.softmax(attention_scores, dim=-1)  # normalize over visual positions
    context_vector = torch.matmul(attention_weights, visual_features)  # (text_len, d_model)
    return context_vector, attention_weights
If the visual_features extracted from an abstract painting of a person holding something don’t strongly correlate with the semantic meaning of “person” or “holding” in the model’s learned space, the attention_weights might become diffuse or incorrectly focused. The model might then generate text that doesn’t accurately reflect the visual input. For instance, if the visual features are ambiguous, the model might over-emphasize a “background” element, leading to a caption like “A person next to a blurry shape.”
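One way to make “diffuse attention” measurable is to compute the normalized entropy of each row of attention_weights. This diagnostic is a sketch of my own rather than a technique from the research, and the eps constant and interpretation thresholds are assumptions.

import math
import torch

def attention_diffuseness(attention_weights, eps=1e-9):
    # attention_weights: (text_len, num_visual_positions); each row sums to 1.
    entropy = -(attention_weights * (attention_weights + eps).log()).sum(dim=-1)
    max_entropy = math.log(attention_weights.size(-1))
    return entropy / max_entropy  # ~1.0 = spread evenly (diffuse), ~0.0 = sharply focused

Scores near 1.0 for many text tokens would suggest the model is not locking onto any visual region, consistent with the OOD behavior described above.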
This issue is exacerbated by modal collapse, where the model’s representations for different modalities become too similar, or one modality starts to dominate the other. In a reliable VLM, the visual and textual modalities should maintain distinct but complementary representations, with attention acting as a flexible bridge. When the bridge is brittle, or the modalities bleed into each other inappropriately, descriptive accuracy suffers. The model might start treating visual cues as textual, or vice versa, leading to nonsensical outputs. For example, if a VLM is shown an image with a prominent pattern that resembles text (even if it’s not actual text, like a wood grain pattern), it might try to “read” this pattern, leading to garbled output.
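A crude way to probe for this kind of collapse, assuming the two modalities are projected into a shared embedding space of the same dimension, is to compare pooled representations across many inputs. This is an illustrative heuristic, not a method from the research.

import torch
import torch.nn.functional as F

def modality_similarity(visual_features, text_embeddings):
    # Mean-pool each modality to a single vector, then compare directions.
    v = F.normalize(visual_features.mean(dim=0), dim=-1)
    t = F.normalize(text_embeddings.mean(dim=0), dim=-1)
    return torch.dot(v, t)  # values near 1.0 across many inputs hint at collapse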
Furthermore, the order of information processing matters. Some VLMs process visual information first, then use it to condition text generation. Others employ more interleaved or iterative processes. When OOD visual data is presented, the early stages of processing can inject noise that propagates through the entire pipeline. If the initial visual feature extraction is flawed, subsequent attention mechanisms, no matter how sophisticated, will struggle to rectify the error. The “attention bottleneck” is thus not just about where attention is focused, but also about the quality of the information it is attending to.
Perhaps the most alarming manifestation of VLM unreliability is hallucination. This occurs when the model generates information that is not present in the visual input, or contradicts it, yet does so with high confidence. This isn’t simply making a mistake; it’s fabricating content.
The mechanistic basis for hallucination, according to this research, is tightly linked to the previously discussed compositional gaps and attention bottlenecks. When the model’s internal representations become uncertain due to OOD data, its learned generative pathways can still produce fluent text. However, the semantic grounding for this text is weak or absent. The model effectively “fills in the blanks” with plausible, but incorrect, information based on its training data’s statistical regularities, rather than actual visual evidence.
Imagine a VLM encountering an image of a person wearing a hat but no discernible glasses. If the training data frequently pairs “person wearing a hat” with “person wearing sunglasses,” the model might hallucinate the presence of sunglasses. The visual encoder failed to provide clear evidence of “no glasses,” and the language generation module, relying on its strong learned associations, invented them.
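The sunglasses example can be caricatured in a few lines. The token_score function and the alpha weighting below are toys invented for illustration, not the decoding rule of any real VLM, but they show how a strong co-occurrence prior can outvote weak visual evidence.

def token_score(visual_evidence, language_prior, alpha=0.5):
    # alpha trades off visual evidence against the learned co-occurrence prior.
    return alpha * visual_evidence + (1 - alpha) * language_prior

# "sunglasses": almost no visual support, but a strong prior from training
# captions that pair hats with sunglasses.
print(token_score(visual_evidence=0.05, language_prior=0.90))  # 0.475 -- hallucinated
print(token_score(visual_evidence=0.05, language_prior=0.10))  # 0.075 -- suppressed

When the prior term dominates, the model emits “sunglasses” fluently and confidently despite the missing visual evidence.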
The “hallucination horizon” refers to the boundary beyond which the model’s generated output is likely to become unreliable. This horizon is not static; it shifts based on the degree of OOD variation. Subtle deviations might lead to minor inaccuracies, while radical deviations can trigger complete fabrication.
This phenomenon is particularly concerning because humans often interpret fluent and confident language as evidence of accuracy. A VLM stating, “The dog is wearing a red collar,” with high predictive probability, is difficult to dispute without careful visual inspection. However, if the dog in the image is actually bare-necked, the model has crossed its hallucination horizon.
This mechanistic understanding highlights critical trade-offs:
1. Specialization versus generalization: the tightly tuned visual-linguistic pathways that make a VLM accurate on in-distribution data are the same brittle pathways that fail under distribution shift.
2. Fluency versus grounding: the generative machinery that produces confident, fluent text keeps running even when the semantic grounding behind it is weak or absent, which is exactly what makes hallucinations so persuasive.
The core argument of this research is that reliability in VLMs is not a happy accident of scale or architecture. It is a consequence of how effectively the model’s internal mechanisms can maintain robust, grounded representations across diverse inputs. By understanding these pathways – the compositional gaps, the attention bottlenecks, and the resulting hallucination horizons – we can begin to engineer VLMs that are not just capable, but also dependable.