Author: Denis Avetisyan
Researchers have developed a modality-agnostic method that allows AI systems to perform complex visual reasoning without relying heavily on language as an intermediary.

Mull-Tokens leverage latent tokens to create modality-agnostic reasoning traces, improving performance and efficiency in multimodal learning tasks.
Effective reasoning extends beyond language, demanding the integration of spatial, temporal, and embodied knowledge, a challenge for current multimodal models, which often rely on cumbersome tool calls or handcrafted data. This paper introduces ‘Mull-Tokens: Modality-Agnostic Latent Thinking’, a novel approach utilizing pre-trained, modality-agnostic latent tokens to facilitate free-form, internal reasoning in both image and text. We demonstrate that Mull-Tokens improve performance across spatial reasoning benchmarks (achieving up to a 16% gain on puzzle-solving tasks) by abstractly representing information regardless of modality. Could this simple yet effective mechanism unlock more robust and scalable multimodal reasoning capabilities in large language models?
The Limits of Conventional Visual Reasoning
Conventional artificial intelligence systems often falter when confronted with tasks demanding nuanced visual comprehension and reasoning, a limitation particularly pronounced when dealing with spatial relationships. These systems frequently excel at recognizing objects within an image, but struggle to interpret how those objects relate to one another – determining if one is above, below, inside, or adjacent to another, for example. This difficulty stems from the fact that traditional AI relies heavily on pixel-based analysis, lacking the higher-level cognitive abilities humans employ to construct a three-dimensional understanding of a scene. Consequently, tasks that seem effortless for humans – such as assembling furniture from diagrams or navigating a room – present a significant hurdle for current AI technologies, highlighting a crucial gap in their ability to process visual information with the same flexibility and intuitiveness as biological vision.
Current artificial intelligence systems frequently falter when confronted with problems demanding the synthesis of visual perception and linguistic understanding. While adept at processing either modality in isolation, bridging the gap between “seeing” and “reading” remains a significant hurdle. These methods often treat visual and textual data as separate streams, hindering their ability to, for example, answer questions about an image requiring inferences based on both visual details and external knowledge. This limitation manifests in difficulties with tasks like visual question answering, image captioning with nuanced meaning, and even robotic navigation in complex environments where understanding both the visual layout and textual instructions is crucial. Consequently, achieving true artificial general intelligence necessitates developing architectures capable of seamlessly integrating and reasoning across these distinct, yet complementary, forms of information.

Introducing Mull-Tokens: A Foundation for Internalized Reasoning
Mull-Tokens are introduced as modality-agnostic latent tokens designed to function as an internal reasoning mechanism within multimodal language models. These tokens represent an intermediary computational space, allowing the model to perform iterative reasoning steps separate from direct input-output processing. Unlike methods tied to specific data types, Mull-Tokens operate independently of input modality – be it text, image, or audio – and facilitate a decoupled approach where intermediate thoughts are stored and manipulated as discrete or continuous embeddings. This internal ‘scratchpad’ enables the model to break down complex tasks into manageable steps, improving both the accuracy and interpretability of its reasoning process.
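A minimal sketch helps make the "internal scratchpad" idea concrete. The PyTorch snippet below appends a small bank of learned, modality-agnostic latent embeddings to an already-fused image-text sequence, giving the decoder dedicated positions to "think" in before emitting the answer. The class name, dimensions, and layout are illustrative assumptions for this example, not the paper's implementation.

```python
# Sketch: a bank of learned latent "mull" slots shared across modalities,
# inserted between the fused prompt and the answer. Illustrative only.
import torch
import torch.nn as nn


class LatentScratchpad(nn.Module):
    def __init__(self, hidden_dim: int, num_latent_tokens: int = 16):
        super().__init__()
        # One learned embedding per latent slot; independent of input modality.
        self.latent_tokens = nn.Parameter(
            torch.randn(num_latent_tokens, hidden_dim) * 0.02
        )

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: [batch, seq_len, hidden_dim], already-fused image/text embeddings.
        batch = input_embeds.size(0)
        latents = self.latent_tokens.unsqueeze(0).expand(batch, -1, -1)
        # Append latent slots after the prompt; the decoder attends to them
        # before generating the answer tokens.
        return torch.cat([input_embeds, latents], dim=1)


# Usage: wrap an existing backbone's embedding output before decoding.
pad = LatentScratchpad(hidden_dim=1024, num_latent_tokens=16)
fused = torch.randn(2, 77, 1024)   # stand-in for fused image+text embeddings
with_thoughts = pad(fused)         # shape: [2, 77 + 16, 1024]
print(with_thoughts.shape)
```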
Representing intermediate reasoning steps as discrete or continuous Mull-Tokens allows multimodal language models to move beyond direct input-output associations. This decoupling enables the model to maintain and refine an internal state representing its thought process, rather than relying solely on the initial input. Consequently, the model’s reasoning becomes more transparent, as these tokens act as an explicit record of intermediate conclusions. This internal representation also improves accuracy by facilitating iterative refinement; the model can revisit and correct earlier steps in its reasoning process, leading to more reliable outputs and reducing reliance on a single, potentially flawed, initial assessment.
Mull-Tokens are designed to accommodate both discrete and continuous embedding types, offering versatility in model architecture and data representation. Discrete embeddings, typically represented as one-hot vectors or learned token IDs, facilitate symbolic reasoning and explicit state tracking. Conversely, continuous embeddings, utilizing real-valued vectors, enable the representation of nuanced information and allow for gradient-based optimization during training. This dual support allows developers to select the embedding type best suited to the specific reasoning task and model constraints, or even combine both approaches to leverage their respective strengths. The ability to manipulate information in either discrete or continuous vector spaces enhances the model’s capacity for complex reasoning and facilitates integration with diverse multimodal inputs and outputs.
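To illustrate the two regimes, the sketch below contrasts a discrete latent step, which selects a "thought" from a learned codebook, with a continuous latent step, which emits a real-valued vector directly. The hidden size, codebook size, and class names are assumptions made for the example.

```python
# Sketch of discrete vs. continuous latent steps; names and sizes are assumed.
import torch
import torch.nn as nn

HIDDEN = 1024
LATENT_VOCAB = 256   # size of a discrete "thought" codebook (assumed)


class DiscreteLatentStep(nn.Module):
    """Pick the next latent token from a learned codebook (symbolic-style)."""
    def __init__(self):
        super().__init__()
        self.codebook = nn.Embedding(LATENT_VOCAB, HIDDEN)
        self.to_logits = nn.Linear(HIDDEN, LATENT_VOCAB)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        logits = self.to_logits(state)        # [batch, LATENT_VOCAB]
        token_id = logits.argmax(dim=-1)      # hard choice; breaks gradient flow
        return self.codebook(token_id)        # embedding of the chosen thought


class ContinuousLatentStep(nn.Module):
    """Emit the next latent thought directly as a real-valued vector."""
    def __init__(self):
        super().__init__()
        self.project = nn.Linear(HIDDEN, HIDDEN)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Fully differentiable, so it supports gradient-based optimization.
        return torch.tanh(self.project(state))


state = torch.randn(2, HIDDEN)
print(DiscreteLatentStep()(state).shape, ContinuousLatentStep()(state).shape)
```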

Empirical Validation: Mull-Tokens Across Diverse Benchmarks
Experiments utilizing the Qwen2.5-VL multimodal language model show performance improvements across multiple benchmarks. Specifically, the model achieved a +3% average accuracy gain when evaluated on the BLINK, SAT, and VSI-Bench datasets. These benchmarks cover a range of spatial reasoning challenges, from fine-grained visual perception (BLINK) to spatial aptitude questions (SAT) and visual-spatial scene understanding (VSI-Bench). The observed gains indicate Mull-Tokens contribute to enhanced performance on complex multimodal tasks as assessed by these standardized evaluation metrics.
To facilitate reasoning with Mull-Tokens, the Qwen2.5-VL model utilizes Interleaved Image-Text processing, where visual and textual inputs are processed in an alternating fashion to encourage cross-modal attention. This is coupled with Chain-of-Thought (CoT) prompting, a technique that encourages the model to generate intermediate reasoning steps before arriving at a final answer. By presenting both visual and textual information in an interleaved manner and explicitly prompting for a step-by-step rationale, the model is guided to more accurately leverage the information encoded within the Mull-Tokens and improve performance on complex reasoning tasks.
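For concreteness, the following sketch shows what interleaved image-text prompting with a chain-of-thought instruction looks like using the standard Hugging Face interface for Qwen2.5-VL. The image paths and the question are placeholders, and the snippet covers only the prompting setup, not the Mull-Token machinery itself.

```python
# Interleaved image-text prompt with a CoT instruction for Qwen2.5-VL,
# via transformers + qwen-vl-utils. Paths and question are placeholders.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Visual and textual inputs alternate within a single user turn; the final
# instruction explicitly asks for step-by-step reasoning.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/view_front.jpg"},
        {"type": "text", "text": "This is the front view of the scene."},
        {"type": "image", "image": "file:///path/to/view_side.jpg"},
        {"type": "text", "text": "This is the side view. Is the red box to the "
                                 "left of the chair? Think step by step, then "
                                 "give the final answer."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=256)
answer = processor.batch_decode(generated[:, inputs.input_ids.shape[1]:],
                                skip_special_tokens=True)[0]
print(answer)
```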
Evaluations demonstrate that the incorporation of Mull-Tokens results in performance improvements on tasks demanding spatial reasoning and video understanding capabilities. Specifically, gains were observed on the ERQA dataset and benchmarks designed to assess Video Spatial Reasoning. Quantitative analysis indicates a +16% accuracy increase on visual puzzle splits that heavily emphasize reasoning processes, suggesting Mull-Tokens are particularly effective when complex inferential steps are required to solve problems presented in visual formats.

Refining the Algorithmic Core: Gradient-Based Optimization of Reasoning
The reasoning process within the model is further refined through reinforcement learning with Group Relative Policy Optimization (GRPO). This technique directly optimizes the Mull-Token representations, tuning the internal thought process to better align with desired outcomes. Unlike traditional methods that rely on adjusting the model’s overall parameters, GRPO concentrates the learning signal on these intermediate reasoning tokens, rewarding latent trajectories that lead to correct answers. By feeding reward signals back through the reasoning chain, the model refines its internal strategies, improving performance on complex visual tasks while substantially reducing computational cost; the system achieves comparable or superior results with only 10-40 tokens, a marked improvement over the hundreds typically required by existing text-based Chain-of-Thought or image-representation techniques.
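The group-relative reward signal at the heart of GRPO can be sketched in a few lines. This assumes several reasoning traces are sampled per question and scored on final-answer correctness (an assumption made for the example); the clipping and KL-regularization terms of the full objective are omitted, and the function names are illustrative.

```python
# Simplified GRPO-style signal: score each sampled trace relative to its group,
# then apply a policy-gradient loss over the trace (latent tokens included).
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [num_questions, samples_per_question], e.g. answer correctness."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)   # each trace judged against its own group


def policy_gradient_loss(logprobs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """logprobs: summed log-probabilities of each sampled trace."""
    return -(advantages.detach() * logprobs).mean()


# Toy usage: 2 questions, 4 sampled reasoning traces each.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
logprobs = torch.randn(2, 4, requires_grad=True)
loss = policy_gradient_loss(logprobs, group_relative_advantages(rewards))
loss.backward()
print(loss.item())
```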
The refinement of internal reasoning strategies represents a significant advancement in the model’s ability to tackle complex visual tasks. Through optimized learning, the model doesn’t simply process visual information; it develops a more nuanced and effective approach to problem-solving. This is achieved by directly shaping the representations used during reasoning, allowing the system to prioritize relevant information and discard noise. Consequently, the model exhibits improved accuracy and efficiency when confronted with challenging visual problems, demonstrating a capacity to not only ‘see’ but also ‘understand’ the intricacies of the input. The impact extends beyond simple recognition, enabling more sophisticated interpretations and deductions from visual data.
A significant advancement lies in the model’s ability to separate the reasoning process from its core parameters. This decoupling fosters a more efficient learning dynamic, allowing the system to generalize solutions to novel situations without extensive retraining of the entire network. Consequently, the computational demands are substantially reduced; the model achieves comparable or superior performance on complex visual tasks utilizing only 10 to 40 tokens to represent its reasoning, a dramatic decrease from the hundreds of tokens previously required by conventional text-based Chain-of-Thought prompting or image representations. This streamlined approach not only accelerates learning but also makes the model more accessible for deployment in resource-constrained environments.

The pursuit of robust reasoning, as demonstrated by Mull-Tokens, echoes a fundamental tenet of algorithmic design: elegance through mathematical structure. This work transcends the limitations of simply ‘getting the right answer’ by focusing on the process of reasoning, manifested in the latent tokens that represent modality-agnostic thought. As Yann LeCun aptly states, “Everything we are doing in deep learning is about building function approximators.” Mull-Tokens don’t merely approximate a solution; they construct a traceable, interpretable reasoning pathway, a sequence of intermediate steps, irrespective of the input modality. This emphasis on internal consistency and demonstrability aligns with the principle that a correct algorithm should be verifiable, not just empirically successful, offering a pathway towards truly intelligent systems.
What Remains to be Proven?
The introduction of Mull-Tokens represents a pragmatic step, trading theoretical elegance for demonstrable gains in multimodal reasoning. However, the reliance on learned latent tokens, while empirically effective, raises the question of what is actually being abstracted. Are these tokens approaching a genuinely modality-agnostic representation of reasoning, or merely a sophisticated compression of correlational patterns within the training data? The current work establishes a performance benchmark, but does not offer a principled justification for the token space itself.
Future investigations must address the limits of this abstraction. Can Mull-Tokens generalize to reasoning tasks requiring compositional understanding – problems where the order of operations demonstrably alters the solution? Or will the system, like so many before it, remain tethered to surface-level heuristics? The true test will not be achieving high scores on existing datasets, but demonstrating robustness to adversarial examples designed to expose the underlying lack of logical grounding.
Ultimately, the field must resist the temptation to equate correlation with cognition. Mull-Tokens, and similar approaches, offer a powerful tool for simulating reasoning. But the pursuit of genuine artificial intelligence demands a commitment to formal verification and provable correctness – a standard to which empirical performance alone can never aspire.
Original article: https://arxiv.org/pdf/2512.10941.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/