TII Unveils Falcon Perception: A New Transformer for Image Processing
In the current landscape of computer vision, the standard approach involves a modular architecture where a pre-trained vision encoder is used for feature extraction, paired with a separate decoder for task prediction. While effective, this architectural separation complicates scaling and bottlenecks the interaction between language and vision. The Technology Innovation Institute (TII) research team challenges this paradigm with Falcon Perception, a unified transformer with 600 million parameters. By processing image patches and text tokens in a shared parameter space from the very first layer, the TII team has developed an early-fusion stack that handles perception and task modeling with remarkable efficiency.
The core design of Falcon Perception is based on the hypothesis that a single transformer can simultaneously learn visual representations and perform task-specific generation. Unlike standard language models that use strict causal masking, Falcon Perception employs a hybrid attention strategy. Image tokens attend to each other bidirectionally to build a global visual context, while text and task tokens attend to all preceding tokens to enable autoregressive prediction. To maintain 2D spatial relationships in a flattened sequence, the research team uses 3D Rotary Positional Embeddings, making the model robust to rotation and aspect ratio variations.
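The hybrid attention strategy described above can be sketched as a boolean mask: the image block is fully bidirectional, while text and task tokens attend causally to everything before them. This is an illustrative reconstruction, not TII's actual implementation; the function name and token layout are assumptions.

```python
import numpy as np

def hybrid_attention_mask(n_image: int, n_text: int) -> np.ndarray:
    """Build a hybrid attention mask (True = attention allowed).

    Image tokens attend to each other bidirectionally; text/task tokens
    attend causally to the image prefix and to earlier text tokens.
    Hypothetical sketch of the scheme described in the article.
    """
    n = n_image + n_text
    mask = np.zeros((n, n), dtype=bool)
    # Image block: full bidirectional attention among image patches.
    mask[:n_image, :n_image] = True
    # Text/task block: causal attention over all preceding tokens.
    for i in range(n_image, n):
        mask[i, : i + 1] = True
    return mask
```

With, say, 3 image tokens and 2 text tokens, an image token can attend forward to later image tokens, but a text token can never attend to a token that comes after it.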
The TII research team introduced several optimizations to stabilize training and maximize GPU utilization. The Muon optimizer yielded lower training losses and stronger benchmark performance than standard AdamW. Using a scatter-and-pack strategy, the model processes images at native resolutions without wasting compute on padding. When multiple objects are present, Falcon Perception predicts them in raster order, which was found to converge faster and produce lower coordinate loss than random or size-based orderings.
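Raster order simply means sorting objects top-to-bottom, then left-to-right, the way text is read. A minimal sketch, assuming boxes are given as (x, y, w, h) tuples with (x, y) the top-left corner (the exact box format is an assumption):

```python
def raster_order(boxes):
    """Sort bounding boxes in raster order: top-to-bottom, then
    left-to-right. Assumes (x, y, w, h) with (x, y) at the top-left;
    the format is hypothetical, for illustration only."""
    return sorted(boxes, key=lambda b: (b[1], b[0]))
```

A fixed, deterministic target ordering like this gives the autoregressive decoder a consistent supervision signal, which is consistent with the faster convergence the team reports versus random or size-based orderings.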
The model utilizes multi-teacher distillation for initialization, distilling knowledge from DINOv3 and SigLIP2. Following initialization, the model undergoes a three-stage perception training pipeline, including in-context listing, task alignment, and long-context finetuning. During these stages, task-specific serialization is used, forcing the model to commit to a binary decision on an object’s existence before localization.
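The existence-before-localization constraint can be pictured as a serialization scheme in which the model must emit a binary presence token before any coordinate tokens. The token names and layout below are hypothetical, chosen only to illustrate the idea:

```python
def serialize_target(label, box=None):
    """Serialize a detection target so the model commits to a binary
    existence decision before localization. Token vocabulary
    (<present>, <absent>, coordinate tokens) is a hypothetical
    illustration, not TII's actual scheme."""
    if box is None:
        # Object absent: the sequence ends after the existence decision.
        return [label, "<absent>"]
    x, y, w, h = box
    # Object present: coordinates follow only after the <present> token.
    return [label, "<present>", f"<x{x}>", f"<y{y}>", f"<w{w}>", f"<h{h}>"]
```

Forcing the presence token first means the coordinate loss is only ever applied after an explicit existence decision, which cleanly separates the "is it there?" and "where is it?" sub-tasks.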
To measure progress, the TII team introduced PBench, a benchmark that organizes samples into five levels of semantic complexity. Falcon Perception significantly outperforms SAM 3 on complex semantic tasks, particularly showing a +21.9 point gain on spatial understanding. The TII team also extended this architecture to FalconOCR, a compact model for glyph recognition that demonstrates competitive results against larger OCR systems.