Recent advances in Large Language Models (LLMs) have driven significant progress in vision-language reasoning, understanding, and interaction. Modern frameworks achieve this by projecting visual signals into LLMs, enabling them to perceive the world visually across a wide array of scenarios in which visual encoding strategies play a crucial role. However, real-world images not only span a wide range of scenarios, they also vary significantly in resolution and aspect ratio, posing significant challenges across different domains and tasks. To handle this variance, most current models perceive images at a fixed low resolution, e.g., 224×224, and a fixed aspect ratio, i.e., 1:1. Although this compromise increases the generalizability of the model to real-world applications, it often blurs image contents significantly and causes severe shape distortion. The compromise particularly hurts the capabilities of large multi-modality models (LMMs) on fine-grained tasks such as optical character recognition and small object understanding. Furthermore, since the resolution and aspect ratio are pre-determined, the models can only make their best guesses about the blurred images, which leads to model hallucinations: textual responses that are not factually grounded in the images.
In this article, we will be talking about LLaVA-UHD, a novel approach that first takes the LLaVA-1.5 and GPT-4V frameworks as representative examples and attempts to expose the systematic flaws rooted in their visual encoding strategies. LLaVA-UHD, a multimodal model, is an attempt to address these challenges: it can perceive images in high resolution and in any aspect ratio. The framework is built around three key components. First, an image modularization strategy that divides native-resolution images into smaller, variable-sized slices for extensible and efficient encoding. Second, a compression module that further condenses the image tokens produced by the visual encoder. Finally, a spatial schema that organizes slice tokens for the large language model. Comprehensive experiments indicate that LLaVA-UHD outperforms state-of-the-art LMMs on 9 benchmarks. Furthermore, using only 94% of the inference computation, LLaVA-UHD supports images with 6 times larger resolution, i.e., 672×1088.
Vision-language reasoning, understanding, and interaction have made significant progress of late, largely driven by the recent rise of Large Language Models. In modern frameworks, this is accomplished by feeding visual signals into LLMs to make them capable of interpreting the real world visually, a diverse range of scenarios that rely on visual encoding strategies. The variety of scenarios reflects the broad coverage expected of these models across different domains and tasks, while the differences in resolution and aspect ratio reveal the large variation in real-world images, which is hard to handle. To tackle this variation, most existing models perceive images at a fixed low resolution (e.g., 224×224) and a fixed aspect ratio (1:1). While this compromise helps ensure generalizability to real-world applications, it often produces very blurry inputs and severe shape distortion. This reduces the capabilities of large multi-modality models (LMMs), especially on fine-grained tasks such as optical character recognition and small object understanding. Moreover, since the resolution and aspect ratio are pre-defined, the models can only guess at the blurred images, leading to model hallucination, i.e., generated textual responses that are not factually grounded in the images. So why don't benchmark LMMs perceive images in high resolutions and varied aspect ratios?
There are two major reasons why benchmark LMMs are unable to perceive images with high resolution and varied aspect ratios. First, since visual encoders are pre-trained at fixed resolutions, it is difficult for the model and encoder to deal with images of varying aspect ratios and resolutions, which significantly impacts the adaptability of the model. Second, encoding high-resolution images directly with vision transformers incurs a computational cost that grows quadratically with the number of visual tokens, and therefore rapidly with image size. On top of that, the large language model must then process this large number of visual tokens for high-resolution images, which further raises the computation cost and hurts the overall efficiency of the model. To counter these challenges, LLaVA-UHD, a large multimodal model that perceives images in high resolution and any aspect ratio, takes the LLaVA-1.5 and GPT-4V frameworks as representative examples and attempts to expose the systematic flaws rooted in their visual encoding strategies.
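To make the cost concrete, here is a small back-of-the-envelope sketch of how the visual token count grows with resolution, assuming a CLIP-style vision transformer with 14×14 pixel patches (the patch size is an assumption for illustration, not stated above). Since self-attention cost grows roughly quadratically with the token count, a 6× increase in tokens translates into roughly a 36× increase in attention compute.

```python
# Rough token-count arithmetic for a ViT with 14x14 pixel patches (assumed patch size).
# Self-attention cost scales roughly with the square of the token count.
PATCH = 14

def visual_tokens(width: int, height: int, patch: int = PATCH) -> int:
    """Number of patch tokens a plain ViT would produce for a width x height image."""
    return (width // patch) * (height // patch)

for w, h in [(224, 224), (336, 336), (672, 1008)]:
    n = visual_tokens(w, h)
    print(f"{w}x{h}: {n} tokens, attention cost ~ {n**2:,} pairwise interactions")
# 336x336  ->  576 tokens
# 672x1008 -> 3456 tokens (~6x more), i.e. roughly 36x the attention cost
```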
The above image shows the experimental results of GPT-4V in identifying the number of objects within an image. At its core, the LLaVA-UHD framework has three components. First, an image modularization strategy that divides native-resolution images into smaller, variable-sized slices for extensible and efficient encoding. Contrary to recent LMMs that fit images into a few fixed resolutions and aspect ratios, the variable-sized slices generated by LLaVA-UHD enable full adaptivity to native-resolution images without shape-distorting resizing or padding. Second, the model condenses the visual tokens through a compression layer to a modest length, significantly reducing the computation for the LLM. Finally, the model organizes the compressed slice tokens in a spatial schema that informs the large language model of the slice positions within the image.
LLaVA-UHD: Methodology and Architecture
Based on the learnings from pilot experiments studying existing frameworks such as GPT-4V and LLaVA-1.5, the LLaVA-UHD framework implements a three-component architecture, as demonstrated in the following image.
First, an image modularization strategy that divides native-resolution images into smaller, variable-sized slices for extensible and efficient encoding. Second, a compression module that further condenses the image tokens produced by the visual encoder. Finally, a spatial schema that organizes slice tokens for the large language model. Let's take a detailed look at each of these components.
Modularized Visual Encoding
A common approach for dealing with high-resolution images of different aspect ratios is to interpolate the position embeddings of the Vision Transformer (ViT) to the target shape and encode the whole image directly. However, this approach comes with high computation costs, and the resulting out-of-distribution inputs cause further performance degradation. To tackle this challenge, the LLaVA-UHD framework presents a modularized visual encoding strategy that divides native-resolution images into smaller, variable-sized slices, where the shape of each slice stays close to the standard pre-training setting of the vision transformer. Owing to the use of variable-sized slices, LLaVA-UHD achieves full adaptability to native-resolution images without any shape-distorting resizing or padding. The primary goal of the slicing strategy is to determine a split of the high-resolution image that changes the resolution of each slice as little as possible. For a given image with resolution (w, h) and a vision transformer pre-trained at another resolution, the framework first determines the ideal computation, i.e., the number of slices required to process the image. It then factorizes the number of slices into m columns and n rows, and defines a score function that measures the deviation of each candidate partition from the standard pre-training setting of the vision transformer, selecting the best-scoring split. Theoretically, LLaVA-UHD is able to demonstrate that this partition strategy guarantees minor expected changes and modest worst-case changes with respect to the standard pre-training resolution for each slice. A minimal sketch of the idea is shown below.
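The following Python sketch illustrates the partition idea under stated assumptions: the ideal slice count comes from the ratio of the image area to the ViT pre-training area (assumed 336×336 here), candidate column×row factorizations are enumerated, and a simple log aspect-ratio deviation serves as the score. The exact score function and candidate set used by LLaVA-UHD may differ; this is an illustrative approximation, not the paper's implementation.

```python
import math

def choose_partition(W, H, Wv=336, Hv=336):
    """Pick a (columns, rows) split for a W x H image.

    Illustrative sketch: the ideal slice count is derived from the area ratio,
    then candidate factorizations are scored by how far each slice's aspect
    ratio drifts from the ViT's pre-training shape (Wv x Hv, assumed 336x336).
    """
    ideal = max(1, math.ceil((W * H) / (Wv * Hv)))  # ideal number of slices
    candidates = []
    for n_slices in {max(1, ideal - 1), ideal, ideal + 1}:
        for m in range(1, n_slices + 1):
            if n_slices % m == 0:
                candidates.append((m, n_slices // m))  # m columns, n rows

    def score(mn):
        m, n = mn
        # deviation of a slice's aspect ratio from the pre-training aspect ratio
        return abs(math.log((W / m) / (H / n)) - math.log(Wv / Hv))

    return min(candidates, key=score)

print(choose_partition(672, 1008))  # -> (2, 3): 2 columns x 3 rows of ~336x336 slices
```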
Furthermore, a majority of existing LMMs use a static resolution for image slice encoding, an approach that prevents full adaptability to native resolutions since the model only has access to a few predefined fixed-shape slices. Additionally, static slice resolution hurts the performance, efficiency, and correctness of the model, since it inevitably incurs shape-distorting resizing or padding. To tackle this issue, LLaVA-UHD encodes image slices at the aspect ratio defined by the partition strategy. Specifically, the framework first resizes the original image proportionally, in accordance with its aspect ratio, so that the number of patches fits maximally within the pre-training budget, i.e., the length of the position embedding sequence of the vision transformer. It then reshapes the pre-trained 1D position embedding sequence of the vision transformer into a 2D format matching its pre-training settings, so that it can be adapted to the patch grid of each slice, as sketched below.
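Below is a minimal sketch of adapting the ViT's 1D position embedding sequence to a slice's patch grid via 2D interpolation, assuming no class-token embedding and a 24×24 pre-training grid (336 / 14); the function name and interpolation settings are illustrative assumptions rather than LLaVA-UHD's exact code.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed_1d: torch.Tensor, grid_hw, new_hw) -> torch.Tensor:
    """Adapt a ViT's pre-trained 1D position embedding sequence to a slice's patch grid.

    Sketch under stated assumptions: reshape the (grid_h * grid_w, dim) sequence
    to 2D, bilinearly interpolate to the slice's (new_h, new_w) grid, flatten back.
    """
    grid_h, grid_w = grid_hw
    new_h, new_w = new_hw
    dim = pos_embed_1d.shape[-1]
    # (L, D) -> (1, D, grid_h, grid_w) so we can interpolate spatially
    pos_2d = pos_embed_1d.reshape(grid_h, grid_w, dim).permute(2, 0, 1).unsqueeze(0)
    pos_2d = F.interpolate(pos_2d, size=(new_h, new_w), mode="bilinear", align_corners=False)
    # back to a 1D sequence of (new_h * new_w, D)
    return pos_2d.squeeze(0).permute(1, 2, 0).reshape(new_h * new_w, dim)

# Example: a 24x24 pre-training grid adapted to a 24x36 slice grid.
pos = torch.randn(24 * 24, 1024)
print(interpolate_pos_embed(pos, (24, 24), (24, 36)).shape)  # torch.Size([864, 1024])
```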
Compression Layer
A common issue LMMs face when processing high-resolution images is that the number of visual tokens they have to process is significantly higher (for reference, the LLaVA-1.5 framework produces around 3,500 visual tokens when processing a single 672×1008 image), accounting for a major share of the computational resources and cost. To address this challenge, the LLaVA-UHD model implements a shared perceiver resampler layer to compress the visual tokens of each image slice: a set of learned query vectors resamples the image tokens output by the visual encoder to a lower number via cross-attention. Compared with the prevalent multilayer-perceptron-based visual projection strategies, the perceiver resampler approach maintains an affordable and fixed number of visual tokens irrespective of image resolution, making LLaVA-UHD more compatible with high-resolution image processing and understanding tasks. To put that into perspective, LLaVA-UHD generates the same number of tokens when encoding a 672×1008 image as LLaVA-1.5 generates when encoding a 336×336 image, despite covering roughly 6 times the resolution.
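The sketch below shows a resampler of this kind: a fixed set of learned queries cross-attends to a variable number of slice tokens and always returns the same number of outputs. The hyperparameters (64 queries, a single attention layer, 1024-dimensional tokens) are illustrative assumptions rather than LLaVA-UHD's actual configuration.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Minimal sketch of a shared resampler in the spirit of the compression layer
    described above: learned queries cross-attend to the (variable-length) visual
    tokens of a slice and return a fixed-length output."""

    def __init__(self, dim: int = 1024, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, n_tokens, dim), with n_tokens varying per slice
        q = self.queries.unsqueeze(0).expand(visual_tokens.size(0), -1, -1)
        out, _ = self.cross_attn(q, visual_tokens, visual_tokens)
        return self.norm(out)  # (batch, num_queries, dim), regardless of input length

resampler = PerceiverResampler()
slice_tokens = torch.randn(1, 864, 1024)   # e.g. a 24x36 patch grid
print(resampler(slice_tokens).shape)       # torch.Size([1, 64, 1024])
```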
Spatial Schema for Image Slices
It is necessary to inform the large language model of the spatial organization of the image slices, since the partition is dynamic and differs across images. The LLaVA-UHD framework therefore designs a spatial schema that uses two special tokens to inform the LLM of the relative positions of the image slices: slice representations within a row are separated by ",", and different rows are separated by "\n". A minimal sketch follows.
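Here is a small sketch of how such a schema lays out slice representations, with placeholder strings standing in for the actual compressed slice embeddings:

```python
def layout_slices(slice_tokens, n_cols):
    """Arrange slice representations in the spatial schema described above:
    slices in the same row are joined by "," and rows are joined by a newline
    token, so the LLM can recover the 2D arrangement from a flat sequence."""
    rows = [slice_tokens[i:i + n_cols] for i in range(0, len(slice_tokens), n_cols)]
    return "\n".join(",".join(row) for row in rows)

# A 2-column x 3-row partition of a 672x1008 image:
slices = [f"<slice_{i}>" for i in range(6)]
print(layout_slices(slices, n_cols=2))
# <slice_0>,<slice_1>
# <slice_2>,<slice_3>
# <slice_4>,<slice_5>
```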
LLaVA-UHD: Experiments and Results
The LLaVA-UHD framework is evaluated on 9 popular benchmarks, including general visual question answering benchmarks, optical-character-based visual question answering benchmarks, a hallucination benchmark, and comprehensive benchmarks. Furthermore, it is compared against strong baselines including LLaVA-1.5, MiniGPT-v2, InstructBLIP, BLIP-2, and more.
The performance of the LLaVA-UHD framework on the 9 benchmarks is summarized and compared against these baselines in the table below.
On the basis of these results, it can be concluded that LLaVA-UHD outperforms strong baseline models on popular benchmarks, including general baselines trained on significantly larger amounts of data, as well as LMMs that require significantly more computation, such as Fuyu-8B and Monkey. The results also show that LLaVA-UHD achieves significantly better results than the LLaVA-1.5 architecture: where LLaVA-1.5 supports a fixed 336×336 resolution, LLaVA-UHD supports 672×1088 images of any aspect ratio with the same number of visual tokens.
Final Thoughts
In this article we have talked about LLaVA-UHD, a novel approach that first takes the LLaVA-1.5 and GPT-4V frameworks as representative examples and attempts to expose the systematic flaws rooted in their visual encoding strategies. LLaVA-UHD, a multimodal model, is an attempt to address these challenges: it can perceive images in high resolution and in any aspect ratio. The framework is built around three key components: an image modularization strategy that divides native-resolution images into smaller, variable-sized slices for extensible and efficient encoding; a compression module that further condenses the image tokens produced by the visual encoder; and a spatial schema that organizes slice tokens for the large language model. Comprehensive experiments indicate that LLaVA-UHD outperforms state-of-the-art LMMs on 9 benchmarks. Furthermore, using only 94% of the inference computation, LLaVA-UHD supports images with 6 times larger resolution, i.e., 672×1088.