
LanP: Rethinking the Impact of Language Priors in Large Vision-Language Models

Zongyu Wu1*, Yuwei Niu2*, Hongcheng Gao2, Minhua Lin1, Zhiwei Zhang1,
Zhifang Zhang2, Qi Shi3, Yilong Wang1, Sike Fu1, Junjie Xu1, Junjie Ao4,
Enyan Dai1, Lei Feng2, Xiang Zhang1, Suhang Wang1

1The Pennsylvania State University
2Singapore University of Technology and Design 3Peking University 4Rensselaer Polytechnic Institute

*Equal Contribution
†Corresponding Author
Contact: zongyuwu@psu.edu, niuyuwei04@gmail.com, szw494@psu.edu

An illustration of the different roles of language priors in LVLMs. The left half of the image shows an example where language priors have a negative impact, while the right half shows an example where they have a positive impact.

🔥News

[02-19-2025]: Our paper is available on arXiv!

Abstract

Large Vision-Language Models (LVLMs) have shown impressive performance on various tasks. However, LVLMs suffer from hallucination, which hinders their adoption in the real world. Existing studies have emphasized that strong language priors in LVLMs can overpower visual information and cause hallucinations. However, the positive role of language priors is also key to a powerful LVLM: if language priors are too weak, LVLMs struggle to leverage their rich parametric knowledge and instruction-understanding abilities to complete tasks in challenging visual scenarios where visual information alone is insufficient. Therefore, we propose a benchmark called LanP to rethink the impact of Language Priors in LVLMs. It is designed to investigate how strong the language priors of current LVLMs are. LanP consists of 170 images and 340 corresponding well-designed questions. Extensive experiments on 25 popular LVLMs reveal that the language priors of many LVLMs are not strong enough to effectively aid question answering when objects are partially hidden; many models, including GPT-4 Turbo, exhibit an accuracy below 0.5 in this scenario.

LanP Benchmark

Overview

Our goal is to assess the positive role of language priors in LVLMs. We therefore select questions about images in which certain objects are not clearly visible, so that it is hard for LVLMs to answer correctly from visual information alone. In these cases, language priors can aid understanding when the visual information is ambiguous.


In our dataset, each image provides enough background/environment information for the model, but certain objects remain unclear due to factors such as darkness or blurriness, making it difficult for the model to interpret the scene and output the correct answer. A model that better utilizes language priors to understand the overall context of the image can provide more accurate answers. We consider four types of scenarios: blur, environment, partially hidden, and tiny.
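
Concretely, evaluation follows a standard visual question answering protocol: each question is posed to the LVLM, its response is checked against the ground-truth answer, and accuracy is reported per scenario type. The sketch below illustrates one way such an evaluation loop could look; the file layout, field names, and the query_lvlm helper are hypothetical placeholders rather than the released LanP tooling.

    import json
    from collections import defaultdict

    # Hypothetical record layout; the released LanP files may differ:
    # {"image": "...", "question": "...", "answer": "...",
    #  "category": "blur" | "environment" | "partially hidden" | "tiny"}

    def evaluate(question_file, query_lvlm):
        """Compute per-category and overall accuracy for one LVLM.

        query_lvlm(image_path, question) -> str is a user-supplied function
        that returns the model's answer (a placeholder interface).
        """
        correct, total = defaultdict(int), defaultdict(int)
        with open(question_file) as f:
            questions = json.load(f)
        for q in questions:
            prediction = query_lvlm(q["image"], q["question"]).strip().lower()
            total[q["category"]] += 1
            # Simple exact-match scoring; the paper may use a more tolerant matcher.
            if prediction == q["answer"].strip().lower():
                correct[q["category"]] += 1
        per_category = {c: correct[c] / total[c] for c in total}
        overall = sum(correct.values()) / sum(total.values())
        return per_category, overall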

Experiment Results

Leaderboard

To fully evaluate the quality of LanP and understand the effects of language priors in visual question answering, we conduct comprehensive experiments on 25 representative closed-source and open-source LVLMs:

| Model | Env Acc | Partially Hidden Acc | Blur Acc | Tiny Acc | Overall Acc |
|---|---|---|---|---|---|
| GPT-4 Turbo | 0.7800 | 0.4500 | 0.7800 | 0.5333 | 0.6588 |
| GPT-4o | 0.9000 | **0.7750** | 0.9400 | 0.8333 | **0.8706** |
| GPT-4o mini | 0.8200 | 0.6250 | 0.8800 | 0.5333 | 0.7412 |
| Gemini 1.5 Pro | 0.9000 | 0.7250 | 0.8400 | 0.8333 | 0.8294 |
| Gemini 1.5 Flash | **0.9400** | 0.7000 | 0.9200 | 0.8333 | 0.8588 |
| InternVL2.5-26B | **0.9400** | 0.7500 | 0.8600 | **0.8667** | 0.8588 |
| InternVL2.5-8B | 0.9000 | 0.5500 | 0.9400 | 0.8333 | 0.8176 |
| InternVL2.5-4B | **0.9400** | 0.5500 | **0.9600** | **0.8667** | 0.8412 |
| InternVL2.5-2B | **0.9400** | 0.4500 | 0.8800 | 0.6667 | 0.7588 |
| InternVL2.5-1B | 0.8200 | 0.4000 | 0.8600 | 0.6667 | 0.7059 |
| InternVL2-26B | 0.9200 | 0.6250 | 0.8800 | 0.7667 | 0.8118 |
| InternVL2-8B | 0.7800 | 0.4500 | 0.8000 | 0.7000 | 0.6941 |
| InternVL2-4B | 0.8200 | 0.4500 | 0.7400 | 0.6667 | 0.6824 |
| InternVL2-2B | 0.7600 | 0.2500 | 0.7000 | 0.3333 | 0.5471 |
| InternVL2-1B | 0.7600 | 0.3500 | 0.8200 | 0.6000 | 0.6529 |
| Cambrian-13B | 0.8600 | 0.4750 | 0.7800 | 0.6333 | 0.7059 |
| Cambrian-8B | 0.9000 | 0.7000 | 0.8400 | 0.7667 | 0.8118 |
| Mini-InternVL-Chat-4B-V1-5 | 0.7800 | 0.4000 | 0.8000 | 0.7000 | 0.6824 |
| Mini-InternVL-Chat-2B-V1-5 | 0.8800 | 0.2000 | 0.8200 | 0.6000 | 0.6529 |
| ShareGPT4V-13B | 0.8400 | 0.2750 | 0.8600 | 0.6667 | 0.6824 |
| ShareGPT4V-7B | 0.8200 | 0.3250 | 0.5800 | 0.6333 | 0.6000 |

Overall results of different models on the LanP leaderboard. The best-performing result in each category is shown in bold.
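
For readers checking the arithmetic, the Overall Acc column is consistent with a question-count-weighted average of the four per-category accuracies. The snippet below reproduces the GPT-4 Turbo and GPT-4o rows under the assumption of category weights of 50 (Env), 40 (Partially Hidden), 50 (Blur), and 30 (Tiny); these weights are inferred from the reported numbers, not stated on this page.

    # Overall Acc as a weighted average of the four category accuracies.
    # The weights below are an inference that reproduces the reported
    # overall scores; they are not taken from the paper.
    WEIGHTS = {"env": 50, "partially_hidden": 40, "blur": 50, "tiny": 30}

    def overall_acc(env, ph, blur, tiny):
        weighted = (WEIGHTS["env"] * env + WEIGHTS["partially_hidden"] * ph
                    + WEIGHTS["blur"] * blur + WEIGHTS["tiny"] * tiny)
        return weighted / sum(WEIGHTS.values())  # total weight = 170

    print(round(overall_acc(0.7800, 0.4500, 0.7800, 0.5333), 4))  # 0.6588 (GPT-4 Turbo)
    print(round(overall_acc(0.9000, 0.7750, 0.9400, 0.8333), 4))  # 0.8706 (GPT-4o)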

Reference


    @article{wu2025lanp,
      title={LanP: Rethinking the Impact of Language Priors in Large Vision-Language Models},
      author={Zongyu Wu and Yuwei Niu and Hongcheng Gao and Minhua Lin and Zhiwei Zhang and Zhifang Zhang and Qi Shi and Yilong Wang and Sike Fu and Junjie Xu and Junjie Ao and Enyan Dai and Lei Feng and Xiang Zhang and Suhang Wang},
      journal={arXiv preprint arXiv:2502.12359},
      year={2025}
    }