Abstract
LLaVA-Plus is a general-purpose multimodal assistant that expands the capabilities of large multimodal models. It maintains a skill repository of pre-trained vision and vision-language models and can activate relevant tools based on users' inputs to fulfill real-world tasks. LLaVA-Plus is trained on multimodal instruction-following data to acquire the ability to use tools, covering visual understanding, generation, external knowledge retrieval, and their compositions. Empirical results show that LLaVA-Plus outperforms LLaVA on existing capabilities and exhibits new ones. It is distinct in that the image query is directly grounded and actively engaged throughout entire human-AI interaction sessions, significantly improving tool-use performance and enabling new scenarios.
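To make the described mechanism concrete, here is a minimal sketch of the skill-repository-plus-dispatch loop the abstract outlines: a multimodal model selects a relevant tool from a repository of pre-trained models, runs it on the image that stays grounded in the session, and folds the result into its reply. All names here (`SKILLS`, `assistant_turn`, the tool keys) are hypothetical stand-ins, not the LLaVA-Plus codebase, and the tool-selection step is a placeholder for the model's learned behavior.

```python
from typing import Callable, Dict

# Hypothetical "skill repository": tool names mapped to wrappers around
# pre-trained vision / vision-language models (stubbed out here).
SKILLS: Dict[str, Callable[[str, str], str]] = {
    "detect": lambda image, query: f"[boxes for '{query}' in {image}]",
    "caption": lambda image, query: f"[caption of {image}]",
    "retrieve": lambda image, query: f"[external knowledge for '{query}']",
}

def assistant_turn(image: str, user_query: str) -> str:
    """One interaction turn: pick a tool, run it on the still-grounded
    image, and compose a response from the tool's output."""
    # Placeholder for the LMM's learned tool-selection step; the real
    # model emits the tool choice as part of its generated output.
    tool = "detect" if "where" in user_query.lower() else "caption"
    tool_output = SKILLS[tool](image, user_query)
    # Placeholder for the LMM folding the tool result into its answer.
    return f"Using '{tool}': {tool_output}"

print(assistant_turn("photo.jpg", "Where is the dog?"))
```

The point of the sketch is the control flow: because the image remains available across turns, every tool call can re-query it directly rather than relying on an earlier text description.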