![](https://pitti-backend-assets.ams3.digitaloceanspaces.com/clip_as_rnn_teaser_9ff8c3943c.png?w=3840&q=75)
Abstract
Existing open-vocabulary image segmentation methods require a fine-tuning step on mask annotations and/or image-text datasets. Mask labels are labor-intensive, which limits the number of categories in segmentation datasets. As a result, the open-vocabulary capacity of pre-trained vision-language models (VLMs) is severely reduced after fine-tuning. However, without fine-tuning, VLMs trained under weak image-text supervision tend to make suboptimal mask predictions when the text queries refer to concepts that are absent from the image. To alleviate these issues, we introduce a novel recurrent framework that progressively filters out irrelevant texts and enhances mask quality without any training effort. The recurrent unit is a two-stage segmenter built upon a VLM with frozen weights. Thus, our model retains the VLM's broad vocabulary space while strengthening its segmentation capability. Experimental results show that our method outperforms not only its training-free counterparts but also those fine-tuned with millions of additional data samples, and it sets new state-of-the-art records for both zero-shot semantic and referring image segmentation tasks. Specifically, we improve the current record by 28.8, 16.0, and 6.9 mIoU on Pascal VOC, COCO Object, and Pascal Context, respectively.
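To make the recurrence concrete, below is a minimal, runnable sketch of the filter-and-repeat loop the abstract describes. All helper names (`propose_mask`, `mask_text_score`) and their dummy bodies are hypothetical stand-ins, not the authors' API: in the actual method, both stages of the segmenter are built on a frozen CLIP-style VLM. Only the control flow (segment, drop unsupported queries, re-segment until the query set stabilizes) reflects the idea being described.

```python
import numpy as np

# --- Hypothetical placeholders for the frozen VLM. The paper's real
# --- two-stage segmenter (mask proposal + mask-text scoring) is not shown.
def propose_mask(image: np.ndarray, query: str) -> np.ndarray:
    """Stage 1 stand-in: return a soft mask of shape (H, W) for one text query."""
    rng = np.random.default_rng(abs(hash(query)) % 2**32)
    return rng.random(image.shape[:2])  # dummy mask; a real model grounds the query

def mask_text_score(image: np.ndarray, mask: np.ndarray, query: str) -> float:
    """Stage 2 stand-in: score how well the masked region matches the query."""
    return float(mask.mean())  # dummy score; a real scorer queries the frozen VLM

def recurrent_segment(image, queries, threshold=0.5, max_steps=10):
    """Recurrent unit: segment with all queries, filter out queries the image
    does not support, and repeat until the query set reaches a fixed point."""
    queries = list(queries)
    masks = {}
    for _ in range(max_steps):
        # Propose one mask per remaining text query with the frozen VLM.
        masks = {q: propose_mask(image, q) for q in queries}
        # Keep only queries whose masked region the VLM scores as relevant.
        kept = [q for q in queries
                if mask_text_score(image, masks[q], q) >= threshold]
        if kept == queries:  # nothing was filtered out: converged
            break
        queries = kept
    return {q: masks[q] for q in queries}

if __name__ == "__main__":
    img = np.zeros((224, 224, 3))
    out = recurrent_segment(img, ["a dog", "a cat", "a unicorn"])
    print(sorted(out))  # queries that survived filtering, each with a mask
```

Because the VLM's weights stay frozen throughout, the loop adds no training cost; it only reuses the pre-trained model's image-text alignment to prune queries and refine masks.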