https://openreview.net/forum?id=G3aXjVAJjU
Natural Language Inference Improves Compositionality in Vision-Language Models | OpenReview
Compositional reasoning in Vision-Language Models (VLMs) remains challenging as these models often struggle to relate objects, attributes, and spatial...
natural language inferencevision modelsimprovescompositionalityopenreview