visual language models - Robuta Search

https://openreview.net/forum?id=tFhNhTGD6b&referrer=%5Bthe%20profile%20of%20Huizi%20Mao%5D(%2Fprofile%3Fid%3D~Huizi_Mao1)
VILA: On Pre-training for Visual Language Models | OpenReview
Visual language models (VLMs) rapidly progressed with the recent success of large language models. There have been growing efforts on visual instruction tuning...

https://openreview.net/forum?id=TjFn6xktTm&referrer=%5Bthe%20profile%20of%20Ziyan%20Wu%5D(%2Fprofile%3Fid%3D~Ziyan_Wu1)
Cross-Class Domain Adaptive Semantic Segmentation with Visual Language Models | OpenReview
This paper addresses the issue of cross-class domain adaptation (CCDA) in semantic segmentation, where the target domain contains both shared and novel classes...

https://openreview.net/forum?id=wCXAlfvCy6
LongVILA: Scaling Long-Context Visual Language Models for Long Videos | OpenReview
Long-context capability is critical for multi-modal foundation models, especially for long video understanding. We introduce LongVILA, a full-stack solution...

https://arxiv.org/abs/2405.20773
[2405.20773] Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via...
Abstract page for arXiv paper 2405.20773: Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character

https://openreview.net/forum?id=LjnDqVcrE9&referrer=%5Bthe%20profile%20of%20Jiale%20Li%5D(%2Fprofile%3Fid%3D~Jiale_Li5)
ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models | OpenReview
In this work, we propose a training-free method to inject visual prompts into Multimodal Large Language Models (MLLMs) through learnable latent variable...

https://research.google/blog/visual-captions-using-large-language-models-to-augment-video-conferences-with-dynamic-visuals/
Visual captions: Using large language models to augment video conferences with d
Posted by Ruofei Du, Research Scientist, and Alex Olwal, Senior Staff Research Scientist, Google Augmented Reality Recent advances in video confere...

https://openreview.net/forum?id=LwOfVWgEzS&referrer=%5Bthe%20profile%20of%20Chang%20Liu%5D(%2Fprofile%3Fid%3D~Chang_Liu17)
Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via...
Although pre-trained models such as Contrastive Language-Image Pre-Training (CLIP) show impressive generalization results, their robustness is still limited...

https://openreview.net/forum?id=jSxU7ZGe3B&referrer=%5Bthe%20profile%20of%20Eric%20Schulz%5D(%2Fprofile%3Fid%3D~Eric_Schulz1)
Testing the Limits of Fine-Tuning for Improving Visual Cognition in Vision Language Models |...
Pre-trained vision language models still fall short of human visual cognition. In an effort to improve visual cognition and align models with human behavior,...

https://huggingface.co/papers/2310.12973
Paper page - Frozen Transformers in Language Models Are Effective Visual Encoder Layers
Join the discussion on this paper page