https://j-min.io/publication/perceivervl_wacv2023/
Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention | Jaemin Cho
Sep 23, 2023 - Efficient VL modeling with Perceiver-based iterative cross-attentions - *WACV 2023*
https://j-min.io/publication/tvlt_neurips2022/
TVLT: Textless Vision-Language Transformer | Jaemin Cho
Feb 11, 2025 - Vision-and-Language modeling without text, by using a transformer which takes only raw visual and audio inputs - *[NeurIPS 2022](https://nips.cc/) (Oral)*
https://j-min.io/publication/vidlankd_neurips2021/
VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer | Jaemin Cho
Sep 2, 2023 - Video-based grounding can improve diverse NLU tasks - *[NeurIPS 2021](https://nips.cc/Conferences/2021)*
https://j-min.io/publication/video-skill-cot-findingsinemnlp2025/
Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning | Jaemin Cho
Sep 16, 2025 - A framework that automatically constructs and leverages skill-aware CoT supervision for domain-adaptive video reasoning - *Findings of EMNLP 2025*
https://j-min.io/publication/vl-t5_icml2021/
Unifying Vision-and-Language Tasks via Text Generation | Jaemin Cho
https://j-min.io/publication/clip-reward_findingsinnaacl2022/
Fine-grained Image Captioning with CLIP Reward | Jaemin Cho
Jan 12, 2025 - *[Findings of NAACL 2022](https://2022.naacl.org/)*
https://j-min.io/publication/x-lxmert_emnlp2020/
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers | Jaemin Cho
May 12, 2024 - Text-to-Image Generation via predicting vector-quantized image patches with multimodal LMs - *[EMNLP 2020](https://2020.emnlp.org/)*
https://j-min.io/publication/layoutbench_cvprw2024/
Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation | Jaemin Cho
Aug 11, 2024 - A new diagnostic benchmark (LayoutBench) and a new baseline model (IterInpaint) for layout-guided image generation - *CVPR 2024 Workshop*
https://j-min.io/publication/focus_emnlp2019/
Mixture Content Selection for Diverse Sequence Generation | Jaemin Cho
May 12, 2024 - Separate Diversification from Generation to improve both diversity and accuracy in sequence generation - *[EMNLP 2019](https://www.emnlp-ijcnlp2019.org/)*
https://j-min.io/publication/lst_neurips2022/
LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning | Jaemin Cho
Sep 2, 2023 - LST brings Memory efficiency into Parameter-efficient transfer learning - *[NeurIPS 2022](https://nips.cc/)*
https://j-min.io/publication/videodirectorgpt_colm2024/
VideoDirectorGPT: Consistent Multi-Scene Video Generation via LLM-Guided Planning | Jaemin Cho
Aug 6, 2024 - Using LLM (GPT-4) to generate a 'video plan' for consistent multi-scene video generation - *[COLM 2024](https://colmweb.org/)*
https://j-min.io/publication/vhcr_naacl2018/
A Hierarchical Latent Structure for Variational Conversation Modeling | Jaemin Cho
May 12, 2024 - Propose a hierarchical VAE model and utterance-drop regularization to mitigate the posterior collapse problem - *[NAACL 2018](http://naacl.org/naacl-hlt-2018/)*
https://j-min.io/publication/docci_eccv2024/
DOCCI: Descriptions of Connected and Contrasting Images | Jaemin Cho
Jan 28, 2025 - High-quality, long, human-annotated descriptions of 15K images - *[ECCV 2024](https://eccv.ecva.net/)*
https://j-min.io/publication/sevila_neurips2023/
Self-Chained Image-Language Model for Video Localization and Question Answering | Jaemin Cho
https://j-min.io/
Jaemin Cho
Jaemin Cho Academic website.
https://j-min.io/publication/hirest_cvpr2023/
Hierarchical Video-Moment Retrieval and Step-Captioning | Jaemin Cho
Sep 2, 2023 - HiREST is a holistic, hierarchical benchmark of multimodal retrieval and step-by-step summarization for a video corpus - *CVPR 2023*
https://j-min.io/publication/vp-t2i_neurips2023/
Visual Programming for Text-to-Image Generation and Evaluation | Jaemin Cho
May 12, 2024 - Interpretable/explainable visual programming frameworks for T2I generation (VPGen) and evaluation (VPEval) - *[NeurIPS 2023](https://nips.cc/Conferences/2023)*
https://j-min.io/publication/vl-adapter_cvpr2022/
VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks | Jaemin Cho
https://j-min.io/publication/paxion_neurips2023/
Paxion: Patching Action Knowledge in Video-Language Foundation Models | Jaemin Cho
Sep 22, 2023 - Analyzing and patching action knowledge in video-language models - *[NeurIPS 2023](https://nips.cc/Conferences/2023)* (Spotlight)
https://j-min.io/publication/envgen_colm2024/
EnvGen: Generating and Adapting Environments via LLMs for Training Embodied Agents | Jaemin Cho
Aug 6, 2024 - EnvGen is a novel framework that uses LLMs to adaptively create training environments to help smaller embodied RL agents learn useful skills that they are weak at - *[COLM 2024](https://colmweb.org/)*
https://j-min.io/publication/bifrost-1_2025/
Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents | Jaemin Cho
Sep 29, 2025 - A unified framework that bridges multimodal LLMs and diffusion models with patch-level CLIP latents