https://openreview.net/forum?id=Kn4p6ebC-A&referrer=%5Bthe%20profile%20of%20Xiantong%20Zhen%5D(%2Fprofile%3Fid%3D~Xiantong_Zhen1)
Multimodal referring expression comprehension (REC), the task of localizing an image region described by a natural language expression, has recently...
commonsense knowledge, referring expression, ck, transformer, enhanced