#22 Arxiv Weekly Insights

Can We Generate Visual Programs Without Prompting LLMs?

Welcome to the 22nd edition of "Arxiv Weekly Insights," where we delve into the latest groundbreaking research and developments from the arXiv repository.

This newsletter is brought to you by SmartXiv, the AI-powered personalized arXiv digest designed to enhance your research experience.

14-day FREE TRIAL.

Computer Vision and Pattern Recognition
SegFace: Face Segmentation of Long-Tail Classes
Kartik Narayan, Vibashan VS, Vishal M. Patel

SegFace proposes a lightweight transformer-based model to improve face segmentation, particularly for long-tail classes such as eyeglasses, hats, and earrings. The model uses learnable class-specific tokens to focus on each class independently, achieving a mean F1 score of 88.96 on the CelebAMask-HQ dataset and 93.03 on the LaPa dataset.
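To make the class-token idea concrete, here is a minimal sketch, not the SegFace implementation: one learnable token per face class cross-attends to flattened image features, and the attention map doubles as a per-class mask. All shapes and names are illustrative assumptions.

```python
import numpy as np

def class_token_segmentation(features, class_tokens):
    """Toy sketch of class-specific token attention (hypothetical shapes).

    features:     (num_pixels, dim) flattened face-image features
    class_tokens: (num_classes, dim) one learnable token per class
                  (e.g. skin, hair, eyeglasses, hat, earring)

    Each token attends to the pixel features independently, so rare
    long-tail classes are not drowned out by frequent ones; here the
    attention weights stand in for per-class segmentation scores.
    """
    dim = features.shape[1]
    # Scaled dot-product attention: one query per class token.
    scores = class_tokens @ features.T / np.sqrt(dim)    # (classes, pixels)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)        # softmax over pixels
    return weights

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))    # 16 "pixels", 8-dim features
tokens = rng.normal(size=(5, 8))    # 5 face classes
masks = class_token_segmentation(feats, tokens)
print(masks.shape)                  # (5, 16): one score map per class
```

Because each class owns its token, gradients for an underrepresented class flow through its own query rather than a shared head, which is the intuition behind the long-tail gains.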

Computer Vision and Pattern Recognition
StreamChat: Chatting with Streaming Video
Jihao Liu, Zhiding Yu, Shiyi Lan, Shihao Wang, Rongyao Fang, Jan Kautz, Hongsheng Li, Jose M. Alvarez

StreamChat enhances the interaction capabilities of Large Multimodal Models (LMMs) with streaming video by updating the visual context at each decoding step. This keeps the video content used for decoding up to date, improving performance on established benchmarks and in streaming interaction scenarios.
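The per-step context refresh can be sketched as follows; this is a toy illustration with hypothetical interfaces, not StreamChat's actual architecture. A conventional LMM would freeze the video context when the question arrives and decode the whole answer against it; here the newest frame is pulled in before every token.

```python
def stream_decode(frame_stream, decode_step, max_tokens=5):
    """Toy sketch of streaming decoding (all names hypothetical).

    frame_stream: iterator yielding frames as they arrive
    decode_step:  stand-in for the model; maps (visual context,
                  tokens so far) to the next answer token
    """
    context, answer = [], []
    for _ in range(max_tokens):
        context.append(next(frame_stream))        # latest frame joins the context
        answer.append(decode_step(context, answer))  # token sees fresh video
    return answer

# Usage: a dummy "model" that just reports how many frames it saw per token.
frames = iter(range(100))
tokens = stream_decode(frames, lambda ctx, ans: f"t{len(ctx)}")
print(tokens)  # ['t1', 't2', 't3', 't4', 't5']
```

The point of the sketch is the interleaving: later tokens are conditioned on frames that did not exist when decoding began.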

Computation and Language
Fast Prompt Alignment for Text-to-Image Generation
Khalil Mrini, Hanlin Lu, Linjie Yang, Weilin Huang, Heng Wang

Fast Prompt Alignment (FPA) is a prompt optimization framework that enhances text-to-image alignment efficiency without iterative overhead. It uses large language models for single-iteration prompt paraphrasing and demonstrates competitive performance on benchmarks like COCO Captions and PartiPrompts.

Computer Vision and Pattern Recognition
InvDiff: Invariant Guidance for Bias Mitigation in Diffusion Models
Min Hou, Yueying Wu, Chang Xu, Yu-Hao Huang, Chenxi Bai, Le Wu, Jiang Bian

InvDiff aims to mitigate bias in pre-trained diffusion models without relying on auxiliary bias annotations. It learns invariant semantic information for diffusion guidance, demonstrating significant improvements in bias reduction and image generation quality on public benchmarks.

Computer Vision and Pattern Recognition
Can We Generate Visual Programs Without Prompting LLMs?
Michal Shlapentokh-Rothman, Yu-Xiong Wang, Derek Hoiem

This paper explores a prompt-free approach to visual program generation using synthetic data augmentation. The method decouples programs into templates and arguments, achieving competitive performance with faster inference.
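The template/argument decoupling can be illustrated with a minimal sketch; the template DSL and helper names below are invented for illustration and are not the paper's actual program format. Instead of asking an LLM to write a full program per query, the query is mapped to a reusable template plus query-specific arguments, and filling the slots is cheap.

```python
def instantiate(template, arguments):
    """Toy sketch of template/argument decoupling (hypothetical DSL).

    A query like "how many dogs are there?" selects a counting template
    and supplies {'category': 'dog'}; no per-query LLM call is needed.
    """
    return template.format(**arguments)

# A hypothetical visual-program template for counting objects.
COUNT_TEMPLATE = (
    "boxes = detect(image, '{category}')\n"
    "answer = len(boxes)"
)

program = instantiate(COUNT_TEMPLATE, {"category": "dog"})
print(program)
# boxes = detect(image, 'dog')
# answer = len(boxes)
```

Because templates are finite and reusable, the per-query work reduces to classification plus argument extraction, which is where the faster inference comes from.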

Computer Vision and Pattern Recognition
UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics
Xi Chen, Zhifei Zhang, He Zhang, Yuqian Zhou, Soo Ye Kim, Qing Liu, Yijun Li, Jianming Zhang, Nanxuan Zhao, Yilin Wang, Hui Ding, Zhe Lin, Hengshuang Zhao

UniReal is a unified framework designed to address a range of image generation and editing tasks. Existing solutions often vary by task yet share fundamental principles: preserving consistency between inputs and outputs while capturing visual variation. Inspired by recent video generation models that balance consistency and variation across frames, UniReal treats image-level tasks as discontinuous video generation: varying numbers of input and output images are handled as frames, enabling seamless support for image generation, editing, customization, composition, and more. Although designed for image-level tasks, it leverages videos as a scalable source of universal supervision, learning real-world dynamics from large-scale video data. The model demonstrates advanced capability in handling shadows, reflections, pose variation, and object interaction, and exhibits emergent capability on novel applications.


Thank you for joining us this week. Stay tuned for more insights in our next edition. Until then, happy researching!