arXiv Digest: Must-Read Papers This Week (#20)

Papers on Artificial Intelligence, Machine Learning, Data Structures and Algorithms, Computer Vision and Pattern Recognition

Welcome to the 20th edition of "Arxiv Weekly Insights," where we highlight notable new research from the arXiv repository.

This newsletter is brought to you by SmartXiv, the AI-powered personalized arXiv digest designed to enhance your research experience.


Artificial Intelligence
Cross-modal Information Flow in Multimodal Large Language Models
Zhi Zhang, Srishti Yadav, Fengze Han, Ekaterina Shutova

The recent advancements in auto-regressive multimodal large language models (MLLMs) have demonstrated promising progress for vision-language tasks. While there exists a variety of studies investigating the processing of linguistic information within large language models, little is currently known about the inner working mechanism of MLLMs and how linguistic and visual information interact within these models. In this study, we aim to fill this gap by examining the information flow between different modalities -- language and vision -- in MLLMs, focusing on visual question answering. Specifically, given an image-question pair as input, we investigate where in the model and how the visual and linguistic information are combined to generate the final prediction. Conducting experiments with a series of models from the LLaVA series, we find that there are two distinct stages in the process of integration of the two modalities. In the lower layers, the model first transfers the more general visual features of the whole image into the representations of (linguistic) question tokens. In the middle layers, it once again transfers visual information about specific objects relevant to the question to the respective token positions of the question. Finally, in the higher layers, the resulting multimodal representation is propagated to the last position of the input sequence for the final prediction. Overall, our findings provide a new and comprehensive perspective on the spatial and functional aspects of image and language processing in the MLLMs, thereby facilitating future research into multimodal information localization and editing.

Computer Vision and Pattern Recognition
Hierarchical Information Flow for Generalized Efficient Image Restoration
Yawei Li, Bin Ren, Jingyun Liang, Rakesh Ranjan, Mengyuan Liu, Nicu Sebe, Ming-Hsuan Yang, Luca Benini

While vision transformers show promise in numerous image restoration (IR) tasks, the challenge remains in efficiently generalizing and scaling up a model for multiple IR tasks. To strike a balance between efficiency and model capacity for a generalized transformer-based IR method, we propose a hierarchical information flow mechanism for image restoration, dubbed Hi-IR, which progressively propagates information among pixels in a bottom-up manner. Hi-IR constructs a hierarchical information tree representing the degraded image across three levels. Each level encapsulates different types of information, with higher levels encompassing broader objects and concepts and lower levels focusing on local details. Moreover, the hierarchical tree architecture removes long-range self-attention, improves the computational efficiency and memory utilization, thus preparing it for effective model scaling. Based on that, we explore model scaling to improve our method's capabilities, which is expected to positively impact IR in large-scale training settings. Extensive experimental results show that Hi-IR achieves state-of-the-art performance in seven common image restoration tasks, affirming its effectiveness and generalizability.

Data Structures and Algorithms
Parallel Token Swapping for Qubit Routing
Ishan Bansal, Oktay Günlük, Richard Shapley

In this paper we study a combinatorial reconfiguration problem that involves finding an optimal sequence of swaps to move an initial configuration of tokens that are placed on the vertices of a graph to a final desired one. This problem arises as a crucial step in reducing the depth of a quantum circuit when compiling a quantum algorithm. We provide the first known constant factor approximation algorithms for the parallel token swapping problem on graph topologies that are commonly found in modern quantum computers, including cycle graphs, subdivided star graphs, and grid graphs. We also study the so-called stretch factor of a natural lower bound to the problem, which has been shown to be useful when designing heuristics for the qubit routing problem. Finally, we study the colored version of this reconfiguration problem where some tokens share the same color and are considered indistinguishable.
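
For readers new to the problem, here is a minimal sketch of parallel token swapping on the simplest topology, a path graph. It uses the classic odd-even transposition sort, which is not the algorithm from the paper and is shown only to make the setting concrete. The function names and the convention that token i must end on vertex i are assumptions made for this illustration.

```python
def parallel_route_on_path(tokens):
    """Route tokens on a path so that token i ends on vertex i.

    tokens[v] is the token currently sitting on vertex v (a permutation of
    0..n-1). Each round is a set of vertex-disjoint swaps along path edges,
    so all swaps in one round could be executed in parallel.
    Illustration only: odd-even transposition sort, not the paper's method.
    """
    tokens = list(tokens)
    n = len(tokens)
    schedule = []
    for r in range(n):  # n rounds suffice for odd-even transposition sort
        swaps = []
        # Even rounds use edges (0,1), (2,3), ...; odd rounds (1,2), (3,4), ...
        for v in range(r % 2, n - 1, 2):
            if tokens[v] > tokens[v + 1]:
                tokens[v], tokens[v + 1] = tokens[v + 1], tokens[v]
                swaps.append((v, v + 1))
        if swaps:  # rounds with no swaps add no depth, so skip them
            schedule.append(swaps)
    return schedule, tokens


def distance_lower_bound(tokens):
    """A simple lower bound on the number of rounds on a path: a token moves
    at most one edge per round, so the farthest token needs this many rounds.
    Not necessarily the lower bound analyzed in the paper."""
    return max((abs(v - t) for v, t in enumerate(tokens)), default=0)


if __name__ == "__main__":
    schedule, final = parallel_route_on_path([3, 2, 1, 0])
    for i, swaps in enumerate(schedule):
        print(f"round {i}: swap edges {swaps}")
    print("final placement:", final)
    print("distance lower bound:", distance_lower_bound([3, 2, 1, 0]))
```

The number of rounds in the schedule corresponds to the depth that the routing step would add to a circuit. Comparing that depth with a lower bound is the kind of gap an approximation guarantee controls.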

Machine Learning
Proactive Gradient Conflict Mitigation in Multi-Task Learning: A Sparse Training Perspective
Zhi Zhang, Jiayi Shen, Congfeng Cao, Gaole Dai, Shiji Zhou, Qizhe Zhang, Shanghang Zhang, Ekaterina Shutova

Advancing towards generalist agents necessitates the concurrent processing of multiple tasks using a unified model, thereby underscoring the growing significance of simultaneous model training on multiple downstream tasks. A common issue in multi-task learning is the occurrence of gradient conflict, which leads to potential competition among different tasks during joint training. This competition often results in improvements in one task at the expense of deterioration in another. Although several optimization methods have been developed to address this issue by manipulating task gradients for better task balancing, they cannot decrease the incidence of gradient conflict. In this paper, we systematically investigate the occurrence of gradient conflict across different methods and propose a strategy to reduce such conflicts through sparse training (ST), wherein only a portion of the model's parameters are updated during training while keeping the rest unchanged. Our extensive experiments demonstrate that ST effectively mitigates conflicting gradients and leads to superior performance. Furthermore, ST can be easily integrated with gradient manipulation techniques, thus enhancing their effectiveness.
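
To make the terminology concrete, the sketch below shows one common way to detect a gradient conflict (a negative inner product between two task gradients) and what a sparse update looks like when only a masked subset of parameters is changed. The gradient values, the hand-picked mask, and all function names are hypothetical. The abstract does not say how the paper selects which parameters to train, so none of this should be read as the authors' procedure.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_conflict(grad_a, grad_b):
    """Two task gradients conflict when they point in opposing directions,
    that is, when their inner product is negative."""
    return dot(grad_a, grad_b) < 0


def apply_mask(grad, mask):
    """Zero out gradient entries for parameters that are kept frozen."""
    return [g if m else 0.0 for g, m in zip(grad, mask)]


def sparse_step(params, task_grads, mask, lr=0.1):
    """One SGD step on the summed task gradients, updating only the
    parameters selected by mask and leaving the rest unchanged."""
    total = [sum(gs) for gs in zip(*task_grads)]
    return [p - lr * g if m else p for p, g, m in zip(params, total, mask)]


if __name__ == "__main__":
    # Hypothetical gradients of two tasks with respect to three shared parameters.
    grad_a = [1.0, -2.0, 0.5]
    grad_b = [0.5, 2.0, 0.5]
    mask = [1, 0, 1]  # hand-picked for illustration: freeze the second parameter

    print("conflict on all parameters:", in_conflict(grad_a, grad_b))
    print("conflict on trained subset:",
          in_conflict(apply_mask(grad_a, mask), apply_mask(grad_b, mask)))
    print("updated parameters:",
          sparse_step([1.0, 1.0, 1.0], [grad_a, grad_b], mask))
```

In this toy case the two tasks disagree only on the frozen parameter, so the conflict disappears once the update is restricted to the trained subset. Whether that happens in practice depends on the mask and the tasks.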

Computer Vision and Pattern Recognition
Structured light with a million light planes per second
Dhawal Sirikonda, Praneeth Chakravarthula, Ioannis Gkioulekas, Adithya Pediredla

We introduce a structured light system that captures full-frame depth at rates of a thousand frames per second, four times faster than the previous state of the art. Our key innovation to this end is the design of an acousto-optic light scanning device that can scan light planes at rates up to two million planes per second. We combine this device with an event camera for structured light, using the sparse events triggered on the camera as we sweep a light plane on the scene for depth triangulation. In contrast to prior work, where light scanning is the bottleneck towards faster structured light operation, our light scanning device is three orders of magnitude faster than the event camera's full-frame bandwidth, thus allowing us to take full advantage of the event camera's fast operation. To surpass this bandwidth, we additionally demonstrate adaptive scanning of only regions of interest, at speeds an order of magnitude faster than the theoretical full-frame limit for event cameras.
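
The depth triangulation mentioned in the abstract comes down to intersecting a camera ray with the light plane that was active when the pixel fired. The following is a minimal geometric sketch, assuming a pinhole camera at the origin and a light plane given by a point and a normal in camera coordinates. The function names and example numbers are hypothetical, and calibration and the mapping from event timestamps to planes are omitted.

```python
def dot(u, v):
    """Inner product of two 3-vectors."""
    return sum(a * b for a, b in zip(u, v))


def triangulate(ray_dir, plane_point, plane_normal, eps=1e-12):
    """Intersect the camera ray x = t * ray_dir (camera at the origin)
    with the light plane {x : plane_normal . (x - plane_point) = 0}.

    Returns the 3D point on the ray that lies on the plane. Raises
    ValueError when the ray is (numerically) parallel to the plane.
    """
    denom = dot(plane_normal, ray_dir)
    if abs(denom) < eps:
        raise ValueError("ray is parallel to the light plane")
    t = dot(plane_normal, plane_point) / denom
    return tuple(t * c for c in ray_dir)


if __name__ == "__main__":
    # Hypothetical setup: the light plane passes through a projector located
    # 0.2 m to the right of the camera and is tilted toward the optical axis.
    plane_point = (0.2, 0.0, 0.0)
    plane_normal = (1.0, 0.0, 0.1)
    ray_dir = (0.0, 0.0, 1.0)  # ray through the central pixel
    x, y, z = triangulate(ray_dir, plane_point, plane_normal)
    print(f"3D point: ({x:.3f}, {y:.3f}, {z:.3f}), depth along z = {z:.3f} m")
```

Each event gives one such intersection, so a single sweep of the light plane across the scene yields one depth value per pixel that fired during the sweep.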


Thank you for joining us this week. Stay tuned for more insights in our next edition. Until then, happy researching!