#4 Arxiv Weekly Insights

Welcome to the fourth edition of "Arxiv Weekly Insights," where we delve into the latest research and developments from the arXiv repository.

This newsletter is brought to you by SmartXiv, the AI-powered personalized arXiv digest designed to enhance your research experience. With over 1000 research papers uploaded daily on arXiv, it's easy to miss important updates. Let SmartXiv deliver personalized recommendations so you never miss what truly matters to you.

Computer Vision and Pattern Recognition
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.

Computation and Language
Training LLMs to Recognize Hedges in Spontaneous Narratives
Amie J. Paige, Adil Soubki, John Murzaku, Owen Rambow, Susan E. Brennan

Hedges allow speakers to mark utterances as provisional, whether to signal non-prototypicality or 'fuzziness', to indicate a lack of commitment to an utterance, to attribute responsibility for a statement to someone else, to invite input from a partner, or to soften critical feedback in the service of face-management needs. Here we focus on hedges in an experimentally parameterized corpus of 63 Roadrunner cartoon narratives spontaneously produced from memory by 21 speakers for co-present addressees, transcribed to text (Galati and Brennan, 2010). We created a gold standard of hedges annotated by human coders (the Roadrunner-Hedge corpus) and compared three LLM-based approaches for hedge detection: fine-tuning BERT, and zero- and few-shot prompting with GPT-4o and LLaMA-3. The best-performing approach was a fine-tuned BERT model, followed by few-shot GPT-4o. After an error analysis on the top-performing approaches, we used an LLM-in-the-Loop approach to improve the gold standard coding, as well as to highlight cases in which hedges are ambiguous in linguistically interesting ways that will guide future research. This is the first step in our research program to train LLMs to interpret and generate collateral signals appropriately and meaningfully in conversation.

Computer Vision and Pattern Recognition
FMiFood: Multi-modal Contrastive Learning for Food Image Classification
Xinyue Pan, Jiangpeng He, Fengqing Zhu

This paper introduces FMiFood, a novel multi-modal contrastive learning framework that learns more discriminative features by integrating additional contextual information, such as text descriptions of food categories, to improve classification accuracy. The framework proposes a flexible matching technique that improves the similarity matching between text and image embeddings so that the model can focus on multiple pieces of key information.
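
For readers new to image-text contrastive learning, here is a minimal sketch of the generic symmetric contrastive objective (in the style popularized by CLIP) that frameworks of this kind typically build on. It is not FMiFood's flexible matching technique, which the summary above does not spell out. The function names, the toy embeddings, and the temperature of 0.07 are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_entropy(logits, target):
    """Negative log-softmax of logits[target], computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss.

    Assumes image_embs[k] and text_embs[k] form the k-th matching pair
    (illustrative setup, not the paper's exact formulation).
    """
    n = len(image_embs)
    # Temperature-scaled similarity matrix: rows are images, columns are texts.
    sim = [[cosine(i, t) / temperature for t in text_embs] for i in image_embs]
    image_to_text = sum(cross_entropy(sim[k], k) for k in range(n)) / n
    text_to_image = sum(
        cross_entropy([sim[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2
```

The loss is small when each image embedding is closest to its own text embedding and large when the pairs are mismatched, which is the pressure that pulls matching image and text features together.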

Computation and Language
Decoding Biases: Automated Methods and LLM Judges for Gender Bias Detection in Language Models
Shachi H Kumar, Saurav Sahay, Sahisnu Mazumder, Eda Okur, Ramesh Manuvinakurike, Nicole Beckage, Hsuan Su, Hung-yi Lee, Lama Nachman

This paper presents methods to automatically create adversarial prompts to elicit biased responses from target LLMs. The paper also analyzes several existing automatic evaluation methods and metrics, and compares them to human evaluation. The results show that the LLM-as-a-Judge metric aligns with human judgement on bias in response generation.
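
The summary does not say how alignment with human judgement was measured. One common way to quantify agreement between two raters is Cohen's kappa, sketched below under that assumption. The label names and the function name are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    n = len(rater_a)
    # Fraction of items on which the two raters give the same label.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

For example, comparing hypothetical human labels `["biased", "biased", "neutral", "neutral"]` with judge labels `["biased", "neutral", "neutral", "neutral"]` gives a kappa of 0.5: the raters agree on three of four items, but half of that agreement is expected by chance.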

Thunderbolt: Causal Concurrent Consensus and Execution
Junchao Chen, Alberto Sonnino, Lefteris Kokoris-Kogias, Mohammad Sadoghi

This paper introduces Thunderbolt, a novel architecture built on DAG-based protocols that aims to provide scalable, concurrent execution of smart contract transactions. Inspired by Hyperledger, Thunderbolt extends the Execute-Order-Validate architecture: transactions are distributed to distinct replicas, and execution outcomes are determined before ordering through the DAG-based protocol. Existing protocols execute transactions serially after ordering to avoid non-determinism. Thunderbolt instead provides parallel pre-execution before ordering, as well as parallel verification once any source of non-determinism is removed.
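
To make the Execute-Order-Validate idea concrete, here is a minimal single-process sketch of the general pattern: transactions are pre-executed against a snapshot, then ordered, then validated by checking that nothing they read has changed. It is a generic illustration, not Thunderbolt's protocol. The DAG-based ordering is replaced by a plain list, and every name below (`PreExecuted`, `pre_execute`, `validate_and_commit`) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PreExecuted:
    tx_id: str
    read_versions: dict  # key -> version seen during pre-execution
    writes: dict         # key -> value the transaction wants to write


def pre_execute(tx_id, fn, state, versions):
    """Run a transaction against current state without mutating it."""
    reads = {}

    def read(key):
        # Record which version of the key this transaction depended on.
        reads[key] = versions.get(key, 0)
        return state.get(key, 0)

    writes = fn(read)
    return PreExecuted(tx_id, reads, writes)


def validate_and_commit(ordered, state, versions):
    """Apply transactions in the agreed order, skipping stale ones."""
    committed = []
    for tx in ordered:
        fresh = all(versions.get(k, 0) == v for k, v in tx.read_versions.items())
        if not fresh:
            continue  # a read key changed after pre-execution: reject
        for key, value in tx.writes.items():
            state[key] = value
            versions[key] = versions.get(key, 0) + 1
        committed.append(tx.tx_id)
    return committed
```

Because `pre_execute` never mutates shared state, independent transactions could in principle be pre-executed in parallel, with conflicts caught only at validation time. That is the property the summary contrasts with serial execution after ordering.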

Computer Vision and Pattern Recognition
How Well Can Vision Language Models See Image Details?
Chenhui Gou, Abdulwahab Felemban, Faizan Farooq Khan, Deyao Zhu, Jianfei Cai, Hamid Rezatofighi, Mohamed Elhoseiny

This paper explores the ability of Large Language Model-based Vision-Language Models (LLM-based VLMs) to perceive image details beyond the semantic level. The authors introduce a pixel value prediction (PVP) task and find that existing VLMs struggle to predict precise pixel values when only the connection module and LLM are fine-tuned. However, prediction precision improves significantly when the vision encoder is also adapted. The research shows that incorporating pixel value prediction as a VLM pre-training task, together with vision encoder adaptation, markedly boosts VLM performance on downstream image-language understanding tasks that require detailed image perception.
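
As a rough illustration of what a pixel value prediction example could look like, the sketch below builds a question-answer pair from a tiny in-memory image and scores a prediction by mean absolute error. The prompt wording, the coordinate convention, and the scoring metric are all assumptions made for illustration; the paper's actual task format may differ.

```python
import random


def make_pvp_sample(image, rng):
    """Build one hypothetical PVP example.

    image: list of rows, each row a list of (r, g, b) tuples.
    rng: a random.Random instance, passed in for reproducibility.
    """
    height, width = len(image), len(image[0])
    y, x = rng.randrange(height), rng.randrange(width)
    question = f"What is the RGB value of the pixel at (x={x}, y={y})?"
    answer = list(image[y][x])
    return question, (x, y), answer


def mean_abs_error(predicted, target):
    """Average per-channel absolute difference between two RGB triples."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)
```

A model's free-text answer would need to be parsed into three integers before scoring. That parsing step is omitted here.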

Thank you for joining us this week. Stay tuned for more insights in our next edition. Until then, happy researching!