
LLaVA-Phi: Qualitative Results - Take A Look At Its Remarkable Generalization Capabilities


Authors:

(1) Yichen Zhu, Midea Group;

(2) Minjie Zhu, Midea Group and East China Normal University;

(3) Ning Liu, Midea Group;

(4) Zhicai Ou, Midea Group;

(5) Xiaofeng Mou, Midea Group.

Abstract and 1. Introduction

2. Related Work

3. LLaVA-Phi and 3.1. Training

3.2. Qualitative Results

4. Experiments

5. Conclusion, Limitation, and Future Works and References

3.2. Qualitative Results

We present several examples that demonstrate the generalization capabilities of LLaVA-Phi, comparing its outputs with those of the LLaVA-1.5-13B model. In Figure 1, a meme is displayed, and we ask the vision-language assistant to explain why the meme is considered humorous. While LLaVA-1.5-13B provides a reasonable interpretation based on the image, LLaVA-Phi's response is more empathetic, highlighting the humor by associating the dog's "laid-back demeanor" with the "stress or fatigue" typically associated with a "new workweek".


In the second example, we instructed the model to generate Python code for converting an Excel table into a bar chart, as illustrated in Figure 2. LLaVA-1.5-13B generated a simplistic code snippet that only reads the table and prints it, failing to follow the instruction to create a plot. In contrast, LLaVA-Phi accurately comprehended the task, providing code that reads the table, adds a title and labels, and correctly plots the bar chart using matplotlib. We believe this enhanced code generation capability stems from Phi-2, which was pre-trained on a large corpus of code snippets and is primarily used for code generation.
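To make the task concrete, here is a minimal sketch of the kind of script the prompt asks for: read a table, add a title and axis labels, and draw a bar chart with matplotlib. It is not the output of either model. The helper names (`load_table`, `plot_bar_chart`), the column arguments, and the sample data are all hypothetical, and reading the spreadsheet is assumed to go through pandas with an Excel engine such as openpyxl installed.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt


def load_table(path, label_col, value_col):
    """Read two columns from an Excel sheet (hypothetical helper).

    Assumes pandas and an Excel engine such as openpyxl are installed.
    """
    import pandas as pd  # imported lazily so plotting works without pandas
    df = pd.read_excel(path)
    return df[label_col].tolist(), df[value_col].tolist()


def plot_bar_chart(labels, values, title, xlabel, ylabel):
    """Draw a bar chart with a title and axis labels, and return the Axes."""
    fig, ax = plt.subplots()
    ax.bar(labels, values)
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    fig.tight_layout()
    return ax


# Made-up data standing in for the spreadsheet contents in Figure 2.
labels = ["A", "B", "C"]
values = [3, 7, 5]
ax = plot_bar_chart(labels, values, "Example totals", "Category", "Value")
# To write the chart to disk: ax.figure.savefig("bar_chart.png")
```

With a real spreadsheet, the made-up lists would be replaced by a call such as `load_table("table.xlsx", "Category", "Value")`, where the file name and column names are placeholders for whatever the sheet actually contains.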


The third challenge involves solving a simple math problem, requiring the model to accurately recognize text through OCR and then perform the necessary mathematical computations, as shown in Figure 3. LLaVA-1.5-13B, while providing a step-by-step computation based on the image, incorrectly recognized the numbers and mathematical symbols. In contrast, our proposed LLaVA-Phi, without providing chain-of-thought reasoning, still produces the correct answer. Our quantitative results on ScienceQA further confirm that LLaVA-Phi excels in these types of question-answering tasks.


This paper is available on arXiv under the CC BY 4.0 DEED license.
