A Step-by-Step Guide to Creating a Custom Vision-Language Dataset for Fine-Tuning Qwen-2-VL with LLaMA-Factory

[Ashok Poudel](https://ashokpoudel.medium.com/)

Fine-tuning large language models (LLMs) for specialized tasks often requires a well-curated dataset, especially when working with vision-language models like Qwen-2-VL. Qwen-2-VL is a powerful tool for tasks that involve understanding and interpreting both text and images, making it ideal for scenarios like document analysis, visual question answering (VQA), and more. However, creating a custom dataset tailored to the requirements of such models can be challenging.

In this article, I’ll walk you through the entire process of creating a vision-language dataset for fine-tuning Qwen-2-VL using LLaMA-Factory, an open-source library designed for training and fine-tuning models. We’ll cover everything from preparing the data to uploading it to Hugging Face and finally integrating it into a fine-tuning script.

Prerequisites

Before we dive in, make sure you have:

  • Basic knowledge of Python programming.
  • An OpenAI account and a Hugging Face account, along with API tokens for both.
  • A basic understanding of fine-tuning, LLaMA-Factory, and Qwen-2-VL. If you’re new to these tools, check out their respective documentation:
  • Qwen-2-VL Fine-tuning Script
  • LLaMA-Factory

Step 1: Setting Up Your Environment

Make sure to install the required libraries before starting:

pip install openai pillow pandas datasets huggingface_hub

Step 2: Creating and Structuring the Dataset

For this example, we are building a dataset for Document VQA, where each image represents a contract document, and the dataset contains question-answer pairs derived from these images.

Preparing Images

  1. Store all your contract images in a folder named images.
  2. Resize the images to ensure they are manageable for the model:

from PIL import Image
import os

def resize_image(image_path, max_size=1024):

    with Image.open(image_path) as img:
        aspect_ratio = img.width / img.height
        new_width = max_size if img.width > img.height else int(max_size * aspect_ratio)
        new_height = max_size if img.height > img.width else int(max_size / aspect_ratio)
        resized_img = img.resize((new_width, new_height), Image.Resampling.LANCZOS)
        resized_img.save(image_path)

Generating Question-Answer Pairs

For each image, we use an LLM to generate question-answer pairs:

  • Example question: “What is the effective date of the contract?”
  • Example answer: “January 1, 2023”

Here’s a script to process images and generate QA pairs:

import os
import csv
import base64
from PIL import Image
from openai import OpenAI

OPENAI_API_KEY=""

client = OpenAI(api_key=OPENAI_API_KEY)

def resize_image(image_path, max_size=1024):
    """Resize the image in place while maintaining its aspect ratio."""
    with Image.open(image_path) as img:
        aspect_ratio = img.width / img.height
        if img.width > img.height:
            new_width = max_size
            new_height = int(max_size / aspect_ratio)
        else:
            new_height = max_size
            new_width = int(max_size * aspect_ratio)
        resized_img = img.resize((new_width, new_height), Image.Resampling.LANCZOS)
        resized_img.save(image_path)
    print(f"Resized and saved: {image_path} to {new_width}x{new_height}")


def generate_question_answer_pairs(image_base64):
    """Call the GPT-4o model to generate question-answer pairs for the given image."""
    image_data_url = f"data:image/png;base64,{image_base64}"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": """# ROLE: Vision-language model dataset generator

# Mission
Analyze the given image and generate a list of question-answer pairs based on the text and information present. The questions should focus on key information present or not present in the document. For each question, if the answer is present in the document, provide the exact answer. If the information is not present or cannot be identified from the document, use the answer "Not Present". Avoid using specific names of individuals when possible. The goal is to create both positive and negative answer sets so the model learns to distinguish between available and unavailable information.

# Steps
1. Examine the document for key pieces of information.
2. Identify the following elements where applicable. For example:
   - Organization names.
   - Titles and roles.
   - Dates (effective date, expiration date, etc.).
   - Signatures.
   - Specific contract terms, phrases, or numbers.
3. Formulate questions based on these identified elements.
4. Determine the answer for each question, whether it is directly available or "Not Present" if absent.

# Output Format
- CSV format with two columns: "Question" and "Answer".
- Each row should represent one question-answer pair.
- Format each entry as follows: "Question","Answer"
- Ensure each question-answer pair is on a new line.
- Enclose the output within ```csv and ``` for post-processing.
"""
                    }
                ]
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": image_data_url}
                    }
                ]
            }
        ],
        temperature=1,
        max_tokens=2048,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        response_format={"type": "text"},
    )

    response_text = response.choices[0].message.content
    # Strip the ```csv fences and any stray quoting from the model output.
    clean_csv = response_text.replace("```csv", "").replace("```", "").strip()
    clean_csv = clean_csv.replace('"""', '"')
    # Drop any header row the model may have included.
    cleaned_lines = [
        line for line in clean_csv.splitlines()
        if not line.lower().strip().startswith(("question", "answer"))
    ]
    return "\n".join(cleaned_lines)


def get_processed_images(output_csv_path):
    """Read the output CSV and return the set of image names that were already processed."""
    processed_images = set()
    if os.path.exists(output_csv_path):
        with open(output_csv_path, mode='r') as csvfile:
            csv_reader = csv.reader(csvfile)
            next(csv_reader, None)  # skip the header row if present
            for row in csv_reader:
                processed_images.add(row[0])
    return processed_images


def process_images_in_folder(folder_path, output_csv_path):
    """Scan a folder for images, resize each one, generate QA pairs with GPT-4o,
    and append the results to a CSV file. Skips images that were already processed."""
    processed_images = get_processed_images(output_csv_path)

    with open(output_csv_path, mode='a', newline='') as csvfile:
        csv_writer = csv.writer(csvfile)
        if os.stat(output_csv_path).st_size == 0:
            csv_writer.writerow(['Image Name', 'Question', 'Answer'])

        for image_filename in os.listdir(folder_path):
            if image_filename.lower().endswith(('.png', '.jpg', '.jpeg')):
                if image_filename in processed_images:
                    print(f"Skipping already processed image: {image_filename}")
                    continue

                image_path = os.path.join(folder_path, image_filename)
                resize_image(image_path, max_size=1024)

                with open(image_path, "rb") as image_file:
                    image_base64 = base64.b64encode(image_file.read()).decode('utf-8')

                qa_pairs = generate_question_answer_pairs(image_base64)
                for row in qa_pairs.splitlines():
                    if ',' not in row:
                        continue  # skip malformed lines
                    question, answer = row.split(',', 1)
                    csv_writer.writerow([image_filename, question.strip(' "'), answer.strip(' "')])


input_folder_path = "images"
output_csv_path = "output.csv"

process_images_in_folder(input_folder_path, output_csv_path)

This script processes a folder of images, resizes them, generates question-answer pairs from the content of each image using GPT-4o, and saves the results to a CSV file. It keeps track of which images have already been processed so that nothing is handled twice, and it resizes each image to a manageable size before sending it to the model. Each image is encoded in base64, sent to GPT-4o, and the response is parsed to extract the question-answer pairs. Here is a breakdown of its functionality:

  • Dependencies: The code imports several libraries including os for file and directory handling, csv for reading and writing CSV files, base64 for encoding images, PIL for image processing, and OpenAI for interacting with the OpenAI API.
  • API Initialization: It initializes the OpenAI client using an API key to enable interaction with the GPT-4 model.
  • **resize_image** Function: This function takes an image path and resizes the image to a maximum dimension (1024 pixels), maintaining its aspect ratio, and overwrites the original image.
  • **generate_question_answer_pairs** Function: This function sends a base64-encoded image to the GPT-4o model. The model is instructed to analyze the image and produce question-answer pairs, focusing on key information present in the image or marking answers as "Not Present" when the information is absent. The output is structured as CSV data.
  • **get_processed_images** Function: Reads the output CSV to gather a set of images that have already been processed, preventing redundant processing.
  • **process_images_in_folder** Function: Iterates through images in a specified folder, resizes and encodes them, generates question-answer pairs using the GPT-4o model, and appends the results to a CSV file. It skips already processed images, ensuring efficiency.
  • File Paths: The folder containing images and the output CSV file path are defined as input_folder_path and output_csv_path.
  • Execution: The code processes all images in the specified folder, generates the question-answer pairs, and saves them to a CSV file.

This code is suitable for creating a dataset of question-answer pairs based on image content, particularly for tasks involving information extraction from visual documents like contracts, IDs, and more.
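If you want to spot-check the generated pairs before moving on, a quick look at output.csv with pandas is enough. This is a minimal sketch, assuming the column names written by the script above:

import pandas as pd

# Quick sanity check on the generated QA pairs (columns: Image Name, Question, Answer).
df = pd.read_csv("output.csv")
print(df.head())
print(f"{df['Image Name'].nunique()} images, {len(df)} question-answer pairs")
print("Negative ('Not Present') answers:", (df['Answer'] == 'Not Present').sum())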

Step 3: Uploading the Dataset to Hugging Face Hub

Now that we have the dataset in a CSV format, let’s prepare it for uploading to the Hugging Face Hub.

import os
import pandas as pd
from datasets import Dataset, Features, Image, Value
from huggingface_hub import HfApi

Read the CSV and group the data:

csv_file_path = 'output.csv'
images_folder_path = 'images'

df = pd.read_csv(csv_file_path)

grouped_data = df.groupby('Image Name')

Prepare Data for Hugging Face Dataset:

data_list = []
for image_name, group in grouped_data:
    messages = []
    image_path = os.path.normpath(os.path.join(images_folder_path, image_name))

    for idx, row in group.iterrows():
        # Append an <image> token to every user turn so the number of <image>
        # tokens matches the number of entries in the "images" list.
        user_message = row['Question'] + "<image>"
        messages.append({"role": "user", "content": user_message})
        messages.append({"role": "assistant", "content": row['Answer']})

    entry = {
        "messages": messages,
        "images": [image_path] * len(group),
    }
    data_list.append(entry)

features = Features({
    'messages': [{'role': Value('string'), 'content': Value('string')}],
    'images': [Image()],
})

dataset = Dataset.from_list(data_list, features=features)

Inserting the <image> tokens is an important step. Without them you will hit the error "ValueError: The number of images does not match the number of <image> tokens". It indicates a mismatch between the number of images supplied for a sample and the number of <image> tokens in that sample's messages, which usually means the data structure doesn't align as expected during processing.
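A quick way to catch this before uploading is to check the invariant directly on data_list: every sample should contain exactly as many <image> tokens across its messages as it has entries in images. A minimal check, using only the variables defined above:

def count_image_tokens(messages):
    # Count <image> placeholders across all turns of one sample.
    return sum(m["content"].count("<image>") for m in messages)

for i, entry in enumerate(data_list):
    n_tokens = count_image_tokens(entry["messages"])
    n_images = len(entry["images"])
    if n_tokens != n_images:
        print(f"Entry {i}: {n_tokens} <image> tokens vs {n_images} images")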

Upload to Hugging Face Hub:

dataset_repo_id = "hfusername/DOCVQA-dataset"
hf_token = ""

api = HfApi()

api.create_repo(repo_id=dataset_repo_id, token=hf_token, repo_type="dataset", exist_ok=True, private=True)

dataset.push_to_hub(dataset_repo_id, token=hf_token)

print(f"Dataset has been uploaded to {dataset_repo_id} on the Hugging Face Hub.")

print("\nSample entry from the dataset:")

print(dataset[1])

VL Dataset in HF
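To confirm the upload worked, you can load the dataset back from the Hub. A minimal sketch, assuming the same repository id and token as above (the repository is private, so the token is required; on older versions of the datasets library the argument is use_auth_token instead of token):

from datasets import load_dataset

ds = load_dataset(dataset_repo_id, split="train", token=hf_token)
print(ds)                     # number of rows and column names
print(ds[0]["messages"][:2])  # first question-answer turn of the first sample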

Step 4: Fine-Tuning Qwen-2-VL with LLaMA-Factory

Now that the dataset is ready and uploaded, let’s configure the fine-tuning process using LLaMA-Factory.

Colab notebook: link

Update **dataset_info.json**: In LLaMA-Factory/data/dataset_info.json, add the following entry:

"mycustomdataset": {
"hf_hub_url": "your_username/DOCVQA-dataset",
"formatting": "sharegpt",
"columns": {
"messages": "messages",
"images": "images"
},
"tags": {
"role_tag": "role",
"content_tag": "content",
"user_tag": "user",
"assistant_tag": "assistant"
}
}
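If you are working in Colab and have cloned the repository, you can also add this entry programmatically instead of editing the file by hand. A minimal sketch, assuming the clone lives in a LLaMA-Factory/ folder (the path is an assumption; adjust it to your setup):

import json

# Hypothetical path to the cloned repository; adjust to your environment.
dataset_info_path = "LLaMA-Factory/data/dataset_info.json"

with open(dataset_info_path, "r", encoding="utf-8") as f:
    dataset_info = json.load(f)

dataset_info["mycustomdataset"] = {
    "hf_hub_url": "your_username/DOCVQA-dataset",
    "formatting": "sharegpt",
    "columns": {"messages": "messages", "images": "images"},
    "tags": {
        "role_tag": "role",
        "content_tag": "content",
        "user_tag": "user",
        "assistant_tag": "assistant",
    },
}

with open(dataset_info_path, "w", encoding="utf-8") as f:
    json.dump(dataset_info, f, indent=2)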

Fine-Tuning Script: Update the dataset argument in the args dictionary of the fine-tuning script for Qwen-2-VL:

args = dict(
stage="sft",
do_train=True,
model_name_or_path="Qwen/Qwen2-VL-2B-Instruct",
dataset="mycustomdataset",
template="qwen2_vl",
finetuning_type="lora",
lora_target="all",
output_dir="qwen2vl_lora",
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
lr_scheduler_type="cosine",
logging_steps=10,
warmup_ratio=0.1,
save_steps=1000,
learning_rate=5e-5,
num_train_epochs=3.0,
max_samples=500,
max_grad_norm=1.0,
loraplus_lr_ratio=16.0,
fp16=True,
use_liger_kernel=True,
)

Run the Script: Execute the script to start the fine-tuning process. Adjust the hyperparameters as needed for optimal performance.
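One common way to launch the run (the pattern used in LLaMA-Factory's example notebooks) is to dump the args dictionary to a JSON config file and hand it to the llamafactory-cli entry point. A minimal sketch, assuming LLaMA-Factory is installed and the CLI is on your PATH:

import json

# Write the training arguments to a config file that llamafactory-cli understands.
with open("train_qwen2vl.json", "w", encoding="utf-8") as f:
    json.dump(args, f, indent=2)

# Then, in a notebook cell or shell:
#   llamafactory-cli train train_qwen2vl.json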

Conclusion

By following this guide, you now have a custom vision-language dataset and a setup to fine-tune the Qwen-2-VL model using LLaMA-Factory. This process is adaptable to various vision-language tasks beyond document VQA, making it a versatile approach for building specialized models.

Happy fine-tuning, and may your models reach new heights of performance!

#VisionLanguageModels #Qwen2VL #MachineLearning #AITraining #DatasetCreation #FineTuning #LLMs #LLaMAFactory #DOCVQA #AIResearch #DataScience #NLP #ComputerVision #AIModelTraining #CustomDataset

Personal Note

I wrote this article because, when I needed to create a vision-language dataset for fine-tuning Qwen-2-VL, I found a lack of resources on how to build a custom dataset for this specific task. Most guides covered the fine-tuning process but skipped over the critical step of dataset creation. Through trial and error, I learned how to adapt tools like the LLaMA-Factory script and wanted to share this knowledge to help others facing similar challenges.

Don’t forget to follow me on Medium and Twitter (@ashokpoudel) for more updates on similar articles. I write and talk about Web Development | Senior Technical Manager | Generative AI Enthusiast. Don’t hesitate to reach out: https://www.linkedin.com/in/ashokpoudel/
