Fine-Tuning Gemma 3 1B-IT for Financial Sentiment Analysis: A Step-by-Step Guide

Sentiment analysis in the financial and economic domain is increasingly vital for businesses. It provides crucial insights into market trends, helps manage reputational risks, and informs investment decisions by gauging stakeholder sentiment.
For this reason, over the past year and earlier I have been experimenting with Kaggle Notebooks to see how different LLMs perform when fine-tuned on a well-known annotated dataset, the FinancialPhraseBank from the Aalto University School of Business. It is a set of approximately 5,000 sentences annotated by 16 human raters, widely used in studies and research initiatives since its publication in 2014 (Malo, P., Sinha, A., Korhonen, P., Wallenius, J., & Takala, P., 2014, “Good debt or bad debt: Detecting semantic orientations in economic texts.” Journal of the Association for Information Science and Technology, 65[4], 782–796).
The experiments, which used the same training and test examples to ensure consistency across tests, covered several LLMs: LLaMA 2, LLaMA 3, Phi-2, Phi-3, Mistral v0.2, and Gemma 7B. Among these, Gemma 7B performed best, achieving an accuracy of 0.887 on a balanced test set containing positive, neutral, and negative cases. Since I relied on the GPU provided by Kaggle, a P100 with 16 GB of VRAM, I always had to quantize the models to 4 bits and restrict the input window to at most 2,048 tokens in order to fit the fine-tuning process into the available memory.
What if we could match the performance of the best of these LLMs with a model so small that you don’t need to quantize it at all, and you can even quadruple its input window to 8,192 tokens, with everything fitting nicely into a P100 GPU-powered Kaggle Notebook?
Impossible? Not since the new Gemma 3 models were launched in Paris a couple of weeks ago.
This tutorial walks you through the process of fine-tuning Google’s recently launched Gemma 3 1B-IT model specifically for this task. You can run it on a Google Colab or a Kaggle Notebook. Just follow my explanations and read the inline comments for a detailed illustration of how the code works and what each snippet does.
You can find the complete Kaggle Notebook with all the code and the outputs at https://www.kaggle.com/code/lucamassaron/fine-tune-gemma-3-1b-it-for-sentiment-analysis
Introducing Gemma 3 1B-IT
Gemma 3 is Google’s latest family of lightweight, state-of-the-art open AI models. Designed for high performance and resource efficiency, the 1B Instruct (IT) version is optimized for instruction-following tasks, making it a powerful yet accessible tool for developers.
More details are in the official announcement: Gemma 3 Blog Post.
Gemma 3 utilizes a transformer architecture enhanced with techniques like RoPE embeddings and GeGLU activations. Key features include:
- Long context window: 128K tokens for the 4B, 12B, and 27B models (32K for the 1B variant used here).
- Multilingual support: Covers over 140 languages.
- Multimodal capabilities: The larger models (4B and up) handle text, images, and short videos; the 1B model used here is text-only.
- Edge device optimization: Runs efficiently on consumer hardware.
Dataset Selection
Annotated datasets for finance and economic texts are relatively rare, with many being proprietary. To address this challenge, researchers from the Aalto University School of Business introduced the FinancialPhraseBank Dataset in 2014, which contains approximately 5,000 sentences.
This dataset provides human-annotated benchmarks, allowing for consistent evaluation of different modeling techniques. The annotations were performed by 16 individuals with a background in financial markets, who categorized the sentences as having a:
- Positive impact on stock prices
- Negative impact on stock prices
- Neutral impact on stock prices
The impact was assessed from an investor’s perspective.
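To get a feel for the data before any modeling, a quick and purely illustrative peek at the label distribution helps; this assumes the same Kaggle CSV path that is loaded again in the Data Preparation section below.
import pandas as pd
# Quick look at the dataset size and class balance (the same file is reloaded later)
df_peek = pd.read_csv("/kaggle/input/sentiment-analysis-for-financial-news/all-data.csv",
                      names=["sentiment", "text"],
                      encoding="utf-8", encoding_errors="replace")
print(df_peek.shape)
print(df_peek["sentiment"].value_counts())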
Setting Up the Environment
First, we need to install the necessary Python libraries.
!pip install -q -U transformers
!pip install -q -U accelerate
!pip install -q -U datasets
!pip install -q -U peft
!pip install -q -i https://pypi.org/simple/ bitsandbytes
!pip install -q -U trl
Explanation of Key Libraries:
- transformers: Hugging Face library for working with pre-trained models.
- accelerate: Simplifies distributed training and mixed-precision.
- datasets: For quickly loading and processing datasets.
- peft: Parameter-Efficient Fine-Tuning methods like LoRA.
- bitsandbytes: Enables quantization for memory efficiency.
- trl: Tools for supervised fine-tuning (SFT) and reinforcement learning.
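Before moving on, it can help to print the installed versions, since the code below relies on fairly recent releases of these libraries (for example, Gemma 3 support in transformers and SFTConfig in trl). A minimal check:
import accelerate
import datasets
import peft
import transformers
import trl
# Print the versions of the key libraries used in this notebook
for lib in (transformers, accelerate, datasets, peft, trl):
    print(f"{lib.__name__}=={lib.__version__}")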
Environment Variables and Warnings
We configure environment variables for GPU usage and suppress tokenizer parallelism warnings.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
We also ignore potential warnings during training that don’t affect the outcome.
import warnings
warnings.filterwarnings("ignore")
Importing Core Libraries
Now, let’s import the main libraries needed for the fine-tuning process.
import numpy as np
import pandas as pd
import os
from tqdm import tqdm
import torch
import torch.nn as nn
import transformers
from transformers import (AutoModelForCausalLM,
                          AutoTokenizer,
                          BitsAndBytesConfig,
                          TrainingArguments,  # Note: SFTConfig from TRL is used later
                          pipeline,
                          logging)
# Explicitly import Gemma3ForCausalLM
from transformers.models.gemma3 import Gemma3ForCausalLM
from datasets import Dataset
from peft import LoraConfig, PeftConfig, PeftModel
from trl import SFTTrainer, SFTConfig # Use SFTConfig from TRL
import bitsandbytes as bnb
from sklearn.metrics import (accuracy_score,
                             classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split
# Check transformers version
print(f"transformers=={transformers.__version__}")
Device Selection Function
This helper function identifies the best available computing device (CUDA GPU, Apple MPS, or CPU); at the moment, you will get the best results with CUDA.
def define_device():
    """Determine and return the optimal PyTorch device based on availability."""
    print(f"PyTorch version: {torch.__version__}", end=" -- ")
    # Check if MPS (Metal Performance Shaders) is available for macOS
    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        print("using MPS device on macOS")
        return torch.device("mps")
    # Check for CUDA availability, falling back to CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"using {device}")
    return device
Loading the Model and Tokenizer
We initialize the Gemma 3 1B-IT model and its tokenizer. We set the computation precision (torch.bfloat16 or torch.float16) based on GPU capability and load the model efficiently onto the selected device.
However, Gemma 3 models exhibit better numerical stability and performance with bfloat16 (bf16) than with float16 (fp16), because they were trained on TPUs with native BF16 support, so their weights are fundamentally optimized for this precision format. If you are working with an older GPU that doesn’t support bfloat16 and float16 gives unsatisfactory results or a brittle model, consider using float32 instead (and restrict the input context if memory becomes tight on your system).
# Determine optimal computation dtype based on GPU capability
# Use bfloat16 if Compute Capability >= 8.0, otherwise float16
compute_dtype = torch.bfloat16 if torch.cuda.get_device_capability()[0] >= 8 else torch.float16
print(f"Using compute dtype {compute_dtype}")
# Select the best available device (CPU, CUDA, or MPS)
device = define_device()
print(f"Operating on {device}")
# Path to the pre-trained model (adjust if necessary)
GEMMA_PATH = "/kaggle/input/gemma-3/transformers/gemma-3-1b-it/1"
# Load the model with optimized settings
model = Gemma3ForCausalLM.from_pretrained(
    GEMMA_PATH,
    torch_dtype=compute_dtype,
    attn_implementation="eager",  # Specify attention implementation
    low_cpu_mem_usage=True,       # Reduces CPU RAM usage during loading
    device_map=device             # Place the model on the selected device
)
# Define maximum sequence length for the tokenizer
max_seq_length = 8192 # Gemma 3 supports long contexts
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    GEMMA_PATH,
    model_max_length=max_seq_length  # Keep the tokenizer's limit in sync with the training configuration
)
# Store the EOS token for later use in prompts
EOS_TOKEN = tokenizer.eos_token
Let’s quickly verify the model is indeed on the GPU.
# Check if all model parameters are on the CUDA device
is_on_gpu = all(param.device.type == 'cuda' for param in model.parameters())
print("Model is on GPU:", is_on_gpu)
Data Preparation
Now, we load the FinancialPhraseBank dataset and prepare it for fine-tuning.
- Load Data: Read the all-data.csv file.
- Stratified Split: Create balanced training and test sets with 300 samples per sentiment class (positive, neutral, negative).
- Shuffle: Randomize the order of the training data.
- Evaluation Set: Use the remaining data for evaluation, ensuring balance by sampling 50 instances per class (with replacement if needed).
- Prompt Formatting: Convert the text data into instruction prompts suitable for Gemma 3 IT. Training prompts include the answer, while test prompts ask for the answer.
- Dataset Conversion: Wrap the prepared data in Hugging Face Dataset objects.
# Load dataset
filename = "/kaggle/input/sentiment-analysis-for-financial-news/all-data.csv" # Adjust path if needed
df = pd.read_csv(filename,
                 names=["sentiment", "text"],
                 encoding="utf-8", encoding_errors="replace")
# Initialize lists to collect the per-class splits
X_train, X_test = [], []
# Stratified train-test split (300 sentences per sentiment for both train and test)
for sentiment in ["positive", "neutral", "negative"]:
    # Split data for the current sentiment
    train, test = train_test_split(df[df.sentiment == sentiment],
                                   train_size=300,
                                   test_size=300,
                                   random_state=42,
                                   # Stratify within the sentiment group (though less critical here)
                                   stratify=df[df.sentiment == sentiment]["sentiment"])
    X_train.append(train)
    X_test.append(test)
# Combine the per-sentiment splits; keep the original dataframe indices for now,
# so we can exclude exactly these rows when building the evaluation set below
X_train = pd.concat(X_train).sample(frac=1, random_state=10)
X_test_full = pd.concat(X_test)
# -- Prepare Evaluation Data --
# Identify the original dataframe rows already used in the train or test sets
selected_indices = set(X_train.index) | set(X_test_full.index)
# Create the evaluation set from the remaining data
X_eval = df.loc[~df.index.isin(selected_indices)].copy()
# Resample the evaluation data for balance (50 per class);
# replace=True allows sampling with replacement if a class has fewer than 50 remaining rows
X_eval = X_eval.groupby('sentiment', group_keys=False).apply(
    lambda x: x.sample(n=50, random_state=10, replace=True)
).reset_index(drop=True)
# With the evaluation split defined, reset the train and test indices
X_train = X_train.reset_index(drop=True)
X_test_full = X_test_full.reset_index(drop=True)
# Extract the true labels before turning the test texts into prompts
y_true = X_test_full["sentiment"]
# Keep only the text column for test prompt generation
X_test = X_test_full[["text"]]
# -- Prompt Generation Functions --
# Function to generate training prompts (with label)
def generate_train_prompt(data_point):
    return f"""
Analyze the sentiment of the news headline enclosed in square brackets.
Determine if it is positive, neutral, or negative, and return the corresponding sentiment label: "positive", "neutral", or "negative".
[{data_point["text"]}] = {data_point["sentiment"]}
""".strip() + EOS_TOKEN  # Append the EOS token so the model learns where the answer ends
# Function to generate test prompts (without label)
def generate_test_prompt(data_point):
    return f"""
Analyze the sentiment of the news headline enclosed in square brackets.
Determine if it is positive, neutral, or negative, and return the corresponding sentiment label: "positive", "neutral", or "negative".
[{data_point["text"]}] = """.strip()  # No label and no EOS token: the model will generate the answer
# -- Apply Prompts and Convert to Dataset --
# Apply prompt generation to create the final text column for training and evaluation
X_train = pd.DataFrame(X_train.apply(generate_train_prompt, axis=1), columns=["text"])
X_eval = pd.DataFrame(X_eval.apply(generate_train_prompt, axis=1), columns=["text"])
# Apply prompt generation for the test set
X_test = pd.DataFrame(X_test.apply(generate_test_prompt, axis=1), columns=["text"])
# Convert pandas DataFrames to Hugging Face Dataset objects
train_data = Dataset.from_pandas(X_train)
eval_data = Dataset.from_pandas(X_eval)
# Note: X_test remains a DataFrame for the predict function, y_true holds labels
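Before training, it is worth printing one training prompt and one test prompt to verify the formatting looks as intended (a small check, not part of the original code):
# Inspect one formatted training prompt (ends with the label and the EOS token)
print(train_data[0]["text"])
print("-" * 40)
# Inspect one formatted test prompt (ends with "=", waiting for the model's answer)
print(X_test.iloc[0]["text"])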
Evaluation and Prediction Functions
We need functions to evaluate the model’s performance and generate predictions.
Evaluation function
This function calculates accuracy (overall and per class), generates a classification report (precision, recall, F1-score), and displays a confusion matrix.
def evaluate(y_true, y_pred):
    """Evaluate the fine-tuned sentiment model's performance."""
    # Map sentiment labels to numeric values for scikit-learn metrics
    label_mapping = {'positive': 2, 'neutral': 1, 'negative': 0}
    # Handle 'none' predictions (mapped to neutral; treat them as errors if you prefer)
    y_true_num = np.array([label_mapping.get(label, 1) for label in y_true])
    y_pred_num = np.array([label_mapping.get(label, 1) for label in y_pred])
    # Calculate overall accuracy
    accuracy = accuracy_score(y_true_num, y_pred_num)
    print(f'Overall Accuracy: {accuracy:.3f}')
    # Compute accuracy for each sentiment label
    unique_labels = np.unique(y_true_num)
    # Map numeric labels back to strings for printing
    reverse_label_mapping = {v: k for k, v in label_mapping.items()}
    for label_num in unique_labels:
        label_mask = y_true_num == label_num  # Mask for the current class
        label_accuracy = accuracy_score(y_true_num[label_mask], y_pred_num[label_mask])
        print(f'Accuracy for label {label_num} ({reverse_label_mapping.get(label_num, "unknown")}): {label_accuracy:.3f}')
    # Generate the classification report using string labels for clarity
    class_report = classification_report(y_true, y_pred, labels=["negative", "neutral", "positive"], zero_division=0)
    print('\nClassification Report:\n', class_report)
    # Compute and display the confusion matrix, ordered negative(0), neutral(1), positive(2)
    conf_matrix = confusion_matrix(y_true_num, y_pred_num, labels=[0, 1, 2])
    print('\nConfusion Matrix (Rows: True, Cols: Pred) [Neg, Neu, Pos]:\n', conf_matrix)
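A quick way to confirm the function behaves as expected is to call it on a few toy labels (illustrative only):
# Toy example: three correct and one wrong prediction
evaluate(y_true=["positive", "neutral", "negative", "neutral"],
         y_pred=["positive", "positive", "negative", "neutral"])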
Prediction function
This second function takes the test data, model, and tokenizer and generates a sentiment prediction for each headline. It calls the model’s generate method with controlled parameters (a small max_new_tokens and greedy decoding for deterministic output) and then parses the predicted label from the generated text.
def predict(X_test_df, model_to_use, tokenizer_to_use, device_to_use=device, max_new_tokens=5):
    """Predict the sentiment of news headlines using the provided model and tokenizer."""
    y_pred = []  # List to store predicted sentiment labels
    model_to_use.eval()  # Set the model to evaluation mode
    # Iterate through each headline in the test DataFrame
    for i in tqdm(range(len(X_test_df)), desc="Predicting Sentiments"):
        prompt = X_test_df.iloc[i]["text"]  # Extract the prompt text
        # Tokenize the prompt and move the tensors to the correct device
        inputs = tokenizer_to_use(prompt, return_tensors="pt").to(device_to_use)
        # Generate output from the model
        with torch.no_grad():  # Disable gradient calculations for inference
            outputs = model_to_use.generate(**inputs,
                                            max_new_tokens=max_new_tokens,
                                            do_sample=False,  # Greedy decoding for deterministic output
                                            pad_token_id=tokenizer_to_use.eos_token_id)  # Avoid a padding warning
        # Decode only the newly generated tokens (everything after the input prompt)
        prompt_length = inputs["input_ids"].shape[1]
        generated_text = tokenizer_to_use.decode(outputs[0][prompt_length:],
                                                 skip_special_tokens=True).strip().lower()
        # Extract the first predicted sentiment label
        if "positive" in generated_text:
            y_pred.append("positive")
        elif "negative" in generated_text:
            y_pred.append("negative")
        elif "neutral" in generated_text:
            y_pred.append("neutral")
        else:
            # Fallback if no clear label is found in the short generation
            y_pred.append("none")
    return y_pred
Baseline Evaluation (Before Fine-Tuning)
Let’s see how the base Gemma 3 1B-IT model performs on our task without any fine-tuning. This establishes a baseline.
# Generate predictions using the base model
y_pred_base = predict(X_test, model, tokenizer)
Now, evaluate these baseline predictions.
# Evaluate the baseline predictions
print("--- Baseline Model Evaluation ---")
evaluate(y_true, y_pred_base)
--- Baseline Model Evaluation ---
Overall Accuracy: 0.550
Accuracy for label 0 (negative): 0.603
Accuracy for label 1 (neutral): 0.143
Accuracy for label 2 (positive): 0.903
Classification Report:
precision recall f1-score support
negative 0.44 0.60 0.51 300
neutral 0.53 0.14 0.23 300
positive 0.88 0.90 0.89 300
accuracy 0.55 900
macro avg 0.62 0.55 0.54 900
weighted avg 0.62 0.55 0.54 900
Confusion Matrix (Rows: True, Cols: Pred) [Neg, Neu, Pos]:
[[181 20 99]
[ 14 43 243]
[ 11 18 271]]
Baseline Analysis:
- Overall Accuracy (55.0%): The base model struggles significantly with this specific task format.
- Neutral Sentiment (14.3% Accuracy, 0.14 Recall): The model is particularly poor at identifying neutral headlines, frequently misclassifying them as positive (243 out of 300 times, according to the confusion matrix).
- Positive Sentiment (90.3% Accuracy): It performs reasonably well on positive cases.
- Negative Sentiment (60.3% Accuracy): It has moderate success with negative cases but misclassifies a notable number as positive (99 instances).
- Bias: The model seems heavily biased towards predicting “positive” and struggles to differentiate neutral cases.
This baseline performance highlights the need for fine-tuning to adapt the model to our specific sentiment analysis task and prompt format.
Fine-Tuning Setup (PEFT with LoRA)
We’ll use Parameter-Efficient Fine-Tuning (PEFT), specifically Low-Rank Adaptation (LoRA), via the trl library’s SFTTrainer. LoRA significantly reduces computational costs by freezing most pre-trained model weights and injecting small, trainable “adapter” layers.
LoRA Configuration
Define the parameters for the LoRA adapters.
peft_config = LoraConfig(
    lora_alpha=32,                # Scaling factor for the LoRA weights
    lora_dropout=0.05,            # Dropout probability for the LoRA layers
    r=64,                         # Rank of the LoRA decomposition (higher r = more trainable parameters)
    bias="none",                  # Whether to train bias parameters ('none', 'all', or 'lora_only')
    task_type="CAUSAL_LM",        # Task type is Causal Language Modeling
    target_modules="all-linear",  # Apply LoRA to all linear layers
)
Training Configuration (SFTConfig)
Define the arguments for the SFTTrainer. This includes hyperparameters like learning rate, batch size, number of epochs, saving strategy, and optimization settings.
training_arguments = SFTConfig(
    output_dir="logs",                      # Directory to save logs and checkpoints
    num_train_epochs=4,                     # Number of training epochs
    per_device_train_batch_size=1,          # Batch size per GPU (keep small for large models/limited VRAM)
    gradient_accumulation_steps=8,          # Accumulate gradients over 8 steps (effective batch size = 1*8 = 8)
    optim="adamw_torch_fused",              # Use the fused AdamW optimizer (efficient)
    save_steps=112,                         # Save a checkpoint every 112 steps
    logging_steps=25,                       # Log training metrics every 25 steps
    learning_rate=2e-4,                     # Learning rate
    weight_decay=0.001,                     # Weight decay for regularization
    fp16=(compute_dtype == torch.float16),  # Enable FP16 mixed precision when bfloat16 is unavailable
    bf16=(compute_dtype == torch.bfloat16), # Enable BF16 mixed precision on supported GPUs
    max_grad_norm=0.3,                      # Gradient clipping threshold
    max_steps=-1,                           # Max training steps (-1 means use num_train_epochs)
    warmup_ratio=0.03,                      # Proportion of training steps for learning rate warmup
    group_by_length=False,                  # Don't group sequences by length
    lr_scheduler_type="constant",           # Learning rate scheduler type
    report_to="tensorboard",                # Report metrics to TensorBoard
    evaluation_strategy="steps",            # Evaluate during training at specified step intervals
    eval_steps=112,                         # Evaluate every 112 steps
    load_best_model_at_end=True,            # Load the best model checkpoint at the end of training
    gradient_checkpointing=True,            # Enable gradient checkpointing to save memory
    gradient_checkpointing_kwargs={"use_reentrant": False},  # Recommended setting for recent PyTorch versions
    # SFTTrainer-specific arguments
    dataset_text_field="text",              # Name of the text field in the dataset
    max_seq_length=max_seq_length,          # Maximum sequence length
    packing=False,                          # Don't pack multiple sequences into one input
    dataset_kwargs={                        # Arguments for dataset processing
        "add_special_tokens": False,        # Special tokens are already handled in the prompt
        "append_concat_token": False,       # EOS is already appended in the training prompts
    },
)
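A quick bit of arithmetic explains the choice of save_steps=112 and eval_steps=112: with 900 training prompts (300 per class), a per-device batch size of 1, and 8 gradient-accumulation steps, one epoch corresponds to roughly 112 optimizer updates, so the model is saved and evaluated about once per epoch, and 4 epochs give the 448 total steps you will see in the training log below. A minimal sketch of that calculation:
# Rough calculation of optimizer steps per epoch (ignoring dataloader rounding details)
train_examples = 900              # 300 prompts per sentiment class
effective_batch_size = 1 * 8      # per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = train_examples // effective_batch_size
print(f"Effective batch size: {effective_batch_size}, steps per epoch: {steps_per_epoch}")  # 8, 112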
Initialize SFTTrainer
Create the trainer instance, passing the model, datasets, configurations, and tokenizer.
# Disable caching for training, re-enable for inference later
model.config.use_cache = False
# Set pretraining_tp if relevant for distributed training (usually 1 for single GPU)
model.config.pretraining_tp = 1
# Initialize the SFTTrainer
trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=eval_data,
    peft_config=peft_config,
    tokenizer=tokenizer,       # Pass the tokenizer directly (newer TRL versions accept it as processing_class)
    args=training_arguments,
)
Note: the notebook output shows dataset processing logs such as “Converting train dataset to ChatML”, “Applying chat template”, “Tokenizing”, and “Truncating”. These are standard preprocessing steps that SFTTrainer performs based on the model type and configuration, and they indicate the data is being prepared correctly for the model.
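Since the trainer wraps the base model with the LoRA adapters, you can also check how small the trainable portion really is (an optional inspection, assuming trainer.model exposes the usual PEFT helper):
# Show how many parameters the LoRA adapters add versus the frozen 1B base model
trainer.model.print_trainable_parameters()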
Executing the Fine-Tuning
Now, we start the training process using the configured trainer.
# Train the model
print("Starting fine-tuning...")
train_result = trainer.train()
print("Fine-tuning finished.")
# Optionally, print training metrics
metrics = train_result.metrics
print("Training Metrics:", metrics)
Starting fine-tuning...
# Note: Example output from the notebook execution - actual times/loss will vary
# [448/448 30:42, Epoch 3/4]
# Step Training Loss Validation Loss
# 112 1.155600 1.248115
# 224 0.897200 1.208389
# 336 0.677600 1.255559
# 448 0.455800 1.451596
Fine-tuning finished.
Training Metrics: {'train_runtime': 1847.1768, 'train_samples_per_second': 1.949, 'train_steps_per_second': 0.243, 'total_flos': 1363805828158464.0, 'train_loss': 0.8318860807589122, 'epoch': 4.0}
The training logs show the progression of training and validation loss over the epochs.
Saving the Fine-Tuned LoRA Adapter
After training, we save the trained LoRA adapter weights (not the entire model) and the tokenizer. This is efficient as the adapter is much smaller than the base model.
# Define directory to save LoRA adapter and tokenizer
lora_directory = "LoRA-Gemma3-1B-Financial-Sentiment"
# Save the LoRA adapter weights
trainer.model.save_pretrained(lora_directory)
print(f"LoRA adapter saved to {lora_directory}")
# Save the tokenizer associated with the training
trainer.tokenizer.save_pretrained(lora_directory)
print(f"Tokenizer saved to {lora_directory}")
Monitoring with TensorBoard (Optional)
You can use TensorBoard to visualize the training metrics logged during the process. If running in an environment like Google Colab or a local Jupyter setup:
# Load the TensorBoard notebook extension
%load_ext tensorboard
# Start TensorBoard, pointing to the logs directory
%tensorboard --logdir logs/runs
This would launch an interactive TensorBoard interface (usually requires separate execution or specific environment setup).
Evaluation After Fine-Tuning
Let’s evaluate the performance of our fine-tuned model on the test set. The trainer object now holds the best model checkpoint loaded during training (load_best_model_at_end=True).
# Ensure the model is in evaluation mode
# trainer.model.eval() # Already handled by predict function, but good practice
# Generate predictions using the fine-tuned model from the trainer
print("Predicting with fine-tuned model...")
y_pred_tuned = predict(X_test, trainer.model, tokenizer) # Use trainer.model
# Evaluate the fine-tuned predictions
print("\n--- Fine-Tuned Model Evaluation ---")
evaluate(y_true, y_pred_tuned)
--- Fine-Tuned Model Evaluation ---
Overall Accuracy: 0.853
Accuracy for label 0 (negative): 0.977
Accuracy for label 1 (neutral): 0.753
Accuracy for label 2 (positive): 0.830
Classification Report:
precision recall f1-score support
negative 0.91 0.98 0.94 300
neutral 0.82 0.75 0.78 300
positive 0.83 0.83 0.83 300
accuracy 0.85 900
macro avg 0.85 0.85 0.85 900
weighted avg 0.85 0.85 0.85 900
Confusion Matrix (Rows: True, Cols: Pred) [Neg, Neu, Pos]:
[[293 5 2]
[ 24 226 50]
[ 5 46 249]]
Post-Fine-Tuning Analysis:
- Significant Improvement: Overall accuracy jumped from 55.0% to 85.3%!
- Balanced Performance: The model now performs well across all classes, with F1-scores around 0.94 (negative), 0.78 (neutral), and 0.83 (positive). The macro/weighted averages are strong at 0.85.
- Negative Class: Excellent performance with 97.7% accuracy and high precision/recall.
- Neutral Class: Dramatically improved from the baseline, reaching 75.3% accuracy and a 0.78 F1-score. It still misclassifies some neutral cases (mostly as positive), but far fewer than before.
- Positive Class: Solid performance with 83.0% accuracy and balanced precision/recall.
Comparison with Gemma 7B-IT
Compared with my earlier experiments with a quantized Gemma 7B-IT model, which reached 0.887 accuracy on the same balanced test set, the larger model may still achieve slightly higher overall accuracy and better performance on neutral cases. However, the Gemma 3 1B-IT model, despite being roughly 7x smaller, delivers remarkably strong results after fine-tuning, not far from those obtained with Gemma 7B-IT.
Gemma 3 1B-IT offers an excellent balance between performance and resource efficiency, making it a highly viable option for sentiment analysis, especially when computational resources are limited or faster inference is needed.
Saving Results and Merged Model
Finally, let’s save the predictions for analysis and create a final, merged model containing both the base Gemma weights and the fine-tuned LoRA adapter.
Save Predictions
Store the test texts, true labels, and predicted labels in a CSV file for detailed error analysis.
# Create DataFrame with test texts, true labels, and predicted labels
evaluation = pd.DataFrame({'text': X_test["text"], # Use the prompts used for prediction
'y_true':y_true,
'y_pred': y_pred_tuned},
)
# Save the evaluation DataFrame to a CSV file
output_predictions_file = "test_predictions_gemma3_1b_tuned.csv"
evaluation.to_csv(output_predictions_file, index=False)
print(f"Test predictions saved to {output_predictions_file}")
Merge LoRA Adapter and Save Full Model
Load the base model again, merge the trained LoRA weights into it, and save the resulting standalone fine-tuned model. This merged model can be loaded and used directly without needing the peft library for inference.
# --- Reload and Merge ---
# Ensure the trainer and original model are not needed anymore to free memory if necessary
# import gc
# del trainer, model
# gc.collect()
# torch.cuda.empty_cache()
print("Reloading base model...")
# Load the base model again (ensure enough RAM/VRAM)
base_model = AutoModelForCausalLM.from_pretrained(
    GEMMA_PATH,
    torch_dtype=compute_dtype,  # Use the same dtype as training
    low_cpu_mem_usage=True,
    device_map="auto"           # Let transformers handle device mapping
)
print("Loading PeftModel and merging...")
# Load the PeftModel by combining base model and LoRA adapter
peft_model = PeftModel.from_pretrained(base_model, lora_directory)
# Merge the LoRA weights into the base model
merged_model = peft_model.merge_and_unload()
print("Merging complete.")
# --- Save Merged Model and Tokenizer ---
merged_model_directory = "merged-Gemma3-1B-Financial-Sentiment"
print(f"Saving merged model to {merged_model_directory}...")
merged_model.save_pretrained(merged_model_directory,
                             safe_serialization=True,  # Save in the recommended safetensors format
                             max_shard_size="2GB")     # Shard large checkpoints if needed
print("Merged model saved.")
print(f"Saving tokenizer to {merged_model_directory}...")
# Load the tokenizer from the LoRA directory and save it with the merged model
tokenizer_for_merged = AutoTokenizer.from_pretrained(lora_directory)
tokenizer_for_merged.save_pretrained(merged_model_directory)
print("Tokenizer saved.")
Conclusion
This tutorial demonstrated the process of fine-tuning the Gemma 3 1B-IT model for financial sentiment analysis using the FinancialPhraseBank dataset and PEFT/LoRA. We observed a dramatic improvement in performance after fine-tuning, transforming the model from a poor baseline performer into a highly effective sentiment classifier for this specific task, achieving over 85% accuracy.
The Gemma 3 1B-IT model proves to be a powerful and efficient option, offering strong performance even compared to larger models. This makes it suitable for applications where computational resources or inference speed are constraints. The resulting merged model is now ready for deployment or further experimentation.
We’ve reached the end of this fine-tuning guide. Hopefully, it provided valuable insights into adapting Gemma 3 1B-IT for specific tasks like sentiment analysis. Thanks for joining, and keep exploring the possibilities of smaller LLMs (SLMs) like Gemma!
Enjoy building with AI!
#Gemma3 #Gemmaverse #AI #LLM #FineTuning #SentimentAnalysis #NLP