Fine-Tuning Gemma 3 1B-IT for Financial Sentiment Analysis: A Step-by-Step Guide

Gemma 3 is Google’s latest family of lightweight, state-of-the-art open AI models. Designed for high performance and resource efficiency, the 1B Instruct (IT) version is optimized for instruction-following tasks, making it a powerful yet accessible tool for developers.

More details are in the official announcement: Gemma 3 Blog Post.

Gemma 3 utilizes a transformer architecture enhanced with techniques like RoPE embeddings and GeGLU activations. Key features include:

  • 128K-token context window: Processes extensive information (the 1B model used in this guide has a 32K-token context window).
  • Multilingual support: Covers over 140 languages.
  • Multimodal capabilities: The larger Gemma 3 models (4B, 12B, 27B) handle images and short videos in addition to text; the 1B model used in this guide is text-only.
  • Edge device optimization: Runs efficiently on consumer hardware.
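If you want to try the instruction-tuned model before any fine-tuning, a plain transformers pipeline call is enough. This is a minimal sketch, assuming the Hugging Face Hub id google/gemma-3-1b-it (a gated model, so you may need to accept its license and authenticate first); the exact output will vary:

from transformers import pipeline

# Quick smoke test of the instruction-tuned model with a chat-style prompt.
generator = pipeline("text-generation", model="google/gemma-3-1b-it")
messages = [{"role": "user", "content": "In one sentence, what does 'bullish' mean in finance?"}]
result = generator(messages, max_new_tokens=40)
print(result[0]["generated_text"][-1]["content"])  # the model's reply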

Dataset Selection

Annotated datasets for finance and economic texts are relatively rare, with many being proprietary. To address this challenge, researchers from the Aalto University School of Business introduced the FinancialPhraseBank Dataset in 2014, which contains approximately 5,000 sentences.

This dataset provides human-annotated benchmarks, allowing for consistent evaluation of different modeling techniques. The annotations were performed by 16 individuals with a background in financial markets, who categorized the sentences as having a:

  • Positive impact on stock prices
  • Negative impact on stock prices
  • Neutral impact on stock prices

The impact was assessed from an investor’s perspective.
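Before writing any training code, it helps to glance at the class balance. The snippet below is a quick optional check (the Kaggle file path and column names are the same ones used later in this guide):

import pandas as pd

# Peek at the label distribution of the FinancialPhraseBank CSV.
filename = "/kaggle/input/sentiment-analysis-for-financial-news/all-data.csv"
df = pd.read_csv(filename, names=["sentiment", "text"],
                 encoding="utf-8", encoding_errors="replace")
print(df["sentiment"].value_counts())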

Setting Up the Environment

First, we need to install the necessary Python libraries.

!pip install -q -U transformers
!pip install -q -U accelerate
!pip install -q -U datasets
!pip install -q -U peft
!pip install -q -i https://pypi.org/simple/ bitsandbytes
!pip install -q -U trl

Explanation of Key Libraries:

  • transformers: Hugging Face library for working with pre-trained models.
  • accelerate: Simplifies distributed training and mixed-precision.
  • datasets: For quickly loading and processing datasets.
  • peft: Parameter-Efficient Fine-Tuning methods like LoRA.
  • bitsandbytes: Enables quantization for memory efficiency.
  • trl: Tools for supervised fine-tuning (SFT) and reinforcement learning.

Environment Variables and Warnings

We configure environment variables for GPU usage and suppress tokenizer parallelism warnings.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

We also ignore potential warnings during training that don’t affect the outcome.

import warnings
warnings.filterwarnings("ignore")

Importing Core Libraries

Now, let’s import the main libraries needed for the fine-tuning process.

import os

import numpy as np
import pandas as pd
from tqdm import tqdm

import torch
import torch.nn as nn

import transformers
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          TrainingArguments, pipeline, logging)
from transformers.models.gemma3 import Gemma3ForCausalLM

from datasets import Dataset
from peft import LoraConfig, PeftConfig, PeftModel
from trl import SFTTrainer, SFTConfig
import bitsandbytes as bnb

from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split

print(f"transformers=={transformers.__version__}")

Device Selection Function

This helper function identifies the best available computing device (CUDA GPU, Apple MPS, or CPU). For this workflow, a CUDA GPU currently gives the best results.

def define_device():
    """Determine and return the optimal PyTorch device based on availability."""
    print(f"PyTorch version: {torch.__version__}", end=" -- ")

    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        print("using MPS device on macOS")
        return torch.device("mps")

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"using {device}")
    return device

Loading the Model and Tokenizer

We initialize the Gemma 3 1B-IT model and its tokenizer. We set the computation precision (torch.bfloat16 or torch.float16) based on GPU capability and load the model efficiently onto the selected device.

However, Gemma 3 models exhibit better numerical stability and performance with bfloat16 (bf16) than with float16 (fp16), because they were trained on TPUs with native BF16 support, so their weights are fundamentally optimized for this precision format. If you are working with an older GPU that doesn't support bfloat16 and float16 gives unsatisfactory or unstable results, consider using float32 instead (and restrict the input context if memory becomes a constraint).

compute_dtype = torch.bfloat16 if torch.cuda.get_device_capability()[0] >= 8 else torch.float16
# Note: on older GPUs where float16 proves unstable, torch.float32 is a safer
# (but more memory-hungry) fallback, as discussed above.
print(f"Using compute dtype {compute_dtype}")

device = define_device()
print(f"Operating on {device}")

GEMMA_PATH = "/kaggle/input/gemma-3/transformers/gemma-3-1b-it/1"

model = Gemma3ForCausalLM.from_pretrained(GEMMA_PATH,
                                          torch_dtype=compute_dtype,
                                          attn_implementation="eager",
                                          low_cpu_mem_usage=True,
                                          device_map=device)

max_seq_length = 8192

tokenizer = AutoTokenizer.from_pretrained(GEMMA_PATH,
                                          max_seq_length=max_seq_length,
                                          device_map=device)

EOS_TOKEN = tokenizer.eos_token

Let’s quickly verify the model is indeed on the GPU.

is_on_gpu = all(param.device.type == 'cuda' for param in model.parameters())
print("Model is on GPU:", is_on_gpu)

Data Preparation

Now, we load the FinancialPhraseBank dataset and prepare it for fine-tuning.

  1. Load Data: Read the all-data.csv file.

  2. Stratified Split: Create balanced training and test sets with 300 samples per sentiment class (positive, neutral, negative).

  3. Shuffle: Randomize the order of the training data.

  4. Evaluation Set: Use the remaining data for evaluation, ensuring balance by sampling 50 instances per class (with replacement if needed).

  5. Prompt Formatting: Convert the text data into instruction prompts suitable for Gemma 3 IT. Training prompts include the answer, while test prompts ask for the answer.

  6. Dataset Conversion: Wrap the prepared data in Hugging Face Dataset objects.

filename = "/kaggle/input/sentiment-analysis-for-financial-news/all-data.csv"

df = pd.read_csv(filename,
                 names=["sentiment", "text"],
                 encoding="utf-8",
                 encoding_errors="replace")

X_train, X_test = [], []

# Stratified split: 300 training and 300 test samples per sentiment class.
for sentiment in ["positive", "neutral", "negative"]:
    train, test = train_test_split(df[df.sentiment == sentiment],
                                   train_size=300,
                                   test_size=300,
                                   random_state=42,
                                   stratify=df[df.sentiment == sentiment]["sentiment"])
    X_train.append(train)
    X_test.append(test)

X_train = pd.concat(X_train)
X_test_full = pd.concat(X_test)

# Record the original dataframe indices *before* resetting them, so the
# evaluation split really excludes the rows used for training and testing.
train_indices = set(X_train.index)
test_indices = set(X_test_full.index)
selected_indices = train_indices | test_indices

X_train = X_train.sample(frac=1, random_state=10).reset_index(drop=True)
X_test_full = X_test_full.reset_index(drop=True)

y_true = X_test_full["sentiment"]
X_test = X_test_full[["text"]]

# Evaluation set: remaining rows, balanced at 50 samples per class
# (with replacement if a class runs short).
X_eval = df.loc[~df.index.isin(selected_indices)].copy()
X_eval = X_eval.groupby("sentiment", group_keys=False).apply(
    lambda x: x.sample(n=50, random_state=10, replace=True)
).reset_index(drop=True)

def generate_train_prompt(data_point):
    """Build a training prompt that includes the expected sentiment label."""
    return f"""Analyze the sentiment of the news headline enclosed in square brackets.
Determine if it is positive, neutral, or negative, and return the corresponding sentiment label: "positive", "neutral", or "negative".

[{data_point["text"]}] = {data_point["sentiment"]}
""".strip() + EOS_TOKEN


def generate_test_prompt(data_point):
    """Build an inference prompt that asks the model for the sentiment label."""
    return f"""Analyze the sentiment of the news headline enclosed in square brackets.
Determine if it is positive, neutral, or negative, and return the corresponding sentiment label: "positive", "neutral", or "negative".

[{data_point["text"]}] = """.strip()


X_train = pd.DataFrame(X_train.apply(generate_train_prompt, axis=1), columns=["text"])
X_eval = pd.DataFrame(X_eval.apply(generate_train_prompt, axis=1), columns=["text"])
X_test = pd.DataFrame(X_test.apply(generate_test_prompt, axis=1), columns=["text"])

train_data = Dataset.from_pandas(X_train)
eval_data = Dataset.from_pandas(X_eval)
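To sanity-check the prompt formatting, you can print one converted training example (a quick optional check, not part of the original notebook):

# Inspect one formatted training prompt to confirm the template and the EOS token.
print(train_data[0]["text"])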

Evaluation and Prediction Functions

We need functions to evaluate the model’s performance and generate predictions.

Evaluation function

This function calculates accuracy (overall and per class), generates a classification report (precision, recall, F1-score), and displays a confusion matrix.

def evaluate(y_true, y_pred):
    """Evaluate the fine-tuned sentiment model's performance."""
    label_mapping = {'positive': 2, 'neutral': 1, 'negative': 0}

    y_true_num = np.array([label_mapping.get(label, 1) for label in y_true])
    y_pred_num = np.array([label_mapping.get(label, 1) for label in y_pred])

    accuracy = accuracy_score(y_true_num, y_pred_num)
    print(f'Overall Accuracy: {accuracy:.3f}')

    unique_labels = np.unique(y_true_num)
    reverse_label_mapping = {v: k for k, v in label_mapping.items()}
    for label_num in unique_labels:
        label_mask = y_true_num == label_num
        label_accuracy = accuracy_score(y_true_num[label_mask], y_pred_num[label_mask])
        print(f'Accuracy for label {label_num} ({reverse_label_mapping.get(label_num, "unknown")}): {label_accuracy:.3f}')

    class_report = classification_report(y_true, y_pred,
                                         labels=["negative", "neutral", "positive"],
                                         zero_division=0)
    print('\nClassification Report:\n', class_report)

    conf_matrix = confusion_matrix(y_true_num, y_pred_num, labels=[0, 1, 2])
    print('\nConfusion Matrix (Rows: True, Cols: Pred) [Neg, Neu, Pos]:\n', conf_matrix)

Prediction function

This second function takes the test data, model, and tokenizer, and generates a sentiment prediction for each headline. It calls the model's generate method with controlled parameters (a small max_new_tokens and temperature=0.0, i.e. deterministic greedy decoding) and then parses the predicted label from the text that follows the prompt.

def predict(X_test_df, model_to_use, tokenizer_to_use, device_to_use=device,
            max_new_tokens=5, temperature=0.0):
    """Predict the sentiment of news headlines using the provided model and tokenizer."""
    y_pred = []
    model_to_use.eval()

    for i in tqdm(range(len(X_test_df)), desc="Predicting Sentiments"):
        prompt = X_test_df.iloc[i]["text"]
        input_ids = tokenizer_to_use(prompt, return_tensors="pt").to(device_to_use)

        with torch.no_grad():
            outputs = model_to_use.generate(**input_ids,
                                            max_new_tokens=max_new_tokens,
                                            temperature=temperature,
                                            pad_token_id=tokenizer_to_use.eos_token_id)

        # The prompt ends with "] =", so the model's answer is whatever follows
        # that marker in the decoded output.
        prompt_end_marker = "] ="
        full_decoded_text = tokenizer_to_use.decode(outputs[0], skip_special_tokens=True)

        try:
            generated_text = full_decoded_text.split(prompt_end_marker)[1].strip().lower()
        except IndexError:
            generated_text = ""

        # Map the free-form completion to one of the three labels.
        if "positive" in generated_text:
            y_pred.append("positive")
        elif "negative" in generated_text:
            y_pred.append("negative")
        elif "neutral" in generated_text:
            y_pred.append("neutral")
        else:
            y_pred.append("none")

    return y_pred

Baseline Evaluation (Before Fine-Tuning)

Let’s see how the base Gemma 3 1B-IT model performs on our task without any fine-tuning. This establishes a baseline.

y_pred_base = predict(X_test, model, tokenizer)

Now, evaluate these baseline predictions.

print("--- Baseline Model Evaluation ---")
evaluate(y_true, y_pred_base)

--- Baseline Model Evaluation ---
Overall Accuracy: 0.550
Accuracy for label 0 (negative): 0.603
Accuracy for label 1 (neutral): 0.143
Accuracy for label 2 (positive): 0.903

Classification Report:
               precision    recall  f1-score   support

    negative       0.44      0.60      0.51       300
     neutral       0.53      0.14      0.23       300
    positive       0.88      0.90      0.89       300

    accuracy                           0.55       900
   macro avg       0.62      0.55      0.54       900
weighted avg       0.62      0.55      0.54       900

Confusion Matrix (Rows: True, Cols: Pred) [Neg, Neu, Pos]:
 [[181  20  99]
 [ 14  43 243]
 [ 11  18 271]]

Baseline Analysis:

  • Overall Accuracy (55.0%): The base model struggles significantly with this specific task format.
  • Neutral Sentiment (14.3% Accuracy, 0.14 Recall): The model is particularly poor at identifying neutral headlines, frequently misclassifying them as positive (243 out of 300 times, according to the confusion matrix).
  • Positive Sentiment (90.3% Accuracy): It performs reasonably well on positive cases.
  • Negative Sentiment (60.3% Accuracy): It has moderate success with negative cases but misclassifies a notable number as positive (99 instances).
  • Bias: The model seems heavily biased towards predicting “positive” and struggles to differentiate neutral cases.

This baseline performance highlights the need for fine-tuning to adapt the model to our specific sentiment analysis task and prompt format.

Fine-Tuning Setup (PEFT with LoRA)

We’ll use Parameter-Efficient Fine-Tuning (PEFT), specifically Low-Rank Adaptation (LoRA), via the trl library’s SFTTrainer. LoRA significantly reduces computational costs by freezing most pre-trained model weights and injecting small, trainable “adapter” layers.

LoRA Configuration

Define the parameters for the LoRA adapters.

peft_config = LoraConfig(
lora_alpha=32,
lora_dropout=0.05,
r=64,
bias="none",
task_type="CAUSAL_LM",
target_modules="all-linear",
)

Training Configuration (SFTConfig)

Define the arguments for the SFTTrainer. This includes hyperparameters like learning rate, batch size, number of epochs, saving strategy, and optimization settings.

training_arguments = SFTConfig(
    output_dir="logs",
    num_train_epochs=4,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    optim="adamw_torch_fused",
    save_steps=112,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=(compute_dtype == torch.float16),
    bf16=(compute_dtype == torch.bfloat16),
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=False,
    lr_scheduler_type="constant",
    report_to="tensorboard",
    evaluation_strategy="steps",
    eval_steps=112,
    load_best_model_at_end=True,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    packing=False,
    dataset_kwargs={
        "add_special_tokens": False,
        "append_concat_token": False,
    },
)

Initialize SFTTrainer

Create the trainer instance, passing the model, datasets, configurations, and tokenizer.

model.config.use_cache = False
model.config.pretraining_tp = 1

trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=eval_data,
    peft_config=peft_config,
    tokenizer=tokenizer,
    args=training_arguments,
)
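To see how small the trainable footprint actually is, you can ask the trainer's wrapped PEFT model for a parameter summary (an optional check; SFTTrainer has already applied peft_config at this point):

# SFTTrainer wraps the base model in a PeftModel when peft_config is provided,
# so it can report how many parameters the LoRA adapters add.
trainer.model.print_trainable_parameters()
# Prints something like: trainable params: ... || all params: ... || trainable%: ...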

Note: The dataset-processing logs in the notebook output ("Converting train dataset to ChatML", "Applying chat template", "Tokenizing", "Truncating") are standard steps performed by SFTTrainer based on the model type and configuration; they indicate the data is being prepared correctly for the model.

Executing the Fine-Tuning

Now, we start the training process using the configured trainer.

print("Starting fine-tuning...")train_result = trainer.train()

print("Fine-tuning finished.")

metrics = train_result.metrics

print("Training Metrics:", metrics)

Starting fine-tuning...
# Note: Example output from the notebook execution - actual times/loss will vary
# [448/448 30:42, Epoch 3/4]
# Step   Training Loss   Validation Loss
# 112    1.155600        1.248115
# 224    0.897200        1.208389
# 336    0.677600        1.255559
# 448    0.455800        1.451596
Fine-tuning finished.
Training Metrics: {'train_runtime': 1847.1768, 'train_samples_per_second': 1.949, 'train_steps_per_second': 0.243, 'total_flos': 1363805828158464.0, 'train_loss': 0.8318860807589122, 'epoch': 4.0}

The logs show the training loss decreasing steadily, while the validation loss bottoms out around step 224 and then rises, a sign of overfitting in the later epochs. Because load_best_model_at_end=True, the checkpoint with the lowest validation loss is restored once training finishes.

Saving the Fine-Tuned LoRA Adapter

After training, we save the trained LoRA adapter weights (not the entire model) and the tokenizer. This is efficient as the adapter is much smaller than the base model.

lora_directory = "LoRA-Gemma3-1B-Financial-Sentiment"
trainer.model.save_pretrained(lora_directory)

print(f"LoRA adapter saved to {lora_directory}")

trainer.tokenizer.save_pretrained(lora_directory)

print(f"Tokenizer saved to {lora_directory}")

Monitoring with TensorBoard (Optional)

You can use TensorBoard to visualize the training metrics logged during the process. If running in an environment like Google Colab or a local Jupyter setup:

%load_ext tensorboard

%tensorboard --logdir logs/runs

This launches an interactive TensorBoard interface (it usually needs to run in its own cell and may require environment-specific setup).

Evaluation After Fine-Tuning

Let’s evaluate the performance of our fine-tuned model on the test set. The trainer object now holds the best model checkpoint loaded during training (load_best_model_at_end=True).

print("Predicting with fine-tuned model...")

y_pred_tuned = predict(X_test, trainer.model, tokenizer)

print("\n--- Fine-Tuned Model Evaluation ---")

evaluate(y_true, y_pred_tuned)

--- Fine-Tuned Model Evaluation ---
Overall Accuracy: 0.853
Accuracy for label 0 (negative): 0.977
Accuracy for label 1 (neutral): 0.753
Accuracy for label 2 (positive): 0.830

Classification Report:
               precision    recall  f1-score   support

    negative       0.91      0.98      0.94       300
     neutral       0.82      0.75      0.78       300
    positive       0.83      0.83      0.83       300

    accuracy                           0.85       900
   macro avg       0.85      0.85      0.85       900
weighted avg       0.85      0.85      0.85       900

Confusion Matrix (Rows: True, Cols: Pred) [Neg, Neu, Pos]:
 [[293   5   2]
 [ 24 226  50]
 [  5  46 249]]

Post-Fine-Tuning Analysis:

  • Significant Improvement: Overall accuracy jumped from 55.0% to 85.3%!
  • Balanced Performance: The model now performs well across all classes, with F1-scores around 0.94 (negative), 0.78 (neutral), and 0.83 (positive). The macro/weighted averages are strong at 0.85.
  • Negative Class: Excellent performance with 97.7% accuracy and high precision/recall.
  • Neutral Class: Dramatically improved from the baseline, reaching 75.3% accuracy and a 0.78 F1-score. It still misclassifies some neutral cases (mostly as positive), but far fewer than before.
  • Positive Class: Solid performance with 83.0% accuracy and balanced precision/recall.

Comparison with Gemma 7B-IT

As noted in the original notebook, previous experiments with a quantized Gemma 7B-IT model (see the original notebook/article for that comparison) suggest the larger model may achieve slightly higher overall accuracy and better performance on the neutral class. However, the Gemma 3 1B-IT model, despite having roughly 7x fewer parameters, delivers remarkably strong results after fine-tuning, not far from those obtained with Gemma 7B-IT.

Gemma 3 1B-IT offers an excellent balance between performance and resource efficiency, making it a highly viable option for sentiment analysis, especially when computational resources are limited or faster inference is needed.

Saving Results and Merged Model

Finally, let’s save the predictions for analysis and create a final, merged model containing both the base Gemma weights and the fine-tuned LoRA adapter.

Save Predictions

Store the test texts, true labels, and predicted labels in a CSV file for detailed error analysis.

evaluation = pd.DataFrame({"text": X_test["text"],
                           "y_true": y_true,
                           "y_pred": y_pred_tuned})

output_predictions_file = "test_predictions_gemma3_1b_tuned.csv"

evaluation.to_csv(output_predictions_file, index=False)
print(f"Test predictions saved to {output_predictions_file}")

Merge LoRA Adapter and Save Full Model

Load the base model again, merge the trained LoRA weights into it, and save the resulting standalone fine-tuned model. This merged model can be loaded and used directly without needing the peft library for inference.

print("Reloading base model...")

base_model = AutoModelForCausalLM.from_pretrained( GEMMA_PATH, torch_dtype=compute_dtype,

low_cpu_mem_usage=True,

device_map='auto' )

print("Loading PeftModel and merging...")

peft_model = PeftModel.from_pretrained(base_model, lora_directory)merged_model = peft_model.merge_and_unload()

print("Merging complete.")

merged_model_directory = "merged-Gemma3-1B-Financial-Sentiment"

print(f"Saving merged model to {merged_model_directory}...")merged_model.save_pretrained(merged_model_directory,

safe_serialization=True,

max_shard_size="2GB")
print("Merged model saved.")

print(f"Saving tokenizer to {merged_model_directory}...")

tokenizer_for_merged = AutoTokenizer.from_pretrained(lora_directory)tokenizer_for_merged.save_pretrained(merged_model_directory)

print("Tokenizer saved.")

Conclusion

This tutorial demonstrated the process of fine-tuning the Gemma 3 1B-IT model for financial sentiment analysis using the FinancialPhraseBank dataset and PEFT/LoRA. We observed a dramatic improvement in performance after fine-tuning, transforming the model from a poor baseline performer into a highly effective sentiment classifier for this specific task, achieving over 85% accuracy.

The Gemma 3 1B-IT model proves to be a powerful and efficient option, offering strong performance even compared to larger models. This makes it suitable for applications where computational resources or inference speed are constraints. The resulting merged model is now ready for deployment or further experimentation.

We’ve reached the end of this fine-tuning guide. Hopefully, it provided valuable insights into adapting Gemma 3 1B-IT to specific tasks like sentiment analysis. Thanks for joining, and keep exploring the possibilities of small language models (SLMs) like Gemma!

Enjoy building with AI!

#Gemma3 #Gemmaverse #AI #LLM #FineTuning #SentimentAnalysis #NLP
