LLM Fine-Tuning: From Pretrained to On-device
1. LLM Fine-Tuning: From Pretrained to On-device
Lee Junbum (Beomi)
jun@beomi.net
2024-11-27 WaveHill Meetup - LLM Fine-Tuning
1
2. Lee Junbum (Beomi)
Korean Open Language Model
Researcher
KcBERT, KoAlpaca, Llama-Ko
AI/ML GDE
MLE @ Channel Corporation
2
3. Agenda
1. Introduction to LLMs and Recent Trends
2. Understanding Pre-training and Post-training
3. Supervised Fine-Tuning (SFT)
4. Reinforcement Learning from Human Feedback (RLHF)
5. Parameter-Efficient Fine-Tuning (PEFT) with LoRA
6. Quantization Techniques (QLoRA)
7. Retrieval-Augmented Fine-Tuning (RAFT)
8. Hands-on Tutorials with Google Colab
9. Model Conversion and Inference with Llama.cpp
10. Q&A Session
3
4. 1. Introduction to LLMs and Recent Trends
4
5. What are Large Language Models (LLMs)?
Definition: LLMs are deep learning models with billions of parameters trained on vast
amounts of text data.
Capabilities:
Natural language understanding
Text generation
Translation
Summarization
5
6. Recent Advancements in LLMs
OpenAI GPT-4o, o1:
Improved reasoning and understanding
Multimodal capabilities
Google's Gemini:
Combines a long context window (up to 2M tokens) with strong language understanding
Anthropic's Claude:
Focuses on safe and responsible AI
Open-Source LLMs:
Rapid growth in community-driven models
e.g., Meta Llama, Google Gemma, Alibaba Qwen, ...
6
7. 7
8. Why Fine-Tune LLMs?
Customization: Tailor models to specific domains or tasks.
Performance: Enhance accuracy on specialized datasets.
Efficiency: Reduce inference time and computational resources.
Control: Implement safety measures and bias mitigation.
8
9. 9
10. 10
11. 11
12. 2. Understanding Pre-training and Post-training
12
13. Pre-training
Definition: Training a model on large-scale datasets to learn general language
patterns.
Characteristics:
Unsupervised learning
Massive datasets (Common Crawl, Wikipedia)
Foundation for downstream tasks
13
14. Post-training
Definition: Further training of a pre-trained model to improve performance on
specific tasks.
Includes:
Supervised Fine-Tuning (SFT)
Reinforcement Learning from Human Feedback (RLHF)
14
15. SFT vs RLHF
SFT:
Uses labeled datasets
Directly adjusts model weights based on supervised signals
RLHF:
Incorporates human preferences
Uses reinforcement learning algorithms
15
16. 3. Supervised Fine-Tuning (SFT)
16
17. What is SFT?
Process: Training a pre-trained model on task-specific labeled data.
Objective: Align the model's outputs with desired responses.
17
18. Steps in SFT
1. Data Collection:
Curate a dataset relevant to the target task.
2. Data Preprocessing:
Clean and tokenize data.
3. Fine-Tuning:
Adjust model weights using supervised learning.
4. Evaluation:
Assess performance on validation data.
18
19. SFT Dataset Format
Input:
Prompt / Context
Output:
Response
Example
{
    "instruction": "What is the capital of Korea?",
    "output": "The capital of Korea is Seoul."
}
SFT Datasets
Alpaca
KoAlpaca / KoAlpaca-RealQA
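To inspect such a dataset, a minimal sketch (column names are assumed to match the example record above):
from datasets import load_dataset

ds = load_dataset("beomi/KoAlpaca-RealQA", split="train")
print(ds[0]["instruction"], "->", ds[0]["output"])  # column names assumed: instruction / output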
19
20. Alpaca Dataset
20
21. KoAlpaca Dataset
21
22. KoAlpaca-RealQA Dataset
22
23. 23
24. SFT with Hugging Face
Hugging Face:
Hugging Face Hub
Transformers Library
Install:
pip install transformers datasets trl
24
25. Example: SFT Solar-Ko with KoAlpaca-RealQA
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from trl import SFTTrainer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained('beomi/Solar-Ko-Recovery-11B')
model = AutoModelForCausalLM.from_pretrained('beomi/Solar-Ko-Recovery-11B')

train_dataset = load_dataset('beomi/KoAlpaca-RealQA')

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_total_limit=1,
    logging_strategy='steps',
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset['train'],
    tokenizer=tokenizer,
)
trainer.train()
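Note: this dataset stores instruction/output columns, so depending on your TRL version you may need to tell SFTTrainer how to render each record into training text via formatting_func. A minimal sketch; the "### Instruction / ### Response" template is an assumption, match it to your model:
# Hedged sketch: render instruction/output records into training text for SFTTrainer.
def formatting_prompts_func(examples):
    texts = []
    for instruction, output in zip(examples["instruction"], examples["output"]):
        texts.append(
            f"### Instruction:\n{instruction}\n\n### Response:\n{output}{tokenizer.eos_token}"
        )
    return texts

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset['train'],
    tokenizer=tokenizer,
    formatting_func=formatting_prompts_func,  # tells SFTTrainer how to build each example
)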
25
26. Line-by-Line
Load Model and Tokenizer
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained('beomi/Solar-Ko-Recovery-11B')
model = AutoModelForCausalLM.from_pretrained('beomi/Solar-Ko-Recovery-11B')
Load Dataset
from datasets import load_dataset
train_dataset = load_dataset('beomi/KoAlpaca-RealQA')
26
27. Training Arguments
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_total_limit=1,
    logging_strategy='steps',
)
Initialize SFT Trainer & Train
from trl import SFTTrainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset['train'],
    tokenizer=tokenizer,
)
trainer.train()
27
28. Inference Examples
### Instruction:
안녕하세요 (Hello)
### Response:
안녕하세요! 어떻게 도와드릴까요?</s> (Hello! How can I help you?)

### Instruction:
아래 글을 한국어로 번역해줘. (Translate the text below into Korean.)
The KoAlpaca-RealQA dataset is a unique Korean instruction dataset
designed to closely reflect real user interactions in the Korean language.
### Response:
KoAlpaca-RealQA 데이터셋은 한국어 사용자들의 실제 상호작용을 매우 잘 반영하도록 설계된 독특한 한국어 지시 데이터셋입니다.</s>
28
29. Benefits of SFT
Task Specialization: Model becomes adept at specific tasks.
Data Efficiency: Requires less data than training from scratch.
Improved Performance: Higher accuracy on target tasks.
29
30. 4. Reinforcement Learning from Human Feedback
30
31. What is RLHF?
Definition: An approach that uses human preferences to fine-tune models via
reinforcement learning.
Goal: Align model outputs with human values and expectations.
31
32. Popular RLHF Methods
Proximal Policy Optimization (PPO)
Direct Preference Optimization (DPO)
+α (and more):
ORPO
Online DPO
KTO
...
32
33. Proximal Policy Optimization (PPO)
Algorithm: Optimizes a clipped surrogate objective that keeps each policy update close to the previous policy.
Use Case: Adjusts the policy network to produce desired outputs.
Used for training OpenAI's GPT-4o, etc.
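For reference, the clipped surrogate objective at the heart of PPO (standard formulation, not specific to any one implementation):
\[
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
\]
where \(\hat{A}_t\) is an advantage estimate and \(\epsilon\) is the clipping range (commonly 0.1-0.2).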
33
34. 34
35. Reward Model
35
36. Reward Model Training
Implicit prompt preference dataset
Chosen Samples
Rejected Samples
score_chosen
score_rejected
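A preference record typically looks like the following sketch; the field names are illustrative, actual column names depend on the dataset:
# Illustrative pairwise preference record (field names assumed).
preference_example = {
    "prompt": "Explain why the sky is blue.",
    "chosen": "The sky looks blue because shorter (blue) wavelengths scatter more strongly ...",
    "rejected": "Because it reflects the ocean.",
    # Some datasets also carry scalar scores alongside the pair:
    "score_chosen": 8.5,
    "score_rejected": 2.0,
}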
36
37. Reward Trainer w/ TRL
from peft import LoraConfig, TaskType
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer, RewardConfig

model = AutoModelForSequenceClassification.from_pretrained("gpt2")
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
)
# ...
trainer = RewardTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
37
38. Direct Preference Optimization (DPO)
Concept: Simplifies RLHF by directly optimizing preferences without the need for
reward models.
Advantage: Reduces complexity and training time.
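A minimal offline-DPO sketch with TRL, assuming a preference dataset with prompt/chosen/rejected columns; the dataset name and hyperparameters are placeholders:
# Minimal DPO sketch; dataset and hyperparameters are placeholders, adapt to your setup.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")  # example preference dataset

training_args = DPOConfig(
    output_dir="dpo-out",
    beta=0.1,  # strength of the KL-like preference regularization
    per_device_train_batch_size=1,
)
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()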
38
39. 39
40. Online DPO with Custom Judge w/ TRL
40
41. Why Online DPO?
No need to train a separate reward model.
Prompts & an LLM judge are all you need.
41
42. 42
43. Setting Up Online DPO
1. Install TRL
pip install trl
2. Define the Custom Judge
def custom_judge(response):
    # Implement custom logic to evaluate the response
    # and return a scalar reward (placeholder shown below).
    reward = 0.0
    return reward
43
44. 3. Initialize the Model and Optimizer
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3.1-8B-Instruct')
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-3.1-8B-Instruct')
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
44
45. 4. Make a Judge Function
from trl import OpenAIPairwiseJudge
judge = OpenAIPairwiseJudge(
    model="gpt-4o",  # parameter name per current TRL; adjust if your version differs
    system_prompt="...",
)
Default system prompt:
https://github.com/huggingface/trl/blob/b80c1a6/trl/trainer/judges.py#L35-L61
45
46. 46
47. Judge Example
47
48. 5. Training Loop with DPO
training_args = OnlineDPOConfig(
    output_dir="aya-expanse-8b-OnlineDPO",
    logging_steps=1,
    # max_steps=100,
    report_to='tensorboard',
    bf16=True,
    per_device_train_batch_size=1,
    gradient_checkpointing="unsloth",
    max_new_tokens=2000,
)
trainer = OnlineDPOTrainer(
    model=model,
    judge=judge,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()
48
49. Benefits of RLHF
Alignment: Ensures model outputs align with human values.
Safety: Reduces harmful or biased outputs.
Quality Improvement: Enhances the usefulness of generated content.
49
50. 5. PEFT: Parameter-Efficient Fine-Tuning
50
51. What is PEFT?
Definition: Techniques that fine-tune a model by updating only a small fraction of its parameters, reducing
computational resources.
Hugging Face: PEFT Library
Supported Methods: LoRA, QLoRA, etc.
51
52. Introduction to LoRA
LoRA (Low-Rank Adaptation):
Decomposes weight updates into low-rank matrices.
Keeps original weights frozen.
Efficient and memory-saving.
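A quick back-of-the-envelope example (layer size chosen for illustration only):
# Trainable parameters for adapting one 4096x4096 linear layer (illustrative numbers).
d, k, r = 4096, 4096, 8
full_update = d * k        # 16,777,216 params if the weight matrix were updated directly
lora_update = r * (d + k)  # 65,536 params with LoRA (B: d x r, A: r x k)
print(f"LoRA trains {lora_update / full_update:.2%} of the full update")  # ~0.39%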
52
53. Use LoRA via PEFT
1. Install PEFT Library
pip install peft
2. Load Pre-trained Model
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, LoraConfig
model = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct')
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct')
53
54. 3. Configure LoRA
config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules='all-linear',  # or ['q_proj', 'v_proj']
    lora_dropout=0.1,
    bias='none',
)
model = get_peft_model(model, config)
4. Same training loop as SFT/RLHF
54
55. Advantages of LoRA
Efficiency: Reduces GPU memory usage.
Speed: Faster training times.
Scalability: Easier to fine-tune very large models.
55
56. 6. Quantized LoRA (QLoRA)
56
57. Why Quantize Models?
Reduce Memory Footprint: Smaller models consume less memory.
Increase Inference Speed: Quantized models often run faster.
Deploy on Edge Devices: Makes deployment on resource-constrained devices
feasible.
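Rough memory arithmetic for the weights alone (activations, KV cache, and optimizer state are extra):
# Approximate weight memory for an 8B-parameter model at different precisions.
params = 8e9
for name, bytes_per_param in [("fp16/bf16", 2), ("int8", 1), ("4-bit", 0.5)]:
    print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB")
# fp16/bf16: ~16 GB, int8: ~8 GB, 4-bit: ~4 GB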
57
58. Introduction to QLoRA
QLoRA: Combines quantization with LoRA to enable fine-tuning large models on a
single GPU.
58
59. Implementing QLoRA
1. Install Necessary Libraries
pip install bitsandbytes
59
60. 2. Load 4-bit Quantized Model
You can use 4-bit or 8-bit quantized models:
Use load_in_4bit=True for 4-bit quantization.
Use load_in_8bit=True for 8-bit quantization.
from transformers import AutoTokenizer, AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Meta-Llama-3.1-8B-Instruct',
    load_in_4bit=True,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct')
60
61. 3. Apply LoRA
from peft import get_peft_model, LoraConfig
config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules='all-linear',  # or ['q_proj', 'v_proj']
    lora_dropout=0.1,
    bias='none',
    task_type='CAUSAL_LM',
)
model = get_peft_model(model, config)
4. Same training loop as SFT/RLHF
61
62. Benefits of QLoRA
Resource Efficiency: Fine-tune 70B+ parameter models on a single GPU.
Performance: Minimal loss in model accuracy.
Cost-Effective: Reduces the need for expensive hardware.
62
63. 7. Retrieval-Augmented Fine-Tuning (RAFT)
63
64. What is RAFT?
Definition: A technique that combines retrieval mechanisms with fine-tuning to
enhance model performance on specific knowledge domains.
64
65. Why Use RAFT?
Domain-Specific Knowledge: Incorporate up-to-date or specialized information.
Improved Accuracy: Provides relevant context that the base model may lack.
Dynamic Updating: Easily update the retrieval database without retraining the model.
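A RAFT-style training record pairs each question with an oracle (answer-bearing) document and several distractor documents, plus a reference answer; a sketch with assumed field names:
# Illustrative RAFT training record; field names are assumptions, not a fixed schema.
raft_example = {
    "question": "What is the recommended blood pressure target for elderly patients?",
    "oracle_context": "... guideline passage that actually contains the answer ...",
    "distractor_contexts": ["... unrelated passage 1 ...", "... unrelated passage 2 ..."],
    "cot_answer": "Reasoning over the guideline passage ... Final answer: ...",
}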
65
66. SFT vs RAG vs RAFT
66
67. How to Implement RAFT?
67
68. RAFT Results
68
69. RAFT Results (Korean Example)
DSF = Domain-Specific Fine-Tuning
https://huggingface.co/devlim/Korea-HealthCare-RAFT-float16
69
70. RAFT Dataset Example: Korean Wikipedia QA
https://huggingface.co/datasets/beomi/kowikitext-qa-ref-detail-preview?row=0
70
71. 8-1. SFT/RAFT Hands-on with Google Colab
71
72. What do we need?
Raw dataset: Q/A pairs + context (positive and negative samples)
An LLM API for creating the Q/A pairs
A retrieval API for creating the context (not covered in this talk)
SFT Trainer from TRL
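One common way to bootstrap Q/A pairs is to prompt an LLM API over each context chunk; a minimal sketch, where the model name and prompt wording are assumptions:
# Sketch: generate a Q/A pair for one context chunk with an LLM API.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def make_qa_pair(context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "Write one question answerable only from the given context, then its answer."},
            {"role": "user", "content": context},
        ],
    )
    return response.choices[0].message.content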
72
73. Setting Up the Environment
1. Open Google Colab
Colab URL: https://beomi.net/sk-2411/raft
2. Enable GPU Runtime
73
74. SFT with QLoRA
Step 1: Install Dependencies
!pip install -q -U transformers
!pip install -q datasets accelerate bitsandbytes lomo-optim hf_transfer trl
!pip install -q flash-attn --no-build-isolation
74
75. Step 2: Load a Quantized Model
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "beomi/Solar-Ko-Recovery-11B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # nn.Linear layers to 4-bit
    torch_dtype=torch.bfloat16,  # everything else in bfloat16
    attn_implementation="flash_attention_2",
)
75
76. Step 3: Apply LoRA
from peft import get_peft_model, LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules='all-linear',  # or ['q_proj', 'v_proj']
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM',
)
model = get_peft_model(model, config)
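To confirm how small the trainable footprint is after wrapping the model:
model.print_trainable_parameters()
# prints "trainable params: ... || all params: ... || trainable%: ..." (exact numbers depend on the model)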
76
77. Step 4: Prepare Dataset
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template="chatml",  # supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
    map_eos_token=True,  # maps <|im_end|> to </s> instead
)

def formatting_prompts_func(examples):
    convos = examples["messages"]
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in convos
    ]
    return {"text": texts}

from datasets import load_dataset

dataset = load_dataset("beomi/KoAlpaca-RealQA-oai", split="train")
dataset = dataset.map(formatting_prompts_func, batched=True)
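Quick sanity check of the rendered chat template:
print(dataset[0]["text"][:300])  # inspect the first formatted training example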
77
78. 78
79. 79
80. Step 5: Fine-Tune the Model
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

max_seq_length = 2048  # assumed value; set this to the context length you actually want

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,  # packing can make training ~5x faster for short sequences
    args=TrainingArguments(
        per_device_train_batch_size=32,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=3,  # set this for 1 full training run
        # max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",  # use "wandb", "tensorboard", etc. to log elsewhere
    ),
)
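With the trainer configured, training starts the same way as in the earlier examples:
trainer.train()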
80
81. 8-2. OnlineDPO Hands-on with Google Colab
81
82. OnlineDPO?
OnlineDPO: A method for fine-tuning LLMs
using DPO (Direct Preference Optimization) with Online AI Feedback!
82
83. 83
84. 84
85. Setting Up the Environment
1. Open Google Colab
Colab URL: https://beomi.net/sk-2411/onlinedpo
2. Enable GPU Runtime
85
86. 9. Model Conversion and Inference with Llama.cpp
86
87. What is Llama.cpp?
A lightweight C/C++ inference engine for running LLMs locally on CPUs and GPUs, built around the GGUF model format.
87
88. 88
89. Converting LoRA Model to GGUF Format
1. Convert the Model
LoRA Converter: https://huggingface.co/spaces/beomi/gguf-my-lora
*For the base model, use https://huggingface.co/spaces/ggml-org/gguf-my-repo
89
90. Inference with Llama.cpp on MacBook
1. Install Llama.cpp
https://github.com/ggerganov/llama.cpp/releases
2. Download the Model
https://huggingface.co/beomi/KoAlpaca-RealQA-Solar-Ko-Recovery-11B-LoRA-ChatML-F16-GGUF/tree/main
Download LoRA gguf file:
KoAlpaca-RealQA-Solar-Ko-Recovery-11B-LoRA-ChatML-f16.gguf
3. Load the Model
./llama-server \
--hf-repo beomi/Solar-Ko-Recovery-11B-Q8_0-GGUF \
--hf-file solar-ko-recovery-11b-q8_0.gguf \
-c 2048 --lora KoAlpaca-RealQA-Solar-Ko-Recovery-11B-LoRA-ChatML-f16.gguf
90
91. 91
92. 3. Generate Text
Go to http://localhost:8080
Tip: use "|" as a stop token, i.e. {"stop": ["|"]}
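llama-server also exposes an OpenAI-compatible endpoint, so you can query it from Python as well; a sketch (the model name is a placeholder, the local server serves whatever model it loaded):
# Query the local llama-server via its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")
response = client.chat.completions.create(
    model="local",  # placeholder; ignored by the local server
    messages=[{"role": "user", "content": "안녕하세요"}],
    stop=["|"],
)
print(response.choices[0].message.content)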
92
93. 93
94. 10. Q&A Session
94
95. Additional Resources
Transformers Documentation: huggingface.co/docs/transformers
TRL Library: github.com/huggingface/trl
PEFT Library: github.com/huggingface/peft
Llama.cpp: github.com/ggerganov/llama.cpp
95
96. Thank You!
Contact Information
Email: jun@beomi.net
GitHub: github.com/beomi
Hugging Face: huggingface.co/beomi
Feedback is Welcome
96
97. Glossary
LLM: Large Language Model
SFT: Supervised Fine-Tuning
RLHF: Reinforcement Learning from Human Feedback
PEFT: Parameter-Efficient Fine-Tuning
LoRA: Low-Rank Adaptation
QLoRA: Quantized LoRA
RAFT: Retrieval-Augmented Fine-Tuning
RAG: Retrieval-Augmented Generation
PPO: Proximal Policy Optimization
DPO: Direct Preference Optimization
97
98. </s>
98