LLM Fine-Tuning: From Pretrained to On-device
1. LLM Fine-Tuning: From Pretrained to On-device
Lee Junbum (Beomi)
jun@beomi.net
2024-11-27 WaveHill Meetup - LLM Fine-Tuning
1
2. Lee Junbum (Beomi)
Korean Open Language Model
Researcher
KcBERT, KoAlpaca, Llama-Ko
AI/ML GDE
MLE @ Channel Corporation
2
3. Agenda
1. Introduction to LLMs and Recent Trends
2. Understanding Pre-training and Post-training
3. Supervised Fine-Tuning (SFT)
4. Reinforcement Learning from Human Feedback (RLHF)
5. Parameter-Efficient Fine-Tuning (PEFT) with LoRA
6. Quantization Techniques (QLoRA)
7. Retrieval-Augmented Fine-Tuning (RAFT)
8. Hands-on Tutorials with Google Colab
9. Model Conversion and Inference with Llama.cpp
10. Q&A Session
3
4. 1. Introduction to LLMs and Recent Trends
4
5. What are Large Language Models (LLMs)?
Definition: LLMs are deep learning models with billions of parameters trained on vast
amounts of text data.
Capabilities:
Natural language understanding
Text generation
Translation
Summarization
5
6. Recent Advancements in LLMs
OpenAI GPT-4o, o1:
Improved reasoning and understanding
Multimodal capabilities
Google's Gemini:
Combines a long context window (up to 2M tokens) with strong language understanding
Anthropic's Claude:
Focuses on safe and responsible AI
Open-Source LLMs:
Rapid growth in community-driven models
e.g., Meta Llama, Google Gemma, Alibaba Qwen, ...
6
7. 7
8. Why Fine-Tune LLMs?
Customization: Tailor models to specific domains or tasks.
Performance: Enhance accuracy on specialized datasets.
Efficiency: Reduce inference time and computational resources.
Control: Implement safety measures and bias mitigation.
8
9. 9
10. 10
11. 11
12. 2. Understanding Pre-training and Post-training
12
13. Pre-training
Definition: Training a model on large-scale datasets to learn general language
patterns.
Characteristics:
Unsupervised learning
Massive datasets (Common Crawl, Wikipedia)
Foundation for downstream tasks
13
14. Post-training
Definition: Further training of a pre-trained model to improve performance on
specific tasks.
Includes:
Supervised Fine-Tuning (SFT)
Reinforcement Learning from Human Feedback (RLHF)
14
15. SFT vs RLHF
SFT:
Uses labeled datasets
Directly adjusts model weights based on supervised signals
RLHF:
Incorporates human preferences
Uses reinforcement learning algorithms
15
16. 3. Supervised Fine-Tuning (SFT)
16
17. What is SFT?
Process: Training a pre-trained model on task-specific labeled data.
Objective: Align the model's outputs with desired responses.
17
18. Steps in SFT
1. Data Collection:
Curate a dataset relevant to the target task.
2. Data Preprocessing:
Clean and tokenize data.
3. Fine-Tuning:
Adjust model weights using supervised learning.
4. Evaluation:
Assess performance on validation data.
18
19. SFT Dataset Format
Input:
Prompt / Context
Output:
Response
Example
{
    "instruction": "What is the capital of Korea?",
    "output": "The capital of Korea is Seoul."
}
SFT Datasets
Alpaca
KoAlpaca / KoAlpaca-RealQA
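To inspect such a dataset, a minimal sketch (column names are assumed to match the example record above):
from datasets import load_dataset

ds = load_dataset("beomi/KoAlpaca-RealQA", split="train")
print(ds[0]["instruction"], "->", ds[0]["output"])  # column names assumed: instruction / output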
19
20. Alpaca Dataset
20
21. KoAlpaca Dataset
21
22. KoAlpaca-RealQA Dataset
22
23. 23
24. SFT with Hugging Face
Hugging Face:
Hugging Face Hub
Transformers Library
Install:
pip install transformers datasets trl
24
25. Example: SFT Solar-Ko with KoAlpaca-RealQA
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from trl import SFTTrainer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained('beomi/Solar-Ko-Recovery-11B')
model = AutoModelForCausalLM.from_pretrained('beomi/Solar-Ko-Recovery-11B')

train_dataset = load_dataset('beomi/KoAlpaca-RealQA')

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_total_limit=1,
    logging_strategy='steps',
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset['train'],
    tokenizer=tokenizer,
)
trainer.train()
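Note: this dataset stores instruction/output columns, so depending on your TRL version you may need to tell SFTTrainer how to render each record into training text via formatting_func. A minimal sketch; the "### Instruction / ### Response" template is an assumption, match it to your model:
# Hedged sketch: render instruction/output records into training text for SFTTrainer.
def formatting_prompts_func(examples):
    texts = []
    for instruction, output in zip(examples["instruction"], examples["output"]):
        texts.append(
            f"### Instruction:\n{instruction}\n\n### Response:\n{output}{tokenizer.eos_token}"
        )
    return texts

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset['train'],
    tokenizer=tokenizer,
    formatting_func=formatting_prompts_func,  # tells SFTTrainer how to build each example
)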
25
26. Line-by-Line
Load Model and Tokenizer
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained('beomi/Solar-Ko-Recovery-11B')
model = AutoModelForCausalLM.from_pretrained('beomi/Solar-Ko-Recovery-11B')
Load Dataset
from datasets import load_dataset
train_dataset = load_dataset('beomi/KoAlpaca-RealQA')
26
27. Training Arguments
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_total_limit=1,
    logging_strategy='steps',
)
Initialize SFT Trainer & Train
from trl import SFTTrainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset['train'],
    tokenizer=tokenizer,
)
trainer.train()
27
28. Inference Examples
### Instruction:
안녕하세요 (Hello)
### Response:
안녕하세요! 어떻게 도와드릴까요?</s> (Hello! How can I help you?)

### Instruction:
아래 글을 한국어로 번역해줘. (Translate the text below into Korean.)
The KoAlpaca-RealQA dataset is a unique Korean instruction dataset
designed to closely reflect real user interactions in the Korean language.
### Response:
KoAlpaca-RealQA 데이터셋은 한국어 사용자들의 실제 상호작용을 매우 잘 반영하도록 설계된 독특한 한국어 지시 데이터셋입니다.</s>
28
29. Benefits of SFT
Task Specialization: Model becomes adept at specific tasks.
Data Efficiency: Requires less data than training from scratch.
Improved Performance: Higher accuracy on target tasks.
29
30. 4. Reinforcement Learning from Human Feedback
30
31. What is RLHF?
Definition: An approach that uses human preferences to fine-tune models via
reinforcement learning.
Goal: Align model outputs with human values and expectations.
31
32. Popular RLHF Methods
Proximal Policy Optimization (PPO)
Direct Preference Optimization (DPO)
+α (and more):
ORPO
Online DPO
KTO
...
32
33. Proximal Policy Optimization (PPO)
Algorithm: Optimizes a clipped surrogate objective that keeps each policy update close to the previous policy.
Use Case: Adjusts the policy network to produce desired outputs.
Used for training OpenAI's GPT-4o, etc.
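For reference, the clipped surrogate objective at the heart of PPO (standard formulation, not specific to any one implementation):
\[
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
\]
where \(\hat{A}_t\) is an advantage estimate and \(\epsilon\) is the clipping range (commonly 0.1-0.2).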
33
34. 34
35. Reward Model
35
36. Reward Model Training
Implicit prompt preference dataset
Chosen Samples
Rejected Samples
score_chosen
score_rejected
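A preference record typically looks like the following sketch; the field names are illustrative, actual column names depend on the dataset:
# Illustrative pairwise preference record (field names assumed).
preference_example = {
    "prompt": "Explain why the sky is blue.",
    "chosen": "The sky looks blue because shorter (blue) wavelengths scatter more strongly ...",
    "rejected": "Because it reflects the ocean.",
    # Some datasets also carry scalar scores alongside the pair:
    "score_chosen": 8.5,
    "score_rejected": 2.0,
}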
36
37. Reward Trainer w/ TRL
from peft import LoraConfig, TaskType
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer, RewardConfig

model = AutoModelForSequenceClassification.from_pretrained("gpt2")
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
)
# ...
trainer = RewardTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
37
38. Direct Preference Optimization (DPO)
Concept: Simplifies RLHF by directly optimizing preferences without the need for
reward models.
Advantage: Reduces complexity and training time.
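A minimal offline-DPO sketch with TRL, assuming a preference dataset with prompt/chosen/rejected columns; the dataset name and hyperparameters are placeholders:
# Minimal DPO sketch; dataset and hyperparameters are placeholders, adapt to your setup.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")  # example preference dataset

training_args = DPOConfig(
    output_dir="dpo-out",
    beta=0.1,  # strength of the KL-like preference regularization
    per_device_train_batch_size=1,
)
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()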
38
39. 39
40. Online DPO with Custom Judge w/ TRL
40
41. Why Online DPO?
No need to train a separate reward model.
Prompts & an LLM judge are all you need.
41
42. 42
43. Setting Up Online DPO
1. Install TRL
pip install trl
2. Define the Custom Judge
def custom_judge(response):
    # Implement custom logic to evaluate the response
    # and return a scalar reward (placeholder shown below).
    reward = 0.0
    return reward
43
44. 3. Initialize the Model and Optimizer
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3.1-8B-Instruct')
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-3.1-8B-Instruct')
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
44
45. 4. Make a Judge Function
from trl import OpenAIPairwiseJudge
judge = OpenAIPairwiseJudge(
    model="gpt-4o",  # parameter name per current TRL; adjust if your version differs
    system_prompt="...",
)
Default system prompt:
https://github.com/huggingface/trl/blob/b80c1a6/trl/trainer/judges.py#L35-L61
45
46. 46
47. Judge Example
47
48. 5. Training Loop with DPO
training_args = OnlineDPOConfig(
    output_dir="aya-expanse-8b-OnlineDPO",
    logging_steps=1,
    # max_steps=100,
    report_to='tensorboard',
    bf16=True,
    per_device_train_batch_size=1,
    gradient_checkpointing="unsloth",
    max_new_tokens=2000,
)
trainer = OnlineDPOTrainer(
    model=model,
    judge=judge,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()
48
49. Benefits of RLHF
Alignment: Ensures model outputs align with human values.
Safety: Reduces harmful or biased outputs.
Quality Improvement: Enhances the usefulness of generated content.
49
50. 5. PEFT: Parameter-Efficient Fine-Tuning
50
51. What is PEFT?
Definition: Techniques that fine-tune a model by updating only a small fraction of its parameters, reducing
computational resources.
Hugging Face: PEFT Library
Supported Methods: LoRA, QLoRA, etc.
51
52. Introduction to LoRA
LoRA (Low-Rank Adaptation):
Decomposes weight updates into low-rank matrices.
Keeps original weights frozen.
Efficient and memory-saving.
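A quick back-of-the-envelope example (layer size chosen for illustration only):
# Trainable parameters for adapting one 4096x4096 linear layer (illustrative numbers).
d, k, r = 4096, 4096, 8
full_update = d * k        # 16,777,216 params if the weight matrix were updated directly
lora_update = r * (d + k)  # 65,536 params with LoRA (B: d x r, A: r x k)
print(f"LoRA trains {lora_update / full_update:.2%} of the full update")  # ~0.39%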
52
53. Use LoRA via PEFT
1. Install PEFT Library
pip install peft
2. Load Pre-trained Model
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, LoraConfig
model = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct')
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct')
53
54. 3. Configure LoRA
config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules='all-linear',  # or ['q_proj', 'v_proj']
    lora_dropout=0.1,
    bias='none',
)
model = get_peft_model(model, config)
4. Same training loop as SFT/RLHF
54
55. Advantages of LoRA
Efficiency: Reduces GPU memory usage.
Speed: Faster training times.
Scalability: Easier to fine-tune very large models.
55
56. 6. Quantized LoRA (QLoRA)
56
57. Why Quantize Models?
Reduce Memory Footprint: Smaller models consume less memory.
Increase Inference Speed: Quantized models often run faster.
Deploy on Edge Devices: Makes deployment on resource-constrained devices
feasible.
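Rough memory arithmetic for the weights alone (activations, KV cache, and optimizer state are extra):
# Approximate weight memory for an 8B-parameter model at different precisions.
params = 8e9
for name, bytes_per_param in [("fp16/bf16", 2), ("int8", 1), ("4-bit", 0.5)]:
    print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB")
# fp16/bf16: ~16 GB, int8: ~8 GB, 4-bit: ~4 GB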
57
58. Introduction to QLoRA
QLoRA: Combines quantization with LoRA to enable fine-tuning large models on a
single GPU.
58
59. Implementing QLoRA
1. Install Necessary Libraries
pip install bitsandbytes
59
60. 2. Load 4-bit Quantized Model
You can use 4-bit or 8-bit quantized models:
Use load_in_4bit=True for 4-bit quantization.
Use load_in_8bit=True for 8-bit quantization.
from transformers import AutoTokenizer, AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Meta-Llama-3.1-8B-Instruct',
    load_in_4bit=True,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct')
60
61. 3. Apply LoRA
from peft import get_peft_model, LoraConfig
config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules='all-linear',  # or ['q_proj', 'v_proj']
    lora_dropout=0.1,
    bias='none',
    task_type='CAUSAL_LM',
)
model = get_peft_model(model, config)
4. Same training loop as SFT/RLHF
61
62. Benefits of QLoRA
Resource Efficiency: Fine-tune 70B+ parameter models on a single GPU.
Performance: Minimal loss in model accuracy.
Cost-Effective: Reduces the need for expensive hardware.
62
63. 7. Retrieval-Augmented Fine-Tuning (RAFT)
63
64. What is RAFT?
Definition: A technique that combines retrieval mechanisms with fine-tuning to
enhance model performance on specific knowledge domains.
64
65. Why Use RAFT?
Domain-Specific Knowledge: Incorporate up-to-date or specialized information.
Improved Accuracy: Provides relevant context that the base model may lack.
Dynamic Updating: Easily update the retrieval database without retraining the model.
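A RAFT-style training record pairs each question with an oracle (answer-bearing) document and several distractor documents, plus a reference answer; a sketch with assumed field names:
# Illustrative RAFT training record; field names are assumptions, not a fixed schema.
raft_example = {
    "question": "What is the recommended blood pressure target for elderly patients?",
    "oracle_context": "... guideline passage that actually contains the answer ...",
    "distractor_contexts": ["... unrelated passage 1 ...", "... unrelated passage 2 ..."],
    "cot_answer": "Reasoning over the guideline passage ... Final answer: ...",
}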
65
66. SFT vs RAG vs RAFT
66
67. How to Implement RAFT?
67
68. RAFT Results
68
69. RAFT Results (Korean Example)
DSF = Domain-Specific Fine-Tuning
https://huggingface.co/devlim/Korea-HealthCare-RAFT-float16
69
70. RAFT Dataset Example: Korean Wikipedia QA
https://huggingface.co/datasets/beomi/kowikitext-qa-ref-detail-preview?row=0
70
71. 8-1. SFT/RAFT Hands-on with Google Colab
71
72. What do we need?
Raw dataset: Q/A pairs + context (positive and negative samples)
An LLM API for creating the Q/A pairs
A retrieval API for creating the context (not covered in this talk)
SFT Trainer from TRL
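One common way to bootstrap Q/A pairs is to prompt an LLM API over each context chunk; a minimal sketch, where the model name and prompt wording are assumptions:
# Sketch: generate a Q/A pair for one context chunk with an LLM API.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def make_qa_pair(context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "Write one question answerable only from the given context, then its answer."},
            {"role": "user", "content": context},
        ],
    )
    return response.choices[0].message.content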
72
73. Setting Up the Environment
1. Open Google Colab
Colab URL: https://beomi.net/sk-2411/raft
2. Enable GPU Runtime
73
74. SFT with QLoRA
Step 1: Install Dependencies
!pip install -q -U transformers
!pip install -q datasets accelerate bitsandbytes lomo-optim hf_transfer trl
!pip install -q flash-attn --no-build-isolation
74
75. Step 2: Load a Quantized Model
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "beomi/Solar-Ko-Recovery-11B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # nn.Linear layers to 4-bit
    torch_dtype=torch.bfloat16,  # everything else in bfloat16
    attn_implementation="flash_attention_2",
)
75
76. Step 3: Apply LoRA
from peft import get_peft_model, LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules='all-linear',  # or ['q_proj', 'v_proj']
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM',
)
model = get_peft_model(model, config)
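To confirm how small the trainable footprint is after wrapping the model:
model.print_trainable_parameters()
# prints "trainable params: ... || all params: ... || trainable%: ..." (exact numbers depend on the model)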
76
77. Step 4: Prepare Dataset
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template="chatml",  # supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
    map_eos_token=True,  # maps <|im_end|> to </s> instead
)

def formatting_prompts_func(examples):
    convos = examples["messages"]
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in convos
    ]
    return {"text": texts}

from datasets import load_dataset

dataset = load_dataset("beomi/KoAlpaca-RealQA-oai", split="train")
dataset = dataset.map(formatting_prompts_func, batched=True)
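Quick sanity check of the rendered chat template:
print(dataset[0]["text"][:300])  # inspect the first formatted training example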
77
78. 78
79. 79
80. Step 5: Fine-Tune the Model
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

max_seq_length = 2048  # assumed value; set this to the context length you actually want

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,  # packing can make training ~5x faster for short sequences
    args=TrainingArguments(
        per_device_train_batch_size=32,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=3,  # set this for 1 full training run
        # max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",  # use "wandb", "tensorboard", etc. to log elsewhere
    ),
)
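With the trainer configured, training starts the same way as in the earlier examples:
trainer.train()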
80
81. 8-2. OnlineDPO Hands-on with Google Colab
81
82. OnlineDPO?
OnlineDPO: A method for fine-tuning LLMs
using DPO (Direct Preference Optimization) with Online AI Feedback!
82
83. 83
84. 84
85. Setting Up the Environment
1. Open Google Colab
Colab URL: https://beomi.net/sk-2411/onlinedpo
2. Enable GPU Runtime
85
86. 9. Model Conversion and Inference with Llama.cpp
86
87. What is Llama.cpp?
A lightweight C/C++ inference engine for running LLMs locally on CPUs and GPUs, built around the GGUF model format.
87
88. 88
89. Converting LoRA Model to GGUF Format
1. Convert the Model
LoRA Converter: https://huggingface.co/spaces/beomi/gguf-my-lora
*For the base model, use https://huggingface.co/spaces/ggml-org/gguf-my-repo
89
90. Inference with Llama.cpp on MacBook
1. Install Llama.cpp
https://github.com/ggerganov/llama.cpp/releases
2. Download the Model
https://huggingface.co/beomi/KoAlpaca-RealQA-Solar-Ko-Recovery-11B-LoRA-ChatML-F16-GGUF/tree/main
Download LoRA gguf file:
KoAlpaca-RealQA-Solar-Ko-Recovery-11B-LoRA-ChatML-f16.gguf
3. Load the Model
./llama-server \
--hf-repo beomi/Solar-Ko-Recovery-11B-Q8_0-GGUF \
--hf-file solar-ko-recovery-11b-q8_0.gguf \
-c 2048 --lora KoAlpaca-RealQA-Solar-Ko-Recovery-11B-LoRA-ChatML-f16.gguf
90
91. 91
92. 3. Generate Text
Go to http://localhost:8080
Tip: use "|" as a stop token, i.e. {"stop": ["|"]}
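llama-server also exposes an OpenAI-compatible endpoint, so you can query it from Python as well; a sketch (the model name is a placeholder, the local server serves whatever model it loaded):
# Query the local llama-server via its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")
response = client.chat.completions.create(
    model="local",  # placeholder; ignored by the local server
    messages=[{"role": "user", "content": "안녕하세요"}],
    stop=["|"],
)
print(response.choices[0].message.content)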
92
93. 93
94. 10. Q&A Session
94
95. Additional Resources
Transformers Documentation: huggingface.co/docs/transformers
TRL Library: github.com/huggingface/trl
PEFT Library: github.com/huggingface/peft
Llama.cpp: github.com/ggerganov/llama.cpp
95
96. Thank You!
Contact Information
Email: jun@beomi.net
GitHub: github.com/beomi
Hugging Face: huggingface.co/beomi
Feedback is Welcome
96
97. Glossary
LLM: Large Language Model
SFT: Supervised Fine-Tuning
RLHF: Reinforcement Learning from Human Feedback
PEFT: Parameter-Efficient Fine-Tuning
LoRA: Low-Rank Adaptation
QLoRA: Quantized LoRA
RAFT: Retrieval-Augmented Fine-Tuning
RAG: Retrieval-Augmented Generation
PPO: Proximal Policy Optimization
DPO: Direct Preference Optimization
97
98. </s>
98