This repository contains a robust pipeline for fine-tuning language models using advanced techniques like LoRA (Low-Rank Adaptation), QLoRA (Quantized LoRA), and custom tokenization. The project is designed to be modular, configurable, and production-ready.
- Custom Tokenizer Training: Implements BPE (Byte Pair Encoding) tokenizer training with configurable vocabulary size
- LoRA Fine-tuning: Efficient fine-tuning using Low-Rank Adaptation
- QLoRA Support: Quantized LoRA implementation for memory-efficient training (CUDA-only)
- Configurable Pipeline: YAML-based configuration for easy customization
- Robust Logging: Comprehensive logging system for monitoring training progress
- Memory Efficient: Optimized for GPU memory usage with gradient accumulation and mixed precision training
- Production Ready: Includes proper error handling, validation, and directory management
- Python 3.8+
- CUDA-capable GPU (required for QLoRA, recommended for LoRA)
- PyTorch
- Transformers
- Datasets
- PEFT
- Tokenizers
- PyYAML
- bitsandbytes (for QLoRA)
- Clone the repository:
git clone https://github.com/happybear-21/genie.git
cd genie
- Install dependencies:
pip install -r requirements.txt
.
├── config/
│ └── models.yaml # Configuration file
├── finetune.py # Standard LoRA fine-tuning script
├── finetune_qlora.py # QLoRA fine-tuning script
├── tokenizer.py # Custom tokenizer implementation
├── utils.py # Utility functions
└── README.md # This file
The project uses a YAML configuration file (config/models.yaml) to manage all parameters. Key configuration sections include:
- Data Configuration: Dataset paths and processing parameters
- Model Configuration: Model-specific parameters including LoRA settings
- Training Configuration: Training hyperparameters
- Output Configuration: Logging and model saving paths
The following models are currently implemented and configured in the pipeline:
- StarCoder2-3B
  - Base Model: bigcode/starcoder2-3b
  - Vocabulary Size: 50,000
  - Batch Size: 8
- CodeLlama-7B
  - Base Model: codellama/CodeLlama-7b-hf
  - Vocabulary Size: 32,000
  - Batch Size: 8
- WizardCoder-34B
  - Base Model: WizardLM/WizardCoder-Python-34B-V1.0
  - Vocabulary Size: 32,000
  - Batch Size: 4
- DeepSeek-Coder-6.7B
  - Base Model: deepseek-ai/deepseek-coder-6.7b-base
  - Vocabulary Size: 32,000
  - Batch Size: 8
- Codestral-22B
  - Base Model: mistralai/Codestral-22B-v0.1
  - Vocabulary Size: 32,000
  - Batch Size: 4
- WizardCoder-15B
  - Base Model: WizardLM/WizardCoder-15B-V1.0
  - Vocabulary Size: 32,000
  - Batch Size: 6
- DeepSeek-Coder-33B
  - Base Model: deepseek-ai/deepseek-coder-33b-base
  - Vocabulary Size: 32,000
  - Batch Size: 2
All models are configured with LoRA fine-tuning parameters optimized for their respective sizes, including appropriate rank, alpha, and dropout values.
The repository includes a sample dataset of Perl programming examples that demonstrates the required format. You can use this as a reference for preparing your own dataset.
Sample dataset structure:
dataset/processed/
├── train.jsonl
├── valid.jsonl
└── test.jsonl
Each JSONL file should contain entries in the following format:
{"prompt": "input text", "code": "target text"}
For the Perl programming dataset, the format is:
{
"prompt": "Write a Perl script to read a file and count word frequency",
"code": "#!/usr/bin/perl\nuse strict;\nuse warnings;\n\nmy %word_count;\nopen(my $fh, '<', 'input.txt') or die \"Cannot open file: $!\";\nwhile(my $line = <$fh>) {\n chomp($line);\n my @words = split(/\\s+/, $line);\n foreach my $word (@words) {\n $word_count{$word}++;\n }\n}\nclose($fh);\n\nforeach my $word (sort keys %word_count) {\n print \"$word: $word_count{$word}\\n\";\n}"
}
You can examine the sample dataset to understand:
- Input format: Natural language descriptions of programming tasks
- Output format: Complete, executable Perl code solutions
- Data organization: Training, validation, and test splits
To use your own dataset:
- Follow the same JSONL format
- Split your data into train/valid/test files
- Place them in your configured data directory
- Ensure your input/output pairs are properly formatted
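Before training, it can help to sanity-check that every record in each split parses as JSON and contains both fields. A minimal sketch (the check_split helper below is illustrative, not part of the repository):

```python
import json
from pathlib import Path

# Illustrative helper: verify that each line of a split file is valid JSON
# with non-empty "prompt" and "code" fields.
def check_split(path: Path) -> int:
    count = 0
    with path.open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue
            record = json.loads(line)
            assert record.get("prompt") and record.get("code"), \
                f"{path}:{lineno} is missing 'prompt' or 'code'"
            count += 1
    return count

data_dir = Path("dataset/processed")
for split in ("train", "valid", "test"):
    print(f"{split}: {check_split(data_dir / f'{split}.jsonl')} examples")
```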
Edit config/models.yaml to set your desired parameters:
data:
  processed_dir: "dataset/processed/"
  metadata_dir: "dataset/metadata/"
models:
  - name: "model_name"
    pretrained_model: "model/checkpoint"
    max_length: 512
    vocab_size: 32000
    lora_rank: 8
    lora_alpha: 32
    lora_dropout: 0.1
training:
  epochs: 3
  batch_size: 8
  learning_rate: 2e-4
  weight_decay: 0.01
  warmup_steps: 100
  gradient_accumulation_steps: 4
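Note that the effective batch size is batch_size multiplied by gradient_accumulation_steps (8 × 4 = 32 in the example above). A minimal sketch of reading this file with PyYAML, assuming the layout shown above (the actual loading code lives in utils.py and may differ):

```python
import yaml

# Load the pipeline configuration (path used throughout this README).
with open("config/models.yaml", encoding="utf-8") as fh:
    config = yaml.safe_load(fh)

training = config["training"]
effective_batch = training["batch_size"] * training["gradient_accumulation_steps"]
print(f"Effective batch size: {effective_batch}")  # 8 * 4 = 32 for the example above

for model in config["models"]:
    print(model["name"], model["pretrained_model"], model["lora_rank"])
```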
- Train the tokenizer:
python tokenizer.py
- Choose your fine-tuning approach:
For standard LoRA fine-tuning:
python finetune.py
For QLoRA fine-tuning (CUDA-only):
python finetune_qlora.py
Note: QLoRA requires CUDA support. If CUDA is not available, the script will fall back to standard CPU training.
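A minimal sketch of the kind of device check and 4-bit loading that QLoRA relies on; the BitsAndBytesConfig values shown are common defaults and not necessarily the exact settings used in finetune_qlora.py:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "bigcode/starcoder2-3b"  # any base model from the list above

if torch.cuda.is_available():
    # Load the base model in 4-bit precision for QLoRA (CUDA only).
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name, quantization_config=bnb_config, device_map="auto"
    )
else:
    # Fall back to an unquantized model on CPU, as the script does.
    model = AutoModelForCausalLM.from_pretrained(model_name)
```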
The custom tokenizer (tokenizer.py) implements:
- BPE tokenization with configurable vocabulary size
- Special tokens handling ([PAD], [BOS], [EOS], [UNK])
- Efficient batch processing
- Vocabulary statistics generation
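A minimal sketch of BPE training with the Hugging Face Tokenizers library, using the special tokens listed above (tokenizer.py may structure this differently, and the corpus path is illustrative):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Build a byte-level BPE tokenizer with the pipeline's special tokens.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32000,  # configurable via vocab_size in config/models.yaml
    special_tokens=["[PAD]", "[BOS]", "[EOS]", "[UNK]"],
)

# Train on plain-text corpus files; the path is illustrative, the real
# script reads the processed dataset configured in models.yaml.
tokenizer.train(["dataset/processed/corpus.txt"], trainer)
tokenizer.save("tokenizer.json")
```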
The fine-tuning scripts implement two approaches:
- Standard LoRA (finetune.py; a short sketch follows this list):
  - LoRA adaptation for efficient fine-tuning
  - Mixed precision training (FP16) when available
  - Gradient accumulation for handling large batches
  - TensorBoard integration for monitoring
  - Automatic model checkpointing
- QLoRA (finetune_qlora.py):
  - Quantized LoRA implementation for extreme memory efficiency
  - 4-bit quantization of the base model
  - CUDA-only optimization
  - Automatic fallback to CPU training if CUDA is unavailable
  - Same monitoring and checkpointing features as standard LoRA
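As a rough illustration of the standard LoRA path, wrapping a base model in PEFT adapters might look like the following. The rank, alpha, and dropout values mirror the configuration example above; the target modules are architecture-dependent and may differ from what finetune.py actually uses:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Wrap a base model with LoRA adapters via PEFT.
base = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # architecture-dependent choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```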
The utilities module (utils.py) provides:
- YAML configuration loading
- Logging setup
- Directory validation and creation
- Dataset statistics computation
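A minimal sketch of what the logging and directory helpers might look like (function names here are illustrative; see utils.py for the actual implementations):

```python
import logging
from pathlib import Path

def setup_logging(log_dir: str) -> logging.Logger:
    """Write training logs both to the console and to a file in log_dir."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
        handlers=[
            logging.StreamHandler(),
            logging.FileHandler(Path(log_dir) / "training.log"),
        ],
    )
    return logging.getLogger("finetune")

def ensure_dirs(*paths: str) -> None:
    """Create any output directories that do not exist yet."""
    for path in paths:
        Path(path).mkdir(parents=True, exist_ok=True)
```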
Training progress can be monitored through:
- Log files in the specified log directory
- TensorBoard metrics
- Model checkpoints saved at regular intervals
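For reference, TensorBoard scalars are written roughly like this; the training scripts handle this internally, and the log directory and loss values below are placeholders:

```python
from torch.utils.tensorboard import SummaryWriter

# Illustrative only: how a training loss curve ends up in TensorBoard.
writer = SummaryWriter(log_dir="logs/tensorboard")  # assumed log directory
for step, loss in enumerate([2.1, 1.8, 1.5]):       # dummy loss values
    writer.add_scalar("train/loss", loss, global_step=step)
writer.close()
```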
- Always validate your configuration before training
- Monitor GPU memory usage during training
- Keep track of model checkpoints
- Use appropriate batch sizes for your GPU
- Evaluate model performance regularly
- For QLoRA, ensure CUDA is available for optimal performance
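For the GPU memory point, a quick way to check current and peak usage from Python (CUDA only; a minimal sketch):

```python
import torch

if torch.cuda.is_available():
    # Report current and peak GPU memory usage in GiB.
    allocated = torch.cuda.memory_allocated() / 1024**3
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"allocated: {allocated:.2f} GiB, peak: {peak:.2f} GiB")
```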
This project includes comprehensive documentation to help you get started and contribute:
- README.md - Project overview and setup instructions
- CONTRIBUTING.md - Guidelines for contributing to the project
- LICENSE - Software license terms
Please read the Contributing Guidelines before submitting pull requests.
- Hugging Face Transformers library
- PEFT library for LoRA implementation
- Tokenizers library for efficient tokenization
- bitsandbytes library for quantization support