This repository contains a robust pipeline for fine-tuning language models using advanced techniques like LoRA (Low-Rank Adaptation), QLoRA (Quantized LoRA), and custom tokenization. The project is designed to be modular, configurable, and production-ready.
- Custom Tokenizer Training: Implements BPE (Byte Pair Encoding) tokenizer training with configurable vocabulary size
- LoRA Fine-tuning: Efficient fine-tuning using Low-Rank Adaptation
- QLoRA Support: Quantized LoRA implementation for memory-efficient training (CUDA-only)
- Configurable Pipeline: YAML-based configuration for easy customization
- Robust Logging: Comprehensive logging system for monitoring training progress
- Memory Efficient: Optimized for GPU memory usage with gradient accumulation and mixed precision training
- Production Ready: Includes proper error handling, validation, and directory management
- Python 3.8+
- CUDA-capable GPU (required for QLoRA, recommended for LoRA)
- PyTorch
- Transformers
- Datasets
- PEFT
- Tokenizers
- PyYAML
- bitsandbytes (for QLoRA)
- Clone the repository:
git clone https://github.com/happybear-21/genie.git
cd genie
- Install dependencies:
pip install -r requirements.txt
.
├── config/
│ └── models.yaml # Configuration file
├── finetune.py # Standard LoRA fine-tuning script
├── finetune_qlora.py # QLoRA fine-tuning script
├── tokenizer.py # Custom tokenizer implementation
├── utils.py # Utility functions
└── README.md # This file
The project uses a YAML configuration file (config/models.yaml) to manage all parameters. Key configuration sections include:
- Data Configuration: Dataset paths and processing parameters
- Model Configuration: Model-specific parameters including LoRA settings
- Training Configuration: Training hyperparameters
- Output Configuration: Logging and model saving paths
The following models are currently implemented and configured in the pipeline:
- StarCoder2-3B
  - Base Model: bigcode/starcoder2-3b
  - Vocabulary Size: 50,000
  - Batch Size: 8
- CodeLlama-7B
  - Base Model: codellama/CodeLlama-7b-hf
  - Vocabulary Size: 32,000
  - Batch Size: 8
- WizardCoder-34B
  - Base Model: WizardLM/WizardCoder-Python-34B-V1.0
  - Vocabulary Size: 32,000
  - Batch Size: 4
- DeepSeek-Coder-6.7B
  - Base Model: deepseek-ai/deepseek-coder-6.7b-base
  - Vocabulary Size: 32,000
  - Batch Size: 8
- Codestral-22B
  - Base Model: mistralai/Codestral-22B-v0.1
  - Vocabulary Size: 32,000
  - Batch Size: 4
- WizardCoder-15B
  - Base Model: WizardLM/WizardCoder-15B-V1.0
  - Vocabulary Size: 32,000
  - Batch Size: 6
- DeepSeek-Coder-33B
  - Base Model: deepseek-ai/deepseek-coder-33b-base
  - Vocabulary Size: 32,000
  - Batch Size: 2
All models are configured with LoRA fine-tuning parameters optimized for their respective sizes, including appropriate rank, alpha, and dropout values.
The repository includes a sample dataset of Perl programming examples that demonstrates the required format. You can use this as a reference for preparing your own dataset.
Sample dataset structure:
dataset/processed/
├── train.jsonl
├── valid.jsonl
└── test.jsonl
Each JSONL file should contain entries in the following format:
{"prompt": "input text", "code": "target text"}
For the Perl programming dataset, the format is:
{
"prompt": "Write a Perl script to read a file and count word frequency",
"code": "#!/usr/bin/perl\nuse strict;\nuse warnings;\n\nmy %word_count;\nopen(my $fh, '<', 'input.txt') or die \"Cannot open file: $!\";\nwhile(my $line = <$fh>) {\n chomp($line);\n my @words = split(/\\s+/, $line);\n foreach my $word (@words) {\n $word_count{$word}++;\n }\n}\nclose($fh);\n\nforeach my $word (sort keys %word_count) {\n print \"$word: $word_count{$word}\\n\";\n}"
}
You can examine the sample dataset to understand:
- Input format: Natural language descriptions of programming tasks
- Output format: Complete, executable Perl code solutions
- Data organization: Training, validation, and test splits
To use your own dataset:
- Follow the same JSONL format
- Split your data into train/valid/test files
- Place them in your configured data directory
- Ensure your input/output pairs are properly formatted
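Before training, it can help to sanity-check that every record in each split parses as JSON and contains both fields. A minimal sketch (the check_split helper below is illustrative, not part of the repository):

```python
import json
from pathlib import Path

# Illustrative helper: verify that each line of a split file is valid JSON
# with non-empty "prompt" and "code" fields.
def check_split(path: Path) -> int:
    count = 0
    with path.open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue
            record = json.loads(line)
            assert record.get("prompt") and record.get("code"), \
                f"{path}:{lineno} is missing 'prompt' or 'code'"
            count += 1
    return count

data_dir = Path("dataset/processed")
for split in ("train", "valid", "test"):
    print(f"{split}: {check_split(data_dir / f'{split}.jsonl')} examples")
```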
Edit config/models.yaml to set your desired parameters:
data:
  processed_dir: "dataset/processed/"
  metadata_dir: "dataset/metadata/"
models:
  - name: "model_name"
    pretrained_model: "model/checkpoint"
    max_length: 512
    vocab_size: 32000
    lora_rank: 8
    lora_alpha: 32
    lora_dropout: 0.1
training:
  epochs: 3
  batch_size: 8
  learning_rate: 2e-4
  weight_decay: 0.01
  warmup_steps: 100
  gradient_accumulation_steps: 4
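Note that the effective batch size is batch_size multiplied by gradient_accumulation_steps (8 × 4 = 32 in the example above). A minimal sketch of reading this file with PyYAML, assuming the layout shown above (the actual loading code lives in utils.py and may differ):

```python
import yaml

# Load the pipeline configuration (path used throughout this README).
with open("config/models.yaml", encoding="utf-8") as fh:
    config = yaml.safe_load(fh)

training = config["training"]
effective_batch = training["batch_size"] * training["gradient_accumulation_steps"]
print(f"Effective batch size: {effective_batch}")  # 8 * 4 = 32 for the example above

for model in config["models"]:
    print(model["name"], model["pretrained_model"], model["lora_rank"])
```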
- Train the tokenizer:
python tokenizer.py
- Choose your fine-tuning approach:
For standard LoRA fine-tuning:
python finetune.py
For QLoRA fine-tuning (CUDA-only):
python finetune_qlora.py
Note: QLoRA requires CUDA support. If CUDA is not available, the script will fall back to standard CPU training.
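A minimal sketch of the kind of device check and 4-bit loading that QLoRA relies on; the BitsAndBytesConfig values shown are common defaults and not necessarily the exact settings used in finetune_qlora.py:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "bigcode/starcoder2-3b"  # any base model from the list above

if torch.cuda.is_available():
    # Load the base model in 4-bit precision for QLoRA (CUDA only).
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name, quantization_config=bnb_config, device_map="auto"
    )
else:
    # Fall back to an unquantized model on CPU, as the script does.
    model = AutoModelForCausalLM.from_pretrained(model_name)
```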
The custom tokenizer (tokenizer.py) implements:
- BPE tokenization with configurable vocabulary size
- Special tokens handling ([PAD], [BOS], [EOS], [UNK])
- Efficient batch processing
- Vocabulary statistics generation
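A minimal sketch of BPE training with the Hugging Face Tokenizers library, using the special tokens listed above (tokenizer.py may structure this differently, and the corpus path is illustrative):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Build a byte-level BPE tokenizer with the pipeline's special tokens.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32000,  # configurable via vocab_size in config/models.yaml
    special_tokens=["[PAD]", "[BOS]", "[EOS]", "[UNK]"],
)

# Train on plain-text corpus files; the path is illustrative, the real
# script reads the processed dataset configured in models.yaml.
tokenizer.train(["dataset/processed/corpus.txt"], trainer)
tokenizer.save("tokenizer.json")
```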
The fine-tuning scripts implement two approaches:
- Standard LoRA (finetune.py; a short sketch follows this list):
  - LoRA adaptation for efficient fine-tuning
  - Mixed precision training (FP16) when available
  - Gradient accumulation for handling large batches
  - TensorBoard integration for monitoring
  - Automatic model checkpointing
- QLoRA (finetune_qlora.py):
  - Quantized LoRA implementation for extreme memory efficiency
  - 4-bit quantization of the base model
  - CUDA-only optimization
  - Automatic fallback to CPU training if CUDA is unavailable
  - Same monitoring and checkpointing features as standard LoRA
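As a rough illustration of the standard LoRA path, wrapping a base model in PEFT adapters might look like the following. The rank, alpha, and dropout values mirror the configuration example above; the target modules are architecture-dependent and may differ from what finetune.py actually uses:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Wrap a base model with LoRA adapters via PEFT.
base = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # architecture-dependent choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```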
The utilities module (utils.py) provides:
- YAML configuration loading
- Logging setup
- Directory validation and creation
- Dataset statistics computation
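A minimal sketch of what the logging and directory helpers might look like (function names here are illustrative; see utils.py for the actual implementations):

```python
import logging
from pathlib import Path

def setup_logging(log_dir: str) -> logging.Logger:
    """Write training logs both to the console and to a file in log_dir."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
        handlers=[
            logging.StreamHandler(),
            logging.FileHandler(Path(log_dir) / "training.log"),
        ],
    )
    return logging.getLogger("finetune")

def ensure_dirs(*paths: str) -> None:
    """Create any output directories that do not exist yet."""
    for path in paths:
        Path(path).mkdir(parents=True, exist_ok=True)
```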
Training progress can be monitored through:
- Log files in the specified log directory
- TensorBoard metrics
- Model checkpoints saved at regular intervals
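For reference, TensorBoard scalars are written roughly like this; the training scripts handle this internally, and the log directory and loss values below are placeholders:

```python
from torch.utils.tensorboard import SummaryWriter

# Illustrative only: how a training loss curve ends up in TensorBoard.
writer = SummaryWriter(log_dir="logs/tensorboard")  # assumed log directory
for step, loss in enumerate([2.1, 1.8, 1.5]):       # dummy loss values
    writer.add_scalar("train/loss", loss, global_step=step)
writer.close()
```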
- Always validate your configuration before training
- Monitor GPU memory usage during training
- Keep track of model checkpoints
- Use appropriate batch sizes for your GPU
- Evaluate model performance regularly
- For QLoRA, ensure CUDA is available for optimal performance
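For the GPU memory point, a quick way to check current and peak usage from Python (CUDA only; a minimal sketch):

```python
import torch

if torch.cuda.is_available():
    # Report current and peak GPU memory usage in GiB.
    allocated = torch.cuda.memory_allocated() / 1024**3
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"allocated: {allocated:.2f} GiB, peak: {peak:.2f} GiB")
```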
This project includes comprehensive documentation to help you get started and contribute:
- README.md - Project overview and setup instructions
- CONTRIBUTING.md - Guidelines for contributing to the project
- LICENSE - Software license terms
Please read the Contributing Guidelines before submitting pull requests.
- Hugging Face Transformers library
- PEFT library for LoRA implementation
- Tokenizers library for efficient tokenization
- bitsandbytes library for quantization support