Language Model Fine-tuning Pipeline

This repository contains a robust pipeline for fine-tuning language models using advanced techniques like LoRA (Low-Rank Adaptation), QLoRA (Quantized LoRA), and custom tokenization. The project is designed to be modular, configurable, and production-ready.

Features

  • Custom Tokenizer Training: Implements BPE (Byte Pair Encoding) tokenizer training with configurable vocabulary size
  • LoRA Fine-tuning: Efficient fine-tuning using Low-Rank Adaptation
  • QLoRA Support: Quantized LoRA implementation for memory-efficient training (CUDA-only)
  • Configurable Pipeline: YAML-based configuration for easy customization
  • Robust Logging: Comprehensive logging system for monitoring training progress
  • Memory Efficient: Optimized for GPU memory usage with gradient accumulation and mixed precision training
  • Production Ready: Includes proper error handling, validation, and directory management

Prerequisites

  • Python 3.8+
  • CUDA-capable GPU (required for QLoRA, recommended for LoRA)
  • PyTorch
  • Transformers
  • Datasets
  • PEFT
  • Tokenizers
  • PyYAML
  • bitsandbytes (for QLoRA)

Installation

  1. Clone the repository:
git clone https://github.com/happybear-21/genie.git
cd genie
  2. Install dependencies:
pip install -r requirements.txt
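
If you prefer to install the dependencies individually (or need to reconstruct the environment), the prerequisites listed above correspond to the following PyPI packages; pin versions as appropriate for your setup, since the exact pins in requirements.txt may differ:

pip install torch transformers datasets peft tokenizers pyyaml bitsandbytes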

Project Structure

.
├── config/
│   └── models.yaml        # Configuration file
├── finetune.py            # Standard LoRA fine-tuning script
├── finetune_qlora.py      # QLoRA fine-tuning script
├── tokenizer.py           # Custom tokenizer implementation
├── utils.py               # Utility functions
└── README.md              # This file

Configuration

The project uses a YAML configuration file (config/models.yaml) to manage all parameters. Key configuration sections include:

  • Data Configuration: Dataset paths and processing parameters
  • Model Configuration: Model-specific parameters including LoRA settings
  • Training Configuration: Training hyperparameters
  • Output Configuration: Logging and model saving paths

Implemented Models

The following models are currently implemented and configured in the pipeline:

  1. StarCoder2-3B

    • Base Model: bigcode/starcoder2-3b
    • Vocabulary Size: 50,000
    • Batch Size: 8
  2. CodeLlama-7B

    • Base Model: codellama/CodeLlama-7b-hf
    • Vocabulary Size: 32,000
    • Batch Size: 8
  3. WizardCoder-34B

    • Base Model: WizardLM/WizardCoder-Python-34B-V1.0
    • Vocabulary Size: 32,000
    • Batch Size: 4
  4. DeepSeek-Coder-6.7B

    • Base Model: deepseek-ai/deepseek-coder-6.7b-base
    • Vocabulary Size: 32,000
    • Batch Size: 8
  5. Codestral-22B

    • Base Model: mistralai/Codestral-22B-v0.1
    • Vocabulary Size: 32,000
    • Batch Size: 4
  6. WizardCoder-15B

    • Base Model: WizardLM/WizardCoder-15B-V1.0
    • Vocabulary Size: 32,000
    • Batch Size: 6
  7. DeepSeek-Coder-33B

    • Base Model: deepseek-ai/deepseek-coder-33b-base
    • Vocabulary Size: 32,000
    • Batch Size: 2

All models are configured with LoRA fine-tuning parameters optimized for their respective sizes, including appropriate rank, alpha, and dropout values.
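
To illustrate how these per-model settings map onto PEFT, here is a minimal sketch (not the repository's finetune.py) that builds a LoRA configuration from values like those in config/models.yaml. The r/alpha/dropout values mirror the configuration example later in this README, and the target_modules list is an assumption that depends on the base model's architecture; adjust it per model.

# Minimal sketch: wrap a base model with LoRA via PEFT.
# target_modules is an assumption and must match the model's attention layers.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("bigcode/starcoder2-3b")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # lora_rank in models.yaml
    lora_alpha=32,                        # lora_alpha in models.yaml
    lora_dropout=0.1,                     # lora_dropout in models.yaml
    target_modules=["q_proj", "v_proj"],  # assumed; adjust per base model
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # confirms only a small fraction of weights is trainable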

Usage

1. Prepare Your Dataset

The repository includes a sample dataset of Perl programming examples that demonstrates the required format. You can use this as a reference for preparing your own dataset.

Sample dataset structure:

dataset/processed/
├── train.jsonl
├── valid.jsonl
└── test.jsonl

Each JSONL file should contain entries in the following format:

{"prompt": "input text", "code": "target text"}

For the Perl programming dataset, an entry looks like this (pretty-printed here for readability; in the actual JSONL files each entry occupies a single line):

{
    "prompt": "Write a Perl script to read a file and count word frequency",
    "code": "#!/usr/bin/perl\nuse strict;\nuse warnings;\n\nmy %word_count;\nopen(my $fh, '<', 'input.txt') or die \"Cannot open file: $!\";\nwhile(my $line = <$fh>) {\n    chomp($line);\n    my @words = split(/\\s+/, $line);\n    foreach my $word (@words) {\n        $word_count{$word}++;\n    }\n}\nclose($fh);\n\nforeach my $word (sort keys %word_count) {\n    print \"$word: $word_count{$word}\\n\";\n}"
}

You can examine the sample dataset to understand:

  • Input format: Natural language descriptions of programming tasks
  • Output format: Complete, executable Perl code solutions
  • Data organization: Training, validation, and test splits

To use your own dataset (a validation sketch follows these steps):

  1. Follow the same JSONL format
  2. Split your data into train/valid/test files
  3. Place them in your configured data directory
  4. Ensure your input/output pairs are properly formatted
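
Before training, it helps to confirm that every line parses as JSON and carries the expected keys. The following standalone sketch assumes the dataset/processed/ layout shown above:

# Sketch: check that each split parses as JSONL with "prompt" and "code" keys.
import json
from pathlib import Path

data_dir = Path("dataset/processed")

for split in ("train.jsonl", "valid.jsonl", "test.jsonl"):
    path = data_dir / split
    with path.open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            record = json.loads(line)  # raises if the line is not valid JSON
            missing = {"prompt", "code"} - set(record)
            if missing:
                raise ValueError(f"{path}:{lineno} missing keys: {missing}")
    print(f"{split}: OK")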

2. Configure Your Training

Edit config/models.yaml to set your desired parameters:

data:
  processed_dir: "dataset/processed/"
  metadata_dir: "dataset/metadata/"

models:
  - name: "model_name"
    pretrained_model: "model/checkpoint"
    max_length: 512
    vocab_size: 32000
    lora_rank: 8
    lora_alpha: 32
    lora_dropout: 0.1

training:
  epochs: 3
  batch_size: 8
  learning_rate: 2e-4
  weight_decay: 0.01
  warmup_steps: 100
  gradient_accumulation_steps: 4

3. Run the Pipeline

  1. Train the tokenizer:
python tokenizer.py
  2. Choose your fine-tuning approach:

For standard LoRA fine-tuning:

python finetune.py

For QLoRA fine-tuning (CUDA-only):

python finetune_qlora.py

Note: QLoRA requires CUDA support. If CUDA is not available, the script will fall back to standard CPU training.
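
The fallback described in this note amounts to a device check along these lines (a sketch, not the script's exact code; the QLoRA-specific model loading is sketched under Technical Details below):

# Sketch: choose the training device the way the note describes.
import torch

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"
print(f"CUDA available: {use_cuda} -> training on {device}")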

Technical Details

Tokenizer Implementation

The custom tokenizer (tokenizer.py) implements the following (a minimal training sketch appears after the list):

  • BPE tokenization with configurable vocabulary size
  • Special tokens handling ([PAD], [BOS], [EOS], [UNK])
  • Efficient batch processing
  • Vocabulary statistics generation
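
As a rough illustration of this flow (not the repository's tokenizer.py), a BPE tokenizer with the special tokens above can be trained with the Hugging Face tokenizers library along these lines; the corpus path and vocabulary size are placeholders, and in practice you would extract the text fields from the JSONL files first:

# Sketch: train a BPE tokenizer with configurable vocab size and special tokens.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,  # set from vocab_size in models.yaml
    special_tokens=["[PAD]", "[BOS]", "[EOS]", "[UNK]"],
)

tokenizer.train(files=["dataset/processed/train.jsonl"], trainer=trainer)
tokenizer.save("tokenizer.json")

print("final vocab size:", tokenizer.get_vocab_size())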

Fine-tuning Process

The fine-tuning scripts implement two approaches (a combined sketch follows the list):

  1. Standard LoRA (finetune.py):
  • LoRA adaptation for efficient fine-tuning
  • Mixed precision training (FP16) when available
  • Gradient accumulation for handling large batches
  • TensorBoard integration for monitoring
  • Automatic model checkpointing
  2. QLoRA (finetune_qlora.py):
  • Quantized LoRA implementation for extreme memory efficiency
  • 4-bit quantization of the base model
  • CUDA-only optimization
  • Automatic fallback to CPU training if CUDA is unavailable
  • Same monitoring and checkpointing features as standard LoRA
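
The two code paths differ mainly in how the base model is loaded. The sketch below is not the repository's scripts, but it shows the general shape: a 4-bit quantized load when CUDA is available, a plain load otherwise, plus TrainingArguments enabling FP16, gradient accumulation, and TensorBoard logging. The model name, paths, and hyperparameters are placeholders taken from the configuration example earlier in this README.

# Sketch: quantized (QLoRA) vs. standard load, then LoRA + Trainer setup.
import torch
from transformers import (AutoModelForCausalLM, BitsAndBytesConfig,
                          Trainer, TrainingArguments)
from peft import LoraConfig, TaskType, get_peft_model

model_name = "bigcode/starcoder2-3b"   # any base model from models.yaml
use_cuda = torch.cuda.is_available()

if use_cuda:
    # QLoRA path: load the frozen base model in 4-bit via bitsandbytes.
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    base = AutoModelForCausalLM.from_pretrained(
        model_name, quantization_config=quant_config, device_map="auto"
    )
else:
    # Fallback: standard (unquantized) load for CPU training.
    base = AutoModelForCausalLM.from_pretrained(model_name)

model = get_peft_model(
    base,
    LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=8, lora_alpha=32, lora_dropout=0.1,
        target_modules=["q_proj", "v_proj"],  # assumed; adjust per model
    ),
)

args = TrainingArguments(
    output_dir="checkpoints",          # assumed output path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    weight_decay=0.01,
    warmup_steps=100,
    fp16=use_cuda,                     # mixed precision only on GPU
    logging_dir="logs",                # TensorBoard event files
    report_to="tensorboard",
    save_strategy="epoch",             # periodic checkpointing
)

# trainer = Trainer(model=model, args=args, train_dataset=..., eval_dataset=...)
# trainer.train()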

Utility Functions

The utilities module (utils.py) provides the following (sketched below):

  • YAML configuration loading
  • Logging setup
  • Directory validation and creation
  • Dataset statistics computation
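
In spirit, these helpers look something like the sketch below; the function names are illustrative rather than the exact ones in utils.py:

# Sketch of the kinds of helpers described above; names are illustrative.
import logging
from pathlib import Path

import yaml


def load_config(path="config/models.yaml"):
    """Load the YAML configuration file into a dict."""
    with open(path, encoding="utf-8") as fh:
        return yaml.safe_load(fh)


def setup_logging(log_dir="logs"):
    """Create the log directory and configure file + console logging."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
        handlers=[
            logging.FileHandler(Path(log_dir) / "training.log"),
            logging.StreamHandler(),
        ],
    )


def ensure_dirs(*dirs):
    """Validate that output directories exist, creating them if needed."""
    for d in dirs:
        Path(d).mkdir(parents=True, exist_ok=True)


def dataset_stats(jsonl_path):
    """Return a simple record count for a JSONL split."""
    with open(jsonl_path, encoding="utf-8") as fh:
        return sum(1 for _ in fh)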

Monitoring

Training progress can be monitored through:

  • Log files in the specified log directory
  • TensorBoard metrics
  • Model checkpoints saved at regular intervals
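
To view the TensorBoard metrics, point TensorBoard at your configured logging directory (logs here is a placeholder; use the path from your output configuration):

tensorboard --logdir logs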

Best Practices

  • Always validate your configuration before training
  • Monitor GPU memory usage during training
  • Keep track of model checkpoints
  • Use appropriate batch sizes for your GPU
  • Evaluate model performance regularly
  • For QLoRA, ensure CUDA is available for optimal performance

Documentation

This project includes documentation to help you get started and contribute.

Contributing

Please read the Contributing Guidelines before submitting pull requests.

Acknowledgments

  • Hugging Face Transformers library
  • PEFT library for LoRA implementation
  • Tokenizers library for efficient tokenization
  • bitsandbytes library for quantization support
