
Persian BPE Tokenizer (30K)

A Byte-Pair Encoding (BPE) tokenizer for Persian NLP tasks, with a vocabulary size of 30,000, trained on roughly 2 million Persian texts averaging about 10,000 characters each.

Usage

Encoding

from tokenizers import Tokenizer

# Load the pretrained tokenizer and encode a sample Persian sentence.
tokenizer = Tokenizer.from_file("Persian_BPE_Tokenizer_30K.json")
encoded_text = tokenizer.encode("این یک متن آزمایشی است.")
print("Tokens:", encoded_text.tokens)
print("IDs:", encoded_text.ids)

Decoding

# decode_batch over single-ID lists recovers the text of each token separately.
decoded_tokens = tokenizer.decode_batch([[token_id] for token_id in encoded_text.ids])
print("Decoded:", decoded_tokens)

Training Data

This tokenizer was trained on the following datasets:

License

Code and tokenizer: MIT License

Evaluation Metrics

  • UNK Rate: 0.0% (on 100,000 samples)
  • Compression Ratio: 4.56 (on 100,000 samples)
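
As a rough sketch of how such figures could be reproduced, the snippet below counts unknown tokens and characters per token over a list of texts. It assumes the unknown token is literally "[UNK]" and that the compression ratio means characters per token; neither assumption is stated in this README.

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("Persian_BPE_Tokenizer_30K.json")

def evaluate(texts, unk_token="[UNK]"):
    # Accumulate token, unknown-token, and character counts over the corpus.
    total_tokens = total_unk = total_chars = 0
    for text in texts:
        encoding = tokenizer.encode(text)
        total_tokens += len(encoding.tokens)
        total_unk += encoding.tokens.count(unk_token)
        total_chars += len(text)
    return total_unk / total_tokens, total_chars / total_tokens

unk_rate, ratio = evaluate(["این یک متن آزمایشی است."])
print(f"UNK rate: {unk_rate:.2%}, compression ratio: {ratio:.2f}")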

Requirements

  • For using the tokenizer:
    • Python >= 3.9
    • tokenizers
  • For training the tokenizer:
    • pandas
    • datasets
    • requests
    • hazm
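
The training dependencies above suggest the tokenizer was built with the Hugging Face tokenizers training API. Below is a minimal sketch of training a comparable 30K-vocabulary BPE tokenizer; the pre-tokenizer, special tokens, and corpus iterator are assumptions for illustration, not the repository's actual training script (which also uses hazm for normalization and pandas/datasets for loading the corpus).

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Hypothetical corpus iterator; replace with the real Persian text source.
def corpus_iterator():
    yield "این یک متن آزمایشی است."

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))   # assumed unknown token
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # assumed pre-tokenization
trainer = trainers.BpeTrainer(vocab_size=30000, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus_iterator(), trainer=trainer)
tokenizer.save("Persian_BPE_Tokenizer_30K.json")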
