Datasets4LLMs

In this repository, I'm running a few experiments with two datasets: C4 and Cosmopedia.

C4 Dataset

C4 (Colossal Clean Crawled Corpus), a cleaned version of Common Crawl's web crawl corpus, is available on Hugging Face Datasets. I decided to use only the Portuguese subset. You can browse the dataset online at the following link: https://huggingface.co/datasets/allenai/c4/viewer/pt?views%5B%5D=pt_train
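As a rough illustration of reading this subset, here is a minimal sketch that assumes the Hugging Face `datasets` library is installed and that each record exposes a `text` field. The helper names `stream_c4_pt` and `preview` are hypothetical and are not taken from `src/c4dataset.py`; the repository and config names (`allenai/c4`, `pt`) come from the viewer link above.

```python
from itertools import islice


def stream_c4_pt():
    """Open the Portuguese C4 training split as a lazy stream (needs network access)."""
    # Imported here so the helper below can be used without the library installed.
    from datasets import load_dataset

    # "allenai/c4" and "pt" come from the viewer URL; streaming=True avoids
    # downloading the whole subset before iterating over it.
    return load_dataset("allenai/c4", "pt", split="train", streaming=True)


def preview(records, n=3, width=80):
    """Return the `text` field of the first n records, cut to `width` characters."""
    return [record["text"][:width] for record in islice(records, n)]
```

Calling `preview(stream_c4_pt())` should then return the opening characters of the first three Portuguese documents. Streaming is chosen here because the subset is large; dropping `streaming=True` would download it in full first.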

Cosmopedia (Synthetic dataset)

This is a synthetic dataset of educational samples, created by Hugging Face using Mistral AI's Mixtral-8x7B-Instruct-v0.1 model. I decided to use only the stanford subset, which is generated from Stanford course outlines. You can browse the dataset online at the following link: https://huggingface.co/datasets/HuggingFaceTB/cosmopedia/viewer/stanford
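One simple experiment on this subset is to look at how long the generated samples are. The sketch below assumes the Hugging Face `datasets` library and a `text` field on each record; the helper names `stream_cosmopedia_stanford` and `word_count_summary` are hypothetical and do not come from `src/cosmospedia.py`.

```python
from itertools import islice
from statistics import mean


def stream_cosmopedia_stanford():
    """Open the `stanford` subset of Cosmopedia as a lazy stream (needs network access)."""
    # Imported here so the summary helper works without the library installed.
    from datasets import load_dataset

    # Repository and config names are taken from the viewer URL.
    return load_dataset(
        "HuggingFaceTB/cosmopedia", "stanford", split="train", streaming=True
    )


def word_count_summary(records, n=100):
    """Summarise whitespace-separated word counts over the first n records."""
    counts = [len(record["text"].split()) for record in islice(records, n)]
    if not counts:
        return {"samples": 0, "mean_words": 0.0, "max_words": 0}
    return {
        "samples": len(counts),
        "mean_words": mean(counts),
        "max_words": max(counts),
    }
```

`word_count_summary(stream_cosmopedia_stanford())` would report statistics over the first 100 samples. Splitting on whitespace is only a crude proxy for length; a tokenizer-based count would be closer to what an LLM actually sees.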

How to install dependencies and run

$ poetry install
$ poetry shell
$ python ./src/c4dataset.py
$ python ./src/cosmospedia.py
