Datasets4LLMs

This repository contains a few experiments with datasets for large language models (LLMs).

C4 Dataset

C4, the Colossal Clean Crawled Corpus, is a cleaned version of Common Crawl's web crawl corpus and is available on Hugging Face Datasets. I use only the Portuguese subset. You can browse the dataset online at the following link: https://huggingface.co/datasets/allenai/c4/viewer/pt?views%5B%5D=pt_train
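As a minimal sketch of how the Portuguese subset could be inspected (this is not necessarily what `src/c4dataset.py` does), the snippet below streams the `pt` configuration of `allenai/c4` with the `datasets` library and prints the start of the first few documents. The helper names `take_texts` and `stream_c4_pt` are hypothetical, and the snippet assumes each record exposes a `text` field.

```python
from itertools import islice
from typing import Iterable


def take_texts(records: Iterable[dict], n: int, field: str = "text") -> list[str]:
    """Return the `field` value of the first `n` records (hypothetical helper)."""
    return [record[field] for record in islice(records, n)]


def stream_c4_pt():
    """Stream the Portuguese C4 train split instead of downloading it in full."""
    # Imported lazily so that take_texts() can be used without the library installed.
    from datasets import load_dataset

    # "pt" is the configuration name that appears in the dataset viewer URL.
    return load_dataset("allenai/c4", "pt", split="train", streaming=True)


if __name__ == "__main__":
    # Assumption: each record carries its document body in a "text" field.
    for text in take_texts(stream_c4_pt(), 3):
        print(text[:200].replace("\n", " "))
```

Streaming avoids fetching the whole split before the first record is available, which is convenient for quick experiments on a corpus of this size.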

Cosmopedia (Synthetic dataset)

This is a synthetic dataset of educational samples, created by Hugging Face using an LLM from Mistral AI. I use only the `stanford` subset, which focuses on content from Stanford material. You can browse the dataset online at the following link: https://huggingface.co/datasets/HuggingFaceTB/cosmopedia/viewer/stanford
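A rough sketch of one possible experiment on this subset (not necessarily what the script in `src` does): stream the `stanford` configuration and compute simple word-count statistics over the first samples. The helper names `word_count_stats` and `stream_cosmopedia_stanford` are hypothetical, and the snippet assumes each record exposes a `text` field.

```python
from itertools import islice
from statistics import mean
from typing import Iterable


def word_count_stats(records: Iterable[dict], n: int, field: str = "text") -> dict:
    """Whitespace-separated word-count statistics over the first `n` records."""
    counts = [len(record[field].split()) for record in islice(records, n)]
    if not counts:
        return {"samples": 0, "min": 0, "max": 0, "mean": 0.0}
    return {
        "samples": len(counts),
        "min": min(counts),
        "max": max(counts),
        "mean": mean(counts),
    }


def stream_cosmopedia_stanford():
    """Stream the `stanford` subset of Cosmopedia instead of downloading it in full."""
    # Imported lazily so that word_count_stats() works without the library installed.
    from datasets import load_dataset

    # "stanford" is the configuration name that appears in the dataset viewer URL.
    return load_dataset("HuggingFaceTB/cosmopedia", "stanford", split="train", streaming=True)


if __name__ == "__main__":
    # Assumption: each record carries the generated sample in a "text" field.
    print(word_count_stats(stream_cosmopedia_stanford(), 100))
```

Counting whitespace-separated words is only a proxy for length; a tokenizer-based count would be closer to what a model actually sees.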

How to install dependencies and run

$ poetry install
$ poetry shell
$ python ./src/c4dataset.py
$ python ./src/cosmospedia.py
