In this repository, I run a few experiments with two text datasets hosted on Hugging Face: C4 and Cosmopedia.
C4 (Colossal Clean Crawled Corpus), a cleaned version of Common Crawl's web crawl corpus, is available on Hugging Face Datasets. I decided to use only the Portuguese subset (pt). You can browse the dataset online at the following link: https://huggingface.co/datasets/allenai/c4/viewer/pt?views%5B%5D=pt_train
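As a rough illustration of how the Portuguese subset could be read without downloading it in full, here is a minimal sketch that streams it with the `datasets` library. The helper names `load_c4_pt` and `preview` are hypothetical and are not taken from `src/c4dataset.py`. The sketch assumes each record has a `text` field, as shown in the dataset viewer.

```python
from itertools import islice


def load_c4_pt():
    """Open the Portuguese C4 subset in streaming mode (needs network access)."""
    # Imported here so that preview() can be used without `datasets` installed.
    from datasets import load_dataset

    # "pt" is the config name shown in the dataset viewer URL.
    return load_dataset("allenai/c4", "pt", split="train", streaming=True)


def preview(samples, n=3, width=80):
    """Return the 'text' field of the first n samples, cut to `width` characters."""
    return [sample["text"][:width] for sample in islice(samples, n)]
```

Calling `preview(load_c4_pt())` would fetch only the first few records, which is usually enough to check the content before running a longer experiment.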
Cosmopedia is a synthetic dataset of educational samples, created by Hugging Face using Mixtral-8x7B-Instruct-v0.1, an LLM from Mistral AI. I decided to use only the stanford subset, which is built from Stanford course outlines. You can browse the dataset online at the following link: https://huggingface.co/datasets/HuggingFaceTB/cosmopedia/viewer/stanford
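One simple experiment on this subset is to count how samples are distributed over a metadata column. The sketch below is only an illustration: `load_cosmopedia_stanford` and `count_by` are hypothetical names, not functions from `src/cosmospedia.py`, and the `audience` column is assumed from what the dataset viewer shows.

```python
from collections import Counter
from itertools import islice


def load_cosmopedia_stanford():
    """Open the stanford subset of Cosmopedia in streaming mode (needs network access)."""
    # Imported here so that count_by() can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "HuggingFaceTB/cosmopedia", "stanford", split="train", streaming=True
    )


def count_by(samples, key, limit=1000):
    """Count the values of `key` over the first `limit` samples."""
    return Counter(sample[key] for sample in islice(samples, limit))
```

For example, `count_by(load_cosmopedia_stanford(), "audience")` would tally the target audiences among the first 1000 streamed samples.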
$ poetry install
$ poetry shell
$ python ./src/c4dataset.py
$ python ./src/cosmospedia.py