Dolma
Dolma is an open dataset of 3 trillion tokens drawn from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
Created as the training corpus for OLMo, AI2's language model, it offers an expansive playground for developers, researchers, and innovators.
Usage
This repository contains tools for generating and inspecting Dolma. To get started, install the Dolma Python library from PyPI.
pip install dolma
The dolma CLI can be accessed using the dolma command. To see the available commands, use the --help flag.
dolma --help
At the moment, the CLI supports three commands: tag, dedupe, and mix.
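For example, a tagging run might look roughly like the sketch below. The flag names (--documents, --taggers, --processes) and the tagger name are assumptions based on typical usage of the tool, so check dolma tag --help for the options your installed version actually supports.
dolma tag \
    --documents "data/documents/*.json.gz" \
    --taggers random_number_v1 \
    --processes 8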
For all commands, configurations can be specified from the command line, or by passing a YAML or JSON file using the -c flag. For example:
dolma -c config.yaml dedupe --dedupe.name "test"
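As a rough sketch, the config.yaml referenced above might contain nested keys that mirror the dotted command-line overrides (so --dedupe.name corresponds to dedupe.name in the file). The exact schema is defined in the repository documentation; the fields below are purely illustrative.
# config.yaml (illustrative sketch, not the authoritative schema)
documents:
  - data/documents/*.json.gz   # assumed: input paths for the dedupe step
dedupe:
  name: test                   # overridden on the command line above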
What Sets Dolma Apart?
Versatility: With its extensive and varied content, Dolma provides ample opportunities for experimentation and research across different AI domains.
Ease of Use: The Dolma Python library can be quickly installed from PyPI, allowing you to jump into your projects without delay.
Robust Tools: This repository equips you with tools for generating, tagging, deduplicating, and mixing the dataset, tailoring it to your specific needs.
Getting Started with Dolma
Utilize the dolma CLI to explore the available commands and configurations. From deduplication with dedupe to document mixing with mix, Dolma opens up numerous possibilities. Full usage instructions are available in the repository.
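For instance, a simple end-to-end pass could chain the three commands, each driven by its own config file (the file names here are placeholders):
dolma -c tag.yaml tag
dolma -c dedupe.yaml dedupe
dolma -c mix.yaml mix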
Contribute and Collaborate
Dolma isn't just a dataset; it's a community-driven initiative that welcomes contributions and ideas. Check out the development guide to see how you can get involved.
Some open questions remain about the dataset's licensing and fair-use terms, but if those concerns are cleared up, this is a dataset worth starring or looking into.