Dolma
Dolma is an open dataset of 3 trillion tokens drawn from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
Created as the training corpus for OLMo, AI2's language model, it offers an expansive playground for developers, researchers, and innovators.
Usage
This repository contains tools for generating and inspecting Dolma. To get started, install the Dolma Python library from PyPI.
pip install dolma
The dolma CLI can be accessed using the dolma command. To see the available commands, use the --help flag.
dolma --help
At the moment, the CLI supports three commands: tag, dedupe, and mix.
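For example, a tagging run might look roughly like the sketch below. The flag names (--documents, --taggers, --processes) and the tagger name are assumptions based on typical usage of the tool, so check dolma tag --help for the options your installed version actually supports.
dolma tag \
    --documents "data/documents/*.json.gz" \
    --taggers random_number_v1 \
    --processes 8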
For all commands, configurations can be specified from the command line, or by passing a YAML or JSON file using the -c flag. For example:
dolma -c config.yaml dedupe --dedupe.name "test"
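As a rough sketch, the config.yaml referenced above might contain nested keys that mirror the dotted command-line overrides (so --dedupe.name corresponds to dedupe.name in the file). The exact schema is defined in the repository documentation; the fields below are purely illustrative.
# config.yaml (illustrative sketch, not the authoritative schema)
documents:
  - data/documents/*.json.gz   # assumed: input paths for the dedupe step
dedupe:
  name: test                   # overridden on the command line above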
What Sets Dolma Apart?
Versatility: With its extensive and varied content, Dolma provides ample opportunities for experimentation and research across different AI domains.
Ease of Use: The Dolma Python library can be quickly installed from PyPI, allowing you to jump into your projects without delay.
Robust Tools: This repository equips you with tools for generating, tagging, deduplicating, and mixing the dataset, tailoring it to your specific needs.
Getting Started with Dolma
Utilize the dolma CLI to explore the available commands and configurations. From deduplication with dedupe to document mixing with mix, Dolma opens up numerous possibilities. Full usage instructions are available in the repository.
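For instance, a simple end-to-end pass could chain the three commands, each driven by its own config file (the file names here are placeholders):
dolma -c tag.yaml tag
dolma -c dedupe.yaml dedupe
dolma -c mix.yaml mix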
Contribute and Collaborate
Dolma isn't just a dataset; it's a community-driven initiative that welcomes contributions and ideas. Check out the development guide to see how you can get involved.
Some open questions remain about the dataset's licensing and fair-use terms, but if those concerns are cleared up, this is a dataset worth starring or looking into.