Preprint: “Towards Best Practices for Open Datasets for LLM Training”

The preprint linked below was recently shared on arXiv.
Title
Towards Best Practices for Open Datasets for LLM Training
Authors
Stefan Baack, Stella Biderman, Kasia Odrozek, et al.
Source
via arXiv
DOI: 10.48550/arXiv.2501.08365
Abstract
Figure 1. Tiers of openness of datasets for LLM training. Source: 10.48550/arXiv.2501.08365
Many AI companies are training their large language models (LLMs) on data without the permission of the copyright owners. The permissibility of doing so varies b