MetaDaily – Breaking News in Crypto, Markets & Digital Trends

EleutherAI releases massive AI training dataset of licensed and open domain text

By admin | June 6, 2025 | 3 Mins Read

EleutherAI, an AI research organization, has released what it claims is one of the largest collections of licensed and open-domain text for training AI models.

The dataset, called the Common Pile v0.1, took around two years to complete in collaboration with AI startups Poolside, Hugging Face, and others, along with several academic institutions. At 8 terabytes, the Common Pile v0.1 was used to train two new AI models from EleutherAI, Comma v0.1-1T and Comma v0.1-2T, which EleutherAI claims perform on par with models developed using unlicensed, copyrighted data.

AI companies, including OpenAI, are embroiled in lawsuits over their AI training practices, which rely on scraping the web — including copyrighted material like books and research journals — to build model training datasets. While some AI companies have licensing arrangements in place with certain content providers, most maintain that the U.S. legal doctrine of fair use shields them from liability in cases where they trained on copyrighted work without permission.

EleutherAI argues that these lawsuits have “drastically decreased” transparency from AI companies, which the organization says has harmed the broader AI research field by making it more difficult to understand how models work and what their flaws might be.

“[Copyright] lawsuits have not meaningfully changed data sourcing practices in [model] training, but they have drastically decreased the transparency companies engage in,” Stella Biderman, EleutherAI’s executive director, wrote in a blog post on Hugging Face early Friday. “Researchers at some companies we have spoken to have also specifically cited lawsuits as the reason why they’ve been unable to release the research they’re doing in highly data-centric areas.”

The Common Pile v0.1, which can be downloaded from Hugging Face’s AI dev platform and GitHub, was created in consultation with legal experts. Its sources include 300,000 public domain books digitized by the Library of Congress and the Internet Archive. EleutherAI also used Whisper, OpenAI’s open source speech-to-text model, to transcribe audio content.

EleutherAI claims Comma v0.1-1T and Comma v0.1-2T are evidence that the Common Pile v0.1 was curated carefully enough to enable developers to build models competitive with proprietary alternatives. According to EleutherAI, the models, both of which have 7 billion parameters and were trained on only a fraction of the Common Pile v0.1, rival models like Meta’s first Llama AI model on benchmarks for coding, image understanding, and math.

Parameters, sometimes referred to as weights, are the internal components of an AI model that guide its behavior and answers.
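
To make the idea of a parameter count concrete, here is a minimal Python sketch that tallies the weights and biases in a toy fully connected network. The layer sizes and the `count_parameters` helper are illustrative assumptions; they do not describe the architecture of the Comma models, which the article does not detail.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a toy fully connected network.

    layer_sizes is a hypothetical list of layer widths, e.g. [4, 8, 2].
    """
    total = 0
    # Each pair of adjacent layers is joined by a weight matrix and a bias vector.
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per input-output connection
        total += n_out         # one bias per output unit
    return total


# Toy example: 4 inputs, 8 hidden units, 2 outputs.
print(count_parameters([4, 8, 2]))  # prints 58
```

A 7-billion-parameter model is the same kind of tally carried out over far larger and more numerous layers.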

“In general, we think that the common idea that unlicensed text drives performance is unjustified,” Biderman wrote in her post. “As the amount of accessible openly licensed and public domain data grows, we can expect the quality of models trained on openly licensed content to improve.”

The Common Pile v0.1 appears to be in part an effort to right EleutherAI’s historical wrongs. Years ago, the organization released The Pile, an open collection of training text that includes copyrighted material. AI companies have come under fire, and legal pressure, for using The Pile to train models.

EleutherAI is committing to releasing open datasets more frequently going forward in collaboration with its research and infrastructure partners.

Updated 9:48 a.m. Pacific: Biderman clarified in a post on X that EleutherAI contributed to the release of the datasets and models, but that their development involved many partners, including the University of Toronto, which helped lead the research.


