<style>
.dolma-page * { box-sizing: border-box; }
.dolma-page h1, .dolma-page h2, .dolma-page h3, .dolma-page h4, .dolma-page h5, .dolma-page h6, .dolma-page p, .dolma-page ul, .dolma-page ol, .dolma-page li, .dolma-page pre, .dolma-page blockquote, .dolma-page table, .dolma-page td, .dolma-page th { margin: 0; padding: 0; }
.dolma-page {
-webkit-font-smoothing: antialiased;
-moz-osx-font-smoothing: grayscale;
color: var(--el-text-color-primary);
background: var(--el-bg-color);
line-height: 1.6;
}
.dolma-page a { text-decoration: none; color: inherit; }
.dolma-page a:hover { text-decoration: none; }
.dolma-page ul { list-style: none; }
.markdown-body .dolma-page a { color: inherit !important; text-decoration: none !important; }
.markdown-body .dolma-page a:hover { text-decoration: none !important; }
.markdown-body .dolma-page a.s-btn-primary,
.markdown-body .dolma-page a.btn-cta-light { color: #ffffff !important; }
.markdown-body .dolma-page a.s-btn-secondary { color: var(--el-text-color-primary) !important; }
.markdown-body .dolma-page a.btn-cta-ghost { color: #94a3b8 !important; }
.markdown-body .dolma-page a.btn-cta-ghost:hover { color: #e2e8f0 !important; }
.markdown-body .dolma-page h1, .markdown-body .dolma-page h2 { border-bottom: none !important; padding-bottom: 0 !important; }
.dolma-page .s-container { max-width: 1200px; margin: 0 auto; padding: 0 24px; }
.dolma-page .s-container-narrow { max-width: 800px; margin: 0 auto; padding: 0 24px; }
.dolma-page .s-container-wide { max-width: 1100px; margin: 0 auto; padding: 0 32px; }
.dolma-page .s-section { padding: 80px 0; }
.dolma-page .s-section-lg { padding: 100px 0; }
.dolma-page .s-section-sm { padding: 48px 0; }
.dolma-page .s-bg-white { background: var(--el-bg-color); }
.dolma-page .s-bg-gray { background: var(--el-bg-color-page); }
.dolma-page .s-bg-dark { background: #0f172a; color: #f8fafc; }
.dolma-page .s-header { text-align: center; margin-bottom: 64px; }
.dolma-page .s-header h2 {
font-size: clamp(28px, 4vw, 40px);
font-weight: 700;
color: var(--el-text-color-primary);
letter-spacing: normal;
margin-bottom: 20px;
line-height: 1.15;
}
.dolma-page .s-header p {
font-size: clamp(16px, 2vw, 18px);
color: var(--el-text-color-regular);
max-width: 640px;
margin: 0 auto;
line-height: 1.6;
}
.dolma-page .s-bg-dark .s-header h2 { color: #f8fafc; }
.dolma-page .s-bg-dark .s-header p { color: var(--el-text-color-secondary); }
.dolma-page .s-btn-primary {
display: inline-flex; align-items: center; gap: 6px;
padding: 14px 28px;
background: #0284c7; color: #ffffff !important;
border-radius: 9999px; font-size: 15px; font-weight: 600;
transition: background 0.2s, transform 0.15s;
border: none; cursor: pointer;
text-decoration: none !important;
}
.dolma-page .s-btn-primary:hover { background: #0369a1; transform: translateY(-1px); text-decoration: none !important; }
.dolma-page .s-btn-secondary {
display: inline-flex; align-items: center; gap: 6px;
padding: 14px 28px;
background: var(--el-bg-color); color: var(--el-text-color-primary) !important;
border: 1px solid var(--el-border-color-light);
border-radius: 9999px; font-size: 15px; font-weight: 600;
transition: border-color 0.2s, background 0.2s;
cursor: pointer;
text-decoration: none !important;
}
.dolma-page .s-btn-secondary:hover { background: var(--el-bg-color-page); text-decoration: none !important; }
.dolma-hero {
padding: 100px 0 80px;
text-align: center;
background: var(--el-bg-color);
position: relative;
overflow: hidden;
}
.dolma-hero::before {
content: '';
position: absolute;
top: -200px; left: 50%;
transform: translateX(-50%);
width: 900px; height: 500px;
background: radial-gradient(ellipse, rgba(2, 132, 199, 0.06) 0%, transparent 70%);
pointer-events: none;
}
.dolma-page .hero-badge {
display: inline-flex; align-items: center; gap: 8px;
padding: 6px 16px;
background: var(--el-bg-color-page); border: 1px solid var(--el-border-color-light);
border-radius: 9999px; font-size: 13px; font-weight: 600; color: var(--el-text-color-regular);
margin-bottom: 28px;
}
.dolma-page .hero-badge .badge-dot {
width: 6px; height: 6px; background: #10b981; border-radius: 50%;
display: inline-block;
}
.dolma-hero h1 {
font-size: clamp(36px, 5vw, 60px);
font-weight: 700; line-height: 1.05;
letter-spacing: normal; color: var(--el-text-color-primary);
margin-bottom: 20px;
position: relative;
}
.dolma-hero h1 span { color: #0284c7; }
.dolma-page .hero-subtitle {
font-size: clamp(16px, 2vw, 20px);
color: var(--el-text-color-regular); line-height: 1.6;
max-width: 620px; margin: 0 auto 56px;
position: relative;
}
.dolma-page .hero-actions {
display: flex; gap: 12px; justify-content: center;
flex-wrap: wrap; margin-bottom: 56px; position: relative;
}
.dolma-page .hero-highlights {
display: flex; align-items: center; justify-content: center;
gap: 16px; flex-wrap: wrap; position: relative;
}
.dolma-page .hero-highlights .h-item { font-size: 14px; color: var(--el-text-color-regular); font-weight: 500; }
.dolma-page .hero-highlights .h-div { width: 1px; height: 16px; background: var(--el-border-color-light); }
@media (max-width: 640px)
{ .dolma-page .hero-highlights .h-div { display: none; } .dolma-page .hero-highlights { gap: 8px 16px; } .dolma-page .hero-actions { flex-direction: column; align-items: center; } .dolma-page .hero-actions a { width: 100%; max-width: 280px; justify-content: center; } } .dolma-page .hero-cover { max-width: 720px; margin: 48px auto 0; border-radius: 16px; overflow: hidden; box-shadow: 0 8px 32px rgba(0,0,0,0.10); } .dolma-page .hero-cover img { width: 100%; height: auto; display: block; } .dolma-stats { padding: 48px 0; background: var(--el-bg-color-page); border-top: 1px solid var(--el-border-color-lighter); border-bottom: 1px solid var(--el-border-color-lighter); } .dolma-page .stats-grid { display: grid; grid-template-columns: repeat(4, 1fr); gap: 32px; text-align: center; } .dolma-page .stat-icon { font-size: 28px; margin-bottom: 12px; } .dolma-page .stat-val { font-size: clamp(28px, 4vw, 40px); font-weight: 700; color: var(--el-text-color-primary); letter-spacing: normal; margin-bottom: 4px; } .dolma-page .stat-lbl { font-size: 14px; color: var(--el-text-color-secondary); font-weight: 500; } @media (max-width: 768px) { .dolma-page .stats-grid { grid-template-columns: repeat(2, 1fr); gap: 24px; } } @media (max-width: 480px) { .dolma-page .stats-grid { grid-template-columns: 1fr; gap: 20px; } } .dolma-page .features-grid { display: grid; grid-template-columns: repeat(3, 1fr); gap: 24px; } .dolma-page .feat-card { padding: 32px 28px; border: none; border-radius: 20px; box-shadow: 0 2px 12px 0 rgba(0,0,0,0.08); background: var(--el-bg-color); transition: border-color 0.2s, box-shadow 0.2s, transform 0.15s; } .dolma-page .feat-card:hover { box-shadow: 0 8px 24px 0 rgba(0,0,0,0.12); transform: translateY(-2px); } .dolma-page .feat-icon { font-size: 32px; margin-bottom: 16px; } .dolma-page .feat-card h3 { font-size: 18px; font-weight: 700; color: var(--el-text-color-primary); margin-bottom: 8px; } .dolma-page .feat-card p { font-size: 15px; color: 
var(--el-text-color-regular); line-height: 1.6; } @media (max-width: 1024px) { .dolma-page .features-grid { grid-template-columns: repeat(2, 1fr); } } @media (max-width: 640px) { .dolma-page .features-grid { grid-template-columns: 1fr; } } .dolma-page .usecases-grid { display: grid; grid-template-columns: repeat(4, 1fr); gap: 20px; } .dolma-page .uc-card { padding: 28px 24px; background: var(--el-bg-color); border: none; border-radius: 20px; box-shadow: 0 2px 12px 0 rgba(0,0,0,0.08); text-align: center; transition: border-color 0.2s, box-shadow 0.2s, transform 0.15s; } .dolma-page .uc-card:hover { box-shadow: 0 8px 24px 0 rgba(0,0,0,0.12); transform: translateY(-2px); } .dolma-page .uc-icon { font-size: 36px; margin-bottom: 16px; } .dolma-page .uc-card h3 { font-size: 17px; font-weight: 700; color: var(--el-text-color-primary); margin-bottom: 8px; } .dolma-page .uc-card p { font-size: 14px; color: var(--el-text-color-regular); line-height: 1.6; } @media (max-width: 1024px) { .dolma-page .usecases-grid { grid-template-columns: repeat(2, 1fr); } } @media (max-width: 480px) { .dolma-page .usecases-grid { grid-template-columns: 1fr; } } .dolma-page .code-wrap { border-radius: 16px !important; overflow: hidden !important; border: 1px solid #334155 !important; background: #0f172a !important; max-width: 860px; margin: 0 auto; } .markdown-body .dolma-page .code-wrap { border-radius: 16px !important; overflow: hidden !important; border: 1px solid #334155 !important; background: #0f172a !important; } .dolma-page .code-bar { display: flex !important; align-items: center !important; justify-content: space-between !important; padding: 12px 20px !important; background: #1e293b !important; border-bottom: 1px solid #334155 !important; } .dolma-page .code-dots { display: flex; gap: 6px; } .dolma-page .code-dots i { width: 10px; height: 10px; border-radius: 50%; display: inline-block; } .dolma-page .code-dots .r { background: #ef4444; } .dolma-page .code-dots .y { background: 
#f59e0b; } .dolma-page .code-dots .g { background: #10b981; } .dolma-page .code-lang { font-size: 12px; color: var(--el-text-color-secondary); font-weight: 600; text-transform: uppercase; letter-spacing: 0.05em; } .dolma-page .code-block { padding: 24px !important; margin: 0 !important; overflow-x: auto !important; font-family: 'JetBrains Mono', 'Fira Code', 'SF Mono', monospace !important; font-size: 13.5px !important; line-height: 1.7 !important; color: #e2e8f0 !important; white-space: pre !important; background: transparent !important; border: none !important; border-radius: 0 !important; } .markdown-body .dolma-page .code-block { padding: 24px !important; margin: 0 !important; overflow-x: auto !important; font-family: 'JetBrains Mono', 'Fira Code', 'SF Mono', monospace !important; font-size: 13.5px !important; line-height: 1.7 !important; color: #e2e8f0 !important; white-space: pre !important; background: transparent !important; border: none !important; border-radius: 0 !important; } .dolma-page .steps-row { display: flex; align-items: flex-start; justify-content: center; margin-bottom: 48px; } .dolma-page .stp-card { flex: 1; max-width: 320px; text-align: center; padding: 0 24px; } .dolma-page .stp-num { font-size: clamp(48px, 6vw, 72px); font-weight: 700; color: #e2e8f0; letter-spacing: -0.04em; line-height: 1; margin-bottom: 20px; } .dolma-page .stp-card h3 { font-size: 18px; font-weight: 700; color: var(--el-text-color-primary); margin-bottom: 10px; } .dolma-page .stp-card p { font-size: 15px; color: var(--el-text-color-regular); line-height: 1.6; } .dolma-page .stp-conn { width: 60px; height: 2px; background: var(--el-border-color-light); margin-top: 36px; flex-shrink: 0; } .dolma-page .steps-cta { text-align: center; } @media (max-width: 768px) { .dolma-page .steps-row { flex-direction: column; align-items: center; gap: 32px; } .dolma-page .stp-conn { width: 2px; height: 32px; margin: 0; } .dolma-page .stp-card { max-width: 100%; } } .dolma-cta { padding: 
100px 0; background: #082f49; text-align: center; position: relative; overflow: hidden; } .dolma-cta::before { content: ''; position: absolute; top: -100px; left: 50%; transform: translateX(-50%); width: 700px; height: 400px; background: radial-gradient(ellipse, rgba(56, 189, 248, 0.12) 0%, transparent 70%); pointer-events: none; } .dolma-cta h2 { font-size: clamp(28px, 4vw, 44px); font-weight: 700; color: #bae6fd; letter-spacing: normal; margin-bottom: 28px; position: relative; } .dolma-cta > div > p { font-size: clamp(16px, 2vw, 18px); color: var(--el-text-color-secondary); max-width: 520px; margin: 0 auto 56px; line-height: 1.6; position: relative; } .dolma-page .cta-actions { display: flex; gap: 12px; justify-content: center; flex-wrap: wrap; position: relative; } .dolma-page .btn-cta-light { display: inline-flex; align-items: center; gap: 6px; padding: 14px 32px; background: #0284c7; color: #ffffff !important; border-radius: 9999px; font-size: 15px; font-weight: 700; transition: background 0.2s, transform 0.15s; text-decoration: none !important; } .dolma-page .btn-cta-light:hover { background: #0369a1; transform: translateY(-1px); text-decoration: none !important; } .dolma-page .btn-cta-ghost { display: inline-flex; align-items: center; padding: 14px 32px; background: transparent; color: #94a3b8 !important; border: 1px solid #0c4a6e; border-radius: 9999px; font-size: 15px; font-weight: 600; transition: border-color 0.2s, color 0.2s; text-decoration: none !important; } .dolma-page .btn-cta-ghost:hover { border-color: var(--el-text-color-regular); color: #e2e8f0 !important; text-decoration: none !important; } .dolma-page code { background: #f0f9ff !important; padding: 2px 8px !important; border-radius: 5px !important; font-size: 13px !important; font-family: 'JetBrains Mono', 'Fira Code', 'SF Mono', monospace !important; color: #0c4a6e !important; border: 1px solid #7dd3fc !important; } .dolma-page .s-text-dark { color: var(--el-text-color-primary); } 
.dolma-page .s-text-brand { color: #0284c7; } .dolma-page .s-section-body { font-size: 16px; color: var(--el-text-color-regular); line-height: 1.8; text-align: center; max-width: 680px; margin: 0 auto; } .dolma-page .s-section-body p + p { margin-top: 16px; } .dolma-page .tag-row { display: flex; gap: 8px; flex-wrap: wrap; justify-content: center; margin-top: 16px; } .dolma-page .tag-item
{
padding: 4px 12px; background: var(--el-bg-color-page);
border: 1px solid var(--el-border-color-light); border-radius: 9999px;
font-size: 12px; font-weight: 600; color: var(--el-text-color-regular);
}
html.dark .dolma-page { background: var(--el-bg-color); color: var(--el-text-color-primary); }
html.dark .dolma-page a { color: inherit; }
html.dark .markdown-body .dolma-page a { color: inherit !important; }
html.dark .markdown-body .dolma-page a.s-btn-primary,
html.dark .markdown-body .dolma-page a.btn-cta-light { color: #ffffff !important; }
html.dark .markdown-body .dolma-page a.s-btn-secondary { color: var(--el-text-color-primary) !important; }
html.dark .markdown-body .dolma-page a.btn-cta-ghost { color: #94a3b8 !important; }
html.dark .markdown-body .dolma-page a.btn-cta-ghost:hover { color: var(--el-text-color-primary) !important; }
html.dark .dolma-page .s-bg-white { background: var(--el-bg-color); }
html.dark .dolma-page .s-bg-gray { background: var(--el-bg-color-page); }
html.dark .dolma-page .s-bg-dark { background: var(--el-bg-color); }
html.dark .dolma-page .s-header h2 { color: var(--el-text-color-primary); }
html.dark .dolma-page .s-header p { color: var(--el-text-color-secondary); }
html.dark .dolma-page .s-btn-primary { background: #0284c7; color: #ffffff !important; }
html.dark .dolma-page .s-btn-primary:hover { background: #0369a1; }
html.dark .dolma-page .s-btn-secondary {
background: #1e293b; color: var(--el-text-color-primary) !important;
border-color: #475569;
}
html.dark .dolma-page .s-btn-secondary:hover { background: var(--el-border-color); border-color: var(--el-text-color-regular); }
html.dark .dolma-hero { background: var(--el-bg-color); }
html.dark .dolma-hero::before {
background: radial-gradient(ellipse, rgba(56, 189, 248, 0.15) 0%, transparent 70%);
}
html.dark .dolma-page .hero-badge { background: var(--el-bg-color-page); border-color: var(--el-border-color); color: var(--el-text-color-secondary); }
html.dark .dolma-hero h1 { color: var(--el-text-color-primary); }
html.dark .dolma-hero h1 span { color: #38bdf8; }
html.dark .dolma-page .hero-subtitle { color: var(--el-text-color-secondary); }
html.dark .dolma-page .hero-highlights .h-item { color: var(--el-text-color-secondary); }
html.dark .dolma-page .hero-highlights .h-div { background: var(--el-border-color); }
html.dark .dolma-stats { background: var(--el-bg-color-page); border-color: var(--el-border-color); }
html.dark .dolma-page .stat-val { color: var(--el-text-color-primary); }
html.dark .dolma-page .stat-lbl { color: var(--el-text-color-regular); }
html.dark .dolma-page .feat-card {
background: var(--el-bg-color-page); border-color: var(--el-border-color);
}
html.dark .dolma-page .feat-card:hover { border-color: var(--el-text-color-regular); box-shadow: 0 4px 16px rgba(0,0,0,0.3); }
html.dark .dolma-page .feat-card h3 { color: var(--el-text-color-primary); }
html.dark .dolma-page .feat-card p { color: var(--el-text-color-secondary); }
html.dark .dolma-page .uc-card { background: var(--el-bg-color-page); border-color: var(--el-border-color); }
html.dark .dolma-page .uc-card:hover { border-color: var(--el-text-color-regular); box-shadow: 0 4px 16px rgba(0,0,0,0.3); }
html.dark .dolma-page .uc-card h3 { color: var(--el-text-color-primary); }
html.dark .dolma-page .uc-card p { color: var(--el-text-color-secondary); }
html.dark .dolma-page .stp-num { color: #334155; }
html.dark .dolma-page .stp-card h3 { color: var(--el-text-color-primary); }
html.dark .dolma-page .stp-card p { color: var(--el-text-color-secondary); }
html.dark .dolma-page .stp-conn { background: var(--el-border-color); }
html.dark .dolma-page code {
background: #082f49 !important; color: #bae6fd !important; border-color: #0c4a6e !important;
}
html.dark .dolma-page .s-text-dark { color: var(--el-text-color-primary); }
html.dark .dolma-page .s-text-brand { color: #38bdf8; }
html.dark .dolma-page .s-section-body { color: var(--el-text-color-secondary); }
html.dark .dolma-page .tag-item { background: var(--el-border-color); border-color: var(--el-text-color-regular); color: var(--el-text-color-secondary); }
html.dark .dolma-cta { background: #082f49; }
html.dark .dolma-cta::before {
background: radial-gradient(ellipse, rgba(56, 189, 248, 0.2) 0%, transparent 70%);
}
html.dark .dolma-page .btn-cta-light { color: #ffffff !important; }
html.dark .dolma-page .btn-cta-ghost { color: #94a3b8 !important; }
html.dark .dolma-page .btn-cta-ghost:hover { color: var(--el-text-color-primary) !important; }
</style>
<div class="dolma-page">
<section class="dolma-hero">
<div class="s-container-narrow">
<div class="hero-badge">
<span class="badge-dot"></span>
Dolma Open Corpus
</div>
<h1>
Dolma Open<br/><span>Corpus</span>
</h1>
<p class="hero-subtitle">
Dolma is a large-scale open corpus created by the Allen Institute for AI (AI2), containing about 3 trillion tokens. It integrates six major data sources (Common Crawl, The Stack, C4, Reddit, Wikipedia, and Semantic Scholar) and is used to train the OLMo series of language models. It is one of the most transparent large-scale pre-training datasets currently available.
</p>
Dataset Highlights
A large-scale, multi-source, fully transparent open pre-training corpus
Trillion-Token Scale
Contains approximately 3 trillion tokens of text data, making it one of the largest publicly available pre-training corpora and providing ample data for training large language models.
Six Major Data Sources
Integrates six major sources: Common Crawl web pages, The Stack code, C4 filtered text, Reddit conversations, Wikipedia encyclopedia, and Semantic Scholar academic papers.
Fully Transparent
The Allen Institute for AI has published the complete data collection, cleaning, deduplication, and filtering process. Each processing step is traceable and auditable, which sets a high bar for dataset transparency.
Quality Filtering Pipeline
Applies a multi-stage quality filtering pipeline (language detection, content filtering, deduplication, and toxicity detection) to maintain the overall quality of the training data.
Reproducible Processing
All data processing code is open-sourced on GitHub, allowing researchers to fully reproduce the entire processing flow from raw data to final corpus.
Open License
Released under the ODC-By 1.0 open data license, which permits free use in academic research and commercial applications, provided proper attribution is given.
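The filtering stages listed under "Quality Filtering Pipeline" can be pictured with a small sketch. This is not Dolma's actual implementation: the ASCII-ratio language check, the word-count threshold, the exact-hash deduplication, and the `BLOCKED_TERMS` list are all simplified, hypothetical stand-ins chosen only to show how the stages chain together.

```python
import hashlib
import re

# Hypothetical blocklist standing in for a real toxicity classifier.
BLOCKED_TERMS = {"badword"}


def looks_english(text: str) -> bool:
    """Crude stand-in for language detection: at least 90% ASCII letters."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    ascii_letters = sum(1 for c in letters if c.isascii())
    return ascii_letters / len(letters) >= 0.9


def passes_content_filter(text: str, min_words: int = 5) -> bool:
    """Drop very short documents (assumed threshold, for illustration)."""
    return len(text.split()) >= min_words


def is_toxic(text: str) -> bool:
    """Flag documents containing any blocked term."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return bool(words & BLOCKED_TERMS)


def filter_documents(docs):
    """Run each document through the four stages, keeping survivors in order."""
    seen_hashes = set()
    kept = []
    for text in docs:
        if not looks_english(text):
            continue
        if not passes_content_filter(text):
            continue
        # Exact deduplication on whitespace-normalized, lowercased text.
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        if is_toxic(text):
            continue
        kept.append(text)
    return kept
```

Each stage here is a plain function, so a stage can be swapped for a stronger model-based check without changing the loop.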
Use Cases
Empowering the AI community from model training to data science research
LLM Pre-training
Serves as the core pre-training corpus for large language models, providing diverse and large-scale text data for training foundational models from scratch.
Data Mixture Research
Supports research into how data sources should be mixed, for example how the share of web pages, code, encyclopedic text, and academic papers affects model capabilities.
Ablation Experiments
Supports systematic study of each data component's contribution to model performance by removing or replacing specific data sources.
Reproducible AI Research
Fully open data and processing code make research results verifiable and reproducible, promoting scientific rigor in the AI field.
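The mixture and ablation use cases above both come down to controlling how often each source is sampled. The sketch below illustrates that idea under stated assumptions: the source names and weights are hypothetical placeholders, not Dolma's actual proportions, and `sample_sources` and `ablate` are illustrative helper names.

```python
import random


def sample_sources(weights, n, seed=0):
    """Draw n source names with probability proportional to their weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    rng = random.Random(seed)  # fixed seed keeps experiments repeatable
    names = list(weights)
    probabilities = [weights[name] / total for name in names]
    return rng.choices(names, weights=probabilities, k=n)


def ablate(weights, source):
    """Return a copy of the mixture with one source removed."""
    return {name: w for name, w in weights.items() if name != source}


# Hypothetical mixture used only for illustration.
mixture = {"web": 0.70, "code": 0.15, "papers": 0.10, "encyclopedia": 0.05}
baseline_draws = sample_sources(mixture, 1000)
ablated_draws = sample_sources(ablate(mixture, "code"), 1000)
```

Because the remaining weights are renormalized inside `sample_sources`, removing a source leaves the relative proportions of the others unchanged, which is one common way to set up an ablation.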
API Call Example
Quickly obtain Dolma dataset information through the Ace Data Cloud API
<pre class="code-block">import requests

# Dataset metadata endpoint on Ace Data Cloud
url = "https://api.acedata.cloud/datasets/dolma"
headers = {
    "Authorization": "Bearer YOUR_API_TOKEN",  # replace with your own token
    "Accept": "application/json",
}

response = requests.get(url, headers=headers)
response.raise_for_status()  # stop early on HTTP errors
data = response.json()

# View basic information about the dataset
print(f"Name: {data['name']}")
print(f"Number of Tokens: {data['tokens']}")
print(f"Data Sources: {data['sources']}")
print(f"License Agreement: {data['license']}")</pre>
Get Started in 3 Steps
From overview to usage: get started with large-model training data
Browse the Dataset
View the Dolma dataset details on the Ace Data Cloud platform to understand its data sources, token count, and license.
Obtain API Token
Register and obtain your API Token to access the dataset through the api.acedata.cloud interface.
Download and Train
Download the required data shards via API and start your pre-training or research experiments with the Dolma corpus.
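Once a shard is downloaded, a quick sanity check is useful before training. The sketch below assumes the shard is a gzip-compressed JSON Lines file in which each record has a `text` field; that layout and the helper names `iter_documents` and `summarize_shard` are assumptions for illustration, so adjust them to the files you actually receive.

```python
import gzip
import json


def iter_documents(shard_path):
    """Yield one parsed JSON record per non-empty line of a gzipped shard."""
    with gzip.open(shard_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize_shard(shard_path):
    """Count documents and whitespace-separated words in a shard."""
    doc_count = 0
    word_count = 0
    for doc in iter_documents(shard_path):
        doc_count += 1
        # Whitespace splitting is only a rough proxy for tokenizer counts.
        word_count += len(doc.get("text", "").split())
    return {"documents": doc_count, "words": word_count}
```

Streaming line by line keeps memory use flat even for large shards, since no shard is ever loaded whole.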
Start Exploring the Dolma Open Corpus
About 3 trillion tokens, six major data sources, and a fully transparent processing pipeline. Whether you are training the next generation of language models or conducting data science research, Dolma is a strong choice.
