SNLI Dataset

SNLI Natural Language Inference Dataset

The first large-scale natural language inference dataset, containing 570,000 manually crafted sentence pairs labeled with entailment, contradiction, and neutral relations. A foundational benchmark in the NLI field.

570,000 Sentence Pairs · 3 Label Types · CC BY-SA 4.0 License · Bowman et al. (2015)
📊 570K Total Sentence Pairs
🏷️ 3 Label Categories
👥 Up to 5 Annotator Labels per Sample
📜 CC BY-SA 4.0 Open License

Dataset Highlights

A foundational benchmark dataset in NLI, driving progress in natural language understanding research

🌍

Large-Scale Manual Annotation

All 570,000 hypotheses were written by crowdworkers rather than generated automatically. A validation subset of roughly 10% of the pairs, including the dev and test sets, received four additional judgments, giving five labels per validated pair for quality control.

🔗

Three Semantic Relations

Covers the three core inference relations between a premise and a hypothesis: entailment (the hypothesis follows from the premise), contradiction (the hypothesis conflicts with the premise), and neutral (neither holds).

🎯

Foundational Benchmark

As the first large-scale NLI dataset, SNLI is widely used to evaluate and compare natural language understanding models, including pretrained transformers such as BERT and GPT.

📷

Vision-Based Text

Premise sentences come from Flickr30K image captions, so they describe concrete visual scenes and offer a natural link to multimodal research.

📖

Annotator Consistency

Validated samples retain the individual labels from up to 5 annotators, which supports inter-annotator agreement analysis and makes the data suitable for meta-learning and data quality research.
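As a rough illustration of agreement analysis, the sketch below computes the share of annotator labels that match the gold label for one record. The function names `agreement_rate` and `is_unanimous` are hypothetical helpers, and the record layout is assumed to follow the `label` and `annotator_labels` fields shown in the data preview.

```python
def agreement_rate(annotator_labels, gold_label):
    """Fraction of individual annotator labels that match the gold label."""
    if not annotator_labels:
        return 0.0
    matches = sum(1 for label in annotator_labels if label == gold_label)
    return matches / len(annotator_labels)


def is_unanimous(annotator_labels):
    """True when every annotator chose the same label."""
    return len(annotator_labels) > 0 and len(set(annotator_labels)) == 1


# Illustrative record using the field names from the data preview.
record = {
    "label": "neutral",
    "annotator_labels": ["neutral", "entailment", "neutral", "neutral", "neutral"],
}

print(agreement_rate(record["annotator_labels"], record["label"]))
print(is_unanimous(record["annotator_labels"]))
```

Averaging `agreement_rate` over a split gives a simple per-split agreement figure; chance-corrected measures such as Fleiss' kappa would need a dedicated implementation.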

🏛️

Academic Authority

Released by the Stanford NLP group, the paper (Bowman et al., 2015) has been cited thousands of times. SNLI is one of the most influential datasets in NLP and is widely adopted in academia and industry.

Applicable Scenarios

From fundamental research to industrial applications, covering the full natural language understanding pipeline

🧠

Natural Language Inference

Train and evaluate NLI models to determine entailment, contradiction, or neutral relations between two sentences

📐

Sentence Embeddings

Use sentence pair relations to train high-quality sentence vector representations, improving semantic similarity and retrieval performance
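One common way to use NLI pairs for sentence embeddings is to turn them into (anchor, positive, negative) triplets, with the entailed hypothesis as the positive and the contradicted hypothesis as the negative. The sketch below shows only that data preparation step, assuming records with `premise`, `hypothesis`, and `label` fields; `build_triplets` is a hypothetical helper, and the model and loss are left out.

```python
from collections import defaultdict


def build_triplets(records):
    """Group hypotheses by premise and pair entailments with contradictions."""
    by_premise = defaultdict(lambda: {"entailment": [], "contradiction": []})
    for record in records:
        if record["label"] in ("entailment", "contradiction"):
            by_premise[record["premise"]][record["label"]].append(record["hypothesis"])

    triplets = []
    for premise, hypotheses in by_premise.items():
        for positive in hypotheses["entailment"]:
            for negative in hypotheses["contradiction"]:
                triplets.append((premise, positive, negative))
    return triplets


records = [
    {"premise": "A person on a horse jumps over a broken down airplane.",
     "hypothesis": "A person is outdoors, on a horse.",
     "label": "entailment"},
    {"premise": "A person on a horse jumps over a broken down airplane.",
     "hypothesis": "A person is at a diner, ordering an omelette.",
     "label": "contradiction"},
    {"premise": "Children smiling and waving at camera.",
     "hypothesis": "The kids are frowning.",
     "label": "contradiction"},
]

for anchor, positive, negative in build_triplets(records):
    print(anchor, "|", positive, "|", negative)
```

Premises that lack either an entailed or a contradicted hypothesis produce no triplet, and neutral pairs are ignored in this sketch.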

🔄

Transfer Learning

Fine-tune pretrained models such as BERT and RoBERTa on SNLI as an intermediate task to improve performance on downstream NLU tasks

Textual Entailment Detection

Build core reasoning modules for applications such as fact verification, question answering, and text consistency checking

Natural Language Inference · Textual Entailment · Sentence Understanding · Stanford NLP · Benchmark Dataset

Data Preview

Below is a JSON-format example from the SNLI dataset, showing the premise, hypothesis, label, and annotator_labels fields

JSON
[
  {
    "premise": "A person on a horse jumps over a broken down airplane.",
    "hypothesis": "A person is training his horse for a competition.",
    "label": "neutral",
    "annotator_labels": ["neutral", "entailment", "neutral", "neutral", "neutral"]
  },
  {
    "premise": "A person on a horse jumps over a broken down airplane.",
    "hypothesis": "A person is at a diner, ordering an omelette.",
    "label": "contradiction",
    "annotator_labels": ["contradiction", "contradiction", "contradiction", "contradiction", "contradiction"]
  },
  {
    "premise": "A person on a horse jumps over a broken down airplane.",
    "hypothesis": "A person is outdoors, on a horse.",
    "label": "entailment",
    "annotator_labels": ["entailment", "entailment", "entailment", "entailment", "entailment"]
  },
  {
    "premise": "Children smiling and waving at camera.",
    "hypothesis": "They are smiling at their parents.",
    "label": "neutral",
    "annotator_labels": ["neutral", "neutral", "neutral", "neutral", "entailment"]
  },
  {
    "premise": "Children smiling and waving at camera.",
    "hypothesis": "The kids are frowning.",
    "label": "contradiction",
    "annotator_labels": ["contradiction", "contradiction", "contradiction", "contradiction", "contradiction"]
  }
]
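The relationship between `annotator_labels` and `label` in the preview can be checked with a majority vote. The sketch below assumes the gold label is the one chosen by at least three of five annotators, which matches the records shown above; `majority_label` and its `min_votes` parameter are hypothetical names introduced for illustration.

```python
from collections import Counter


def majority_label(annotator_labels, min_votes=3):
    """Return the label with at least `min_votes` votes, or None if no label qualifies."""
    if not annotator_labels:
        return None
    label, votes = Counter(annotator_labels).most_common(1)[0]
    return label if votes >= min_votes else None


record = {
    "premise": "Children smiling and waving at camera.",
    "hypothesis": "They are smiling at their parents.",
    "label": "neutral",
    "annotator_labels": ["neutral", "neutral", "neutral", "neutral", "entailment"],
}

consensus = majority_label(record["annotator_labels"])
print("consensus:", consensus, "| matches gold:", consensus == record["label"])
```

Records with fewer than three annotator labels return None under the default threshold, so pass a lower `min_votes` when checking pairs that were labeled by a single annotator.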

3 Steps to Get Started Quickly

From browsing to research, start your NLI experiments in minutes

01

Browse the Dataset

View dataset details on the Ace Data Cloud platform, including field descriptions, label distributions, and license information.

02

Download Data

Obtain the SNLI dataset's train/dev/test splits, containing 570,000 sentence pairs in JSON format, ready to use out of the box.

03

Load and Train

Use datasets.load_dataset("snli") or directly load the JSON to start training and evaluating NLI models.
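For the direct JSON route, a minimal loading sketch might look like the following. It assumes the downloaded file is a JSON array of records shaped like the data preview; the file name `snli_train.json` and the helpers `load_pairs` and `label_distribution` are hypothetical.

```python
import json
from collections import Counter

VALID_LABELS = {"entailment", "contradiction", "neutral"}


def load_pairs(path):
    """Load a JSON array of records and keep only those with one of the three labels."""
    with open(path, encoding="utf-8") as handle:
        records = json.load(handle)
    return [record for record in records if record.get("label") in VALID_LABELS]


def label_distribution(records):
    """Count how many records carry each label."""
    return Counter(record["label"] for record in records)


# Example usage (the path is an assumption, adjust it to the downloaded file):
# pairs = load_pairs("snli_train.json")
# print(label_distribution(pairs))
```

The filter drops pairs without a consensus label, which the original SNLI distribution marks with "-".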

Start Exploring Natural Language Inference Data

A foundational benchmark in NLI, open licensed, available for immediate download. Whether you are an NLP researcher or a deep learning engineer, SNLI is an indispensable experimental cornerstone.