WSC273 Pronoun Resolution Dataset
The Winograd Schema Challenge contains 273 carefully designed pronoun resolution problems that require common sense knowledge to determine referents, serving as a classic benchmark for evaluating AI language understanding capabilities.
Dataset Highlights
A classic AI benchmark with carefully designed pronoun resolution challenges
Alternative to Turing Test
Proposed by Hector Levesque as an alternative to the Turing Test. The problems are designed to resist statistical patterns and simple heuristics, so answering them correctly requires genuine language understanding.
Paired Sentence Design
Questions are built in pairs that differ by only one keyword, and changing that keyword flips the correct answer, so models cannot rely on shallow co-occurrence statistics and must understand the semantics.
Driven by Common Sense Knowledge
Correct answers require physical intuition, social common sense, and causal reasoning, which makes the benchmark a gold standard for evaluating deep language understanding in AI systems.
Strict Format Control
All questions follow a uniform sentence structure: a sentence with an ambiguous pronoun, two candidate referents, and one correct answer, facilitating standardized evaluation.
Academic Classic
Derived from Terry Winograd’s 1972 example and systematically extended by Levesque et al. in 2012, the benchmark is widely cited and recognized in NLP research.
Fair Evaluation Benchmark
With two candidate referents per question, random guessing has an expected accuracy of 50%. This removes the bias that label imbalance would introduce, so scores above that baseline reflect a model's common sense reasoning rather than class skew.
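To put the 50% baseline in context, the arithmetic below estimates how much a pure guesser's score can drift on a test of this size. It is a rough sketch that assumes independent, uniform guesses between two candidates and uses a normal approximation. The variable names are illustrative and not part of the dataset.

```python
import math

N_ITEMS = 273      # number of problems, as stated in the dataset description
N_CANDIDATES = 2   # two candidate referents per problem

# Probability that a uniform random guess picks the correct referent.
p = 1 / N_CANDIDATES

# Expected accuracy and its standard deviation over N_ITEMS independent guesses
# (binomial proportion: sqrt(p * (1 - p) / n)).
expected_accuracy = p
std_accuracy = math.sqrt(p * (1 - p) / N_ITEMS)

# Approximate 95% range for a random guesser (normal approximation).
low = expected_accuracy - 1.96 * std_accuracy
high = expected_accuracy + 1.96 * std_accuracy

print(f"expected accuracy: {expected_accuracy:.1%}")
print(f"standard deviation: {std_accuracy:.2%}")
print(f"approx. 95% range for random guessing: {low:.1%} to {high:.1%}")
```

Under these assumptions the standard deviation is roughly 3 percentage points, so a score only a few points above 50% on 273 items is hard to distinguish from chance.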
Applicable Scenarios
Widely applicable from language understanding research to AI system evaluation
Train and evaluate models to correctly identify entities referred to by pronouns within sentences, a core task in natural language understanding.
Common Sense Reasoning
Test whether AI systems possess multi-dimensional common sense knowledge, such as physical, social, and causal understanding, as a measure of deep semantic comprehension.
Pronoun Disambiguation
Resolve ambiguous pronoun references in natural language, crucial for machine translation, dialogue systems, and information extraction.
AI Completeness Testing
As an alternative to the Turing Test, it is used to evaluate whether AI systems reach human-level language understanding.
Data Preview
Below are sample questions from the WSC273 dataset, demonstrating how keyword changes in paired sentences flip the correct answer:

# Example 1 (Paired Sentences)
Sentence: The trophy doesn't fit into the brown suitcase because it is too large.
Pronoun: it
Candidate A: trophy
Candidate B: suitcase
Correct Answer: A (trophy)

Sentence: The trophy doesn't fit into the brown suitcase because it is too small.
Pronoun: it
Candidate A: trophy
Candidate B: suitcase
Correct Answer: B (suitcase)

# Example 2 (Paired Sentences)
Sentence: Joan made sure to thank Susan for all the help she had given.
Pronoun: she
Candidate A: Joan
Candidate B: Susan
Correct Answer: B (Susan)

Sentence: Joan made sure to thank Susan for all the help she had received.
Pronoun: she
Candidate A: Joan
Candidate B: Susan
Correct Answer: A (Joan)

# Example 3
Sentence: The city councilmen refused the demonstrators a permit because they feared violence.
Pronoun: they
Candidate A: councilmen
Candidate B: demonstrators
Correct Answer: A (councilmen)
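The samples above share one shape: a sentence, an ambiguous pronoun, two candidates, and the correct choice. One possible in-memory representation is sketched below. The class name `WSCItem` and its field names are assumptions made for illustration, and the actual field names in the downloaded file may differ.

```python
import re
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class WSCItem:
    """One pronoun resolution problem (hypothetical schema, not the official one)."""
    sentence: str
    pronoun: str
    candidates: Tuple[str, str]  # (Candidate A, Candidate B)
    answer: int                  # index into candidates: 0 for A, 1 for B

    def __post_init__(self):
        # Basic sanity checks on the record.
        if len(self.candidates) != 2:
            raise ValueError("expected exactly two candidate referents")
        if self.answer not in (0, 1):
            raise ValueError("answer must be 0 (A) or 1 (B)")
        pattern = rf"\b{re.escape(self.pronoun)}\b"
        if not re.search(pattern, self.sentence, re.IGNORECASE):
            raise ValueError("pronoun does not occur in the sentence")

    @property
    def referent(self) -> str:
        """Return the text of the correct candidate."""
        return self.candidates[self.answer]


# The first paired example from the preview, encoded with this schema.
trophy_large = WSCItem(
    sentence="The trophy doesn't fit into the brown suitcase because it is too large.",
    pronoun="it",
    candidates=("trophy", "suitcase"),
    answer=0,
)
trophy_small = WSCItem(
    sentence="The trophy doesn't fit into the brown suitcase because it is too small.",
    pronoun="it",
    candidates=("trophy", "suitcase"),
    answer=1,
)

print(trophy_large.referent, trophy_small.referent)
```

Storing the answer as an index rather than a string keeps the two halves of a pair directly comparable: the sentences differ by one word and the index flips.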
3-Step Quick Start
From browsing to analysis, start your NLP research project in minutes
Browse the Dataset
View dataset details on the Ace Data Cloud platform, including question format, field descriptions, and license information.
Download Data
Obtain the complete dataset containing 273 pronoun resolution problems, each with sentences, pronouns, candidate referents, and correct answers.
Load and Evaluate
Use json.load() or pandas.read_json() to load data and evaluate your language models’ common sense reasoning.
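The load-and-evaluate step might look like the sketch below. The file name `wsc273_sample.json`, the field names (`sentence`, `pronoun`, `candidates`, `answer`), and the helpers `load_items`, `accuracy`, and `always_first` are all assumptions for illustration, so check the field descriptions on the dataset page before relying on them. To stay runnable without the real download, the sketch writes a two-item sample to a temporary file first.

```python
import json
import tempfile
from pathlib import Path


def load_items(path):
    """Load a list of problem records from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def accuracy(items, predict):
    """Fraction of items where predict(item) equals the gold answer index."""
    correct = sum(predict(item) == item["answer"] for item in items)
    return correct / len(items)


def always_first(item):
    """Trivial baseline: always choose Candidate A (index 0)."""
    return 0


# Two-item stand-in for the real file, using the assumed field names.
sample = [
    {
        "sentence": "The trophy doesn't fit into the brown suitcase because it is too large.",
        "pronoun": "it",
        "candidates": ["trophy", "suitcase"],
        "answer": 0,
    },
    {
        "sentence": "The trophy doesn't fit into the brown suitcase because it is too small.",
        "pronoun": "it",
        "candidates": ["trophy", "suitcase"],
        "answer": 1,
    },
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "wsc273_sample.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    items = load_items(path)

score = accuracy(items, always_first)
print(f"always-first baseline accuracy: {score:.1%}")
```

To evaluate a real model, replace `always_first` with a function that returns the index of the candidate the model prefers. On a complete pair, a position-only baseline like this one gets exactly one of the two sentences right, which is the behavior the paired design is meant to expose.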
Start Exploring the WSC273 Pronoun Resolution Data
A classic AI benchmark with open licensing, available for immediate download. Whether you're an NLP researcher or an AI system developer, this dataset is an essential tool for evaluating language understanding capabilities.
