Data Labeling & AI
Incentivize truthful data labeling for AI training without centralized review pipelines.
The Problem
AI training requires massive labeled datasets. Current approaches:
- Human review — expensive ($0.10–$2.00 per label), inconsistent, slow
- Crowdsourcing — quality varies wildly, strong incentives to game
- Expert panels — don't scale; reviewers become the bottleneck
- LLM self-labeling — circular, amplifies biases
The core issue: how do you verify label quality without a ground truth oracle? This is exactly the problem the SKC mechanism was designed to solve.
How Yiling Solves This
Each labeling task becomes a market. Labelers post bonds and submit their assessments. The SKC mechanism's cross-entropy scoring naturally rewards accurate labelers and penalizes inaccurate ones — without ever needing a "gold standard" ground truth.
Task: "Is this image NSFW?" / "Is this text toxic?"
↓
Labelers submit probability assessments with bonds
↓
SKC resolves → consensus label + quality scores per labeler
↓
Use scores to weight labels and build reputation
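The flow above can be sketched end-to-end. This is a minimal illustration assuming binary labels reported as probabilities; the function names (`cross_entropy`, `score_labelers`) are illustrative, not part of any Yiling API:

```python
import math

def cross_entropy(p_ref, p_report, eps=1e-9):
    """Cross-entropy of a reported probability against a reference
    probability for a binary label (e.g. NSFW / not-NSFW)."""
    r = min(max(p_report, eps), 1 - eps)
    return -(p_ref * math.log(r) + (1 - p_ref) * math.log(1 - r))

def score_labelers(reports):
    """Score every labeler's report against the final report, which
    stands in for the unobservable ground truth.
    Higher (less negative) score = closer to the reference."""
    reference = reports[-1]
    return [-cross_entropy(reference, r) for r in reports]

# Three labelers assess "Is this image NSFW?" in sequence:
reports = [0.5, 0.7, 0.9]
scores = score_labelers(reports)
```

Labelers whose assessments sit closer to the final, most-informed report receive better scores, which can then feed directly into label weighting and reputation.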
Why This Works
The SKC mechanism is an instance of information elicitation without verification: reports are scored without any access to ground truth. The key insight from the Harvard research:
"A reference agent with access to more information can serve as a reasonable proxy for the ground truth."
Each subsequent labeler sees the previous labels and adds their own signal. The final labeler's assessment, informed by all predecessors, becomes the reference against which every earlier report is scored.
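Concretely, one natural payout structure is a market scoring rule: each labeler is paid for the improvement their report makes over the previous report, as judged by the final reference report. The sketch below is an assumption about how the cross-entropy scoring composes, not a transcription of the Yiling contract:

```python
import math

def log_score(reference, report, eps=1e-9):
    """Binary log scoring rule, with the final report standing in
    for the unobservable ground truth."""
    r = min(max(report, eps), 1 - eps)
    return reference * math.log(r) + (1 - reference) * math.log(1 - r)

def payout(reference, report, prev_report):
    """Labeler i's payout: how much their report improved on their
    predecessor's, as judged against the reference (final) report."""
    return log_score(reference, report) - log_score(reference, prev_report)
```

Payouts telescope: summed over the chain they equal the total improvement from the opening prior to the final report, so the bond pool redistributes from labelers who degraded the consensus estimate to those who improved it.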
Applications
- Content moderation — toxic, NSFW, misinformation detection
- RLHF data — preference labels for AI alignment
- Medical imaging — diagnostic label consensus
- Fact-checking — claim verification
- Sentiment analysis — subjective classification at scale