AI training data labeling. Revenue collapsed.
Appen operated the world's largest human data annotation workforce — over 1 million contractors labeling images, text, and audio to train AI models. Ironically, the AI models they helped build became capable of generating their own training data. RLHF (Reinforcement Learning from Human Feedback) automation and synthetic data generation destroyed demand for manual labeling. Revenue fell from $461M to under $100M. The stock collapsed 96% from its peak.
AI companies moved to synthetic data generation and automated RLHF pipelines, eliminating the need for Appen's army of human data labelers.
Peak: $461M revenue, 1M+ crowd workers
Major AI labs begin synthetic data experiments
Revenue begins declining, contract sizes shrink
Mass layoffs, revenue under $200M
Stock -96%, emergency restructuring
Effectively a zombie company
Delisted from ASX, remaining contracts wound down
Replace manual data labeling with LLM-powered annotation pipelines. Use frontier models to generate synthetic training data, auto-label datasets, and evaluate model outputs — reducing annotation costs by 90%+.
Define your annotation schema (categories, entity types, scoring rubrics)
Write a detailed system prompt with labeling instructions + 5-10 few-shot examples
Process your dataset through the Claude or GPT-4 API in batches
Route low-confidence outputs to human reviewers in Label Studio
Calculate inter-annotator agreement between the AI labels and a human-labeled gold set
Iterate on the prompt until agreement exceeds your quality threshold (typically 90%+)
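The six steps above can be sketched as a small pipeline. This is a minimal illustration, not a production system: `label_with_llm` is a keyword stub standing in for the real Claude or GPT-4 API call, and agreement is measured as simple percent match against the gold set (a real evaluation might use Cohen's kappa instead).

```python
def label_with_llm(text: str) -> tuple[str, float]:
    """Stand-in for the real API call (Anthropic or OpenAI client).
    Returns (label, confidence); here a keyword stub for illustration."""
    if "love" in text or "great" in text:
        return "POSITIVE", 0.95
    if "hate" in text or "broken" in text:
        return "NEGATIVE", 0.90
    return "NEUTRAL", 0.55  # low confidence

def run_pipeline(texts, confidence_threshold=0.8):
    """Auto-label confident cases; queue the rest for human review."""
    auto_labels, review_queue = {}, []
    for t in texts:
        label, conf = label_with_llm(t)
        if conf >= confidence_threshold:
            auto_labels[t] = label
        else:
            review_queue.append(t)  # route to Label Studio reviewers
    return auto_labels, review_queue

def agreement(ai_labels: dict, gold: dict) -> float:
    """Percent agreement between AI labels and the human gold set."""
    shared = [t for t in gold if t in ai_labels]
    if not shared:
        return 0.0
    return sum(ai_labels[t] == gold[t] for t in shared) / len(shared)
```

If `agreement` comes back below your threshold, revise the prompt or few-shot examples and re-run the gold set before labeling the full dataset.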
You are a data annotator. Classify the following text into exactly one category: [POSITIVE, NEGATIVE, NEUTRAL]. Respond with only the label. Text: "{{text}}" Label:
Generate 20 diverse examples of customer support conversations about [topic]. Each should include: a realistic customer message, the ideal agent response, and a sentiment label. Vary the tone, complexity, and customer frustration level. Output as JSON.
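Synthetic-data output should be validated before it enters a training set: models sometimes drop fields or emit out-of-schema values. A sketch of that validation step, assuming the JSON records use the field names `customer_message`, `agent_response`, and `sentiment` (the exact keys depend on how you phrase the prompt):

```python
import json

REQUIRED = {"customer_message", "agent_response", "sentiment"}  # assumed schema
SENTIMENTS = {"POSITIVE", "NEGATIVE", "NEUTRAL"}

def parse_synthetic_batch(raw: str):
    """Split a model's JSON output into schema-valid and rejected records."""
    records = json.loads(raw)
    valid, rejected = [], []
    for r in records:
        if isinstance(r, dict) and REQUIRED <= r.keys() and r["sentiment"] in SENTIMENTS:
            valid.append(r)
        else:
            rejected.append(r)  # log and optionally re-generate
    return valid, rejected
```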
Given the following prompt and two model responses, determine which response is better. Consider helpfulness, accuracy, safety, and conciseness. Prompt: {{prompt}} Response A: {{response_a}} Response B: {{response_b}} Better response (A or B): Reasoning:
For RLHF: generate candidate responses from two model variants, then use an AI judge to label which response in each pair is preferred
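Wiring that together, a preference-pair builder might look like the sketch below. All three `*_fn` arguments are stand-ins for real model API calls; the output uses the `{"prompt", "chosen", "rejected"}` shape common in RLHF/DPO training data, which is an assumption about your downstream trainer.

```python
def build_preference_pairs(prompts, model_a_fn, model_b_fn, judge_fn):
    """Create (chosen, rejected) pairs from two model variants, with an
    AI judge (returning "A" or "B") picking the winner for each prompt."""
    pairs = []
    for p in prompts:
        ra, rb = model_a_fn(p), model_b_fn(p)
        verdict = judge_fn(p, ra, rb)
        if verdict == "A":
            pairs.append({"prompt": p, "chosen": ra, "rejected": rb})
        elif verdict == "B":
            pairs.append({"prompt": p, "chosen": rb, "rejected": ra})
        # ties / unparseable verdicts are dropped rather than guessed
    return pairs
```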