Scaling Preference Elicitation
The Engineering Problem
Collecting human preference data sounds simple: show people two things, ask which they prefer, record the answer. In practice, every decision in the elicitation pipeline affects data quality and interpretability. Presentation order, stimulus pairing strategy, response timing, fatigue effects, anchoring bias—all of these introduce systematic distortion that the downstream model will absorb unless controlled.
This post covers the engineering decisions behind CommandAGI's elicitation pipeline: three modalities, quality control for subjective data, version control for evolving preferences, and the interface design that keeps the signal clean.
Three Modalities, One Profile
We collect preferences through three complementary modalities, each probing different aspects of the preference landscape at different information-per-response rates.
Questions
Labels
Comparisons
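Comparison pairs can be chosen adaptively rather than at random. A minimal sketch of Thompson sampling over a Gaussian posterior on Bradley-Terry utilities — the `mu`/`sigma` arrays and the `select_pair` helper are hypothetical names, not the production implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def select_pair(mu, sigma):
    """Thompson sampling for active pair selection.

    mu, sigma: posterior mean and std of each stimulus's Bradley-Terry
    utility (Gaussian approximation). Draw one utility sample per item,
    then return the pair whose sampled utilities are closest -- the
    comparison the current model is least sure about.
    """
    u = rng.normal(mu, sigma)      # one posterior sample per stimulus
    order = np.argsort(u)          # neighbors in sampled order are closest
    gaps = np.diff(u[order])
    k = int(np.argmin(gaps))       # smallest sampled utility gap
    return order[k], order[k + 1]  # indices of the selected pair
```

Sampling utilities (rather than using posterior means directly) injects exploration: pairs the posterior is uncertain about get selected more often, which is what keeps information per comparison high.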
For each comparison, we select the pair for which |u_i - u_j| is smallest under the current model—the pair the model is least sure about. In practice, we maintain a Bayesian posterior over Bradley-Terry parameters and select pairs by Thompson sampling.

Quality Control for Subjective Data
The failure mode specific to subjective data is inconsistency. If someone prefers A to B, B to C, and C to A, the Bradley-Terry model can't fit it cleanly. Some cyclicity is natural—preferences are genuinely intransitive in certain domains. But excessive cyclicity signals inattention.
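For reference, the Bradley-Terry model assigns each stimulus a latent utility and models P(i beats j) as a logistic function of the utility gap. A minimal maximum-likelihood fit by gradient ascent — a sketch under those assumptions, not the production fitter:

```python
import math

def fit_bradley_terry(prefs, n_items, lr=0.1, steps=500):
    """Minimal Bradley-Terry fit: P(i beats j) = sigmoid(u_i - u_j).

    prefs: list of (winner, loser) index pairs.
    Returns one latent utility per item (anchored only up to a shift).
    """
    u = [0.0] * n_items
    for _ in range(steps):
        for w, l in prefs:
            p = 1.0 / (1.0 + math.exp(-(u[w] - u[l])))  # P(winner beats loser)
            g = lr * (1.0 - p)  # gradient of the log-likelihood for this pair
            u[w] += g
            u[l] -= g
    return u
```

A cyclic preference set (A over B, B over C, C over A) forces the fitted utilities toward equality, which is why the model "can't fit it cleanly" and why cyclicity is a usable quality signal.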
Three quality metrics run in real time:
Transitivity Ratio
Test-Retest Reliability
Inter-Annotator Agreement
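The first metric can be computed directly from an annotator's pairwise choices: score every triple in which all three pairs were judged, and count the fraction free of cycles. A sketch (the function name is hypothetical):

```python
from itertools import combinations

def transitivity_ratio(prefs):
    """prefs: set of (winner, loser) pairs from one annotator.

    Returns the fraction of fully-judged triples that are transitive.
    """
    items = {x for pair in prefs for x in pair}
    beats = set(prefs)
    total = transitive = 0
    for a, b, c in combinations(sorted(items), 3):
        # Only score triples where all three pairs were judged.
        pairs = [(a, b), (b, c), (a, c)]
        if not all((x, y) in beats or (y, x) in beats for x, y in pairs):
            continue
        total += 1
        # A triple is cyclic iff each item beats exactly one of the others.
        wins = [sum((x, y) in beats for y in (a, b, c) if y != x)
                for x in (a, b, c)]
        if wins != [1, 1, 1]:
            transitive += 1
    return transitive / total if total else 1.0
```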
Low-quality annotations get flagged in real time. Annotators with persistently low scores get retrained or removed. There is no batch review step where bad data accumulates undetected—the quality pipeline runs continuously against the same posterior that drives pair selection.
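Inter-annotator agreement on shared calibration items can be tracked with a chance-corrected statistic such as Cohen's kappa — a minimal sketch, assuming two annotators label the same item list:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' labels over the same items.

    1.0 = perfect agreement, 0.0 = agreement at chance level.
    """
    assert len(a) == len(b) and a
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    labels = set(a) | set(b)
    pe = sum((a.count(l) / n) * (b.count(l) / n)
             for l in labels)                   # expected agreement by chance
    return (po - pe) / (1 - pe) if pe < 1 else 1.0
```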
Version Control for Evolving Preferences
Preferences change. What you found compelling last year isn't necessarily what compels you now. A static model misses this—and misses the fact that preference evolution is itself informative data. The trajectory of taste through preference space reveals which aesthetic attractors are stable and which transitions are common.
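One way to make preference evolution queryable is to store every response as an append-only, timestamped event and rebuild the model as of any point in time. A sketch with hypothetical field and function names:

```python
import bisect
from dataclasses import dataclass

@dataclass(frozen=True)
class PreferenceEvent:
    timestamp: float  # unix time the response was recorded
    winner: str
    loser: str

def responses_as_of(log, t):
    """Return every event recorded at or before time t.

    log must be sorted by timestamp (append-only logs are, by construction).
    Refitting the model on each prefix yields the preference trajectory.
    """
    times = [e.timestamp for e in log]
    return log[:bisect.bisect_right(times, t)]
```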
The Interface
Annotation interfaces affect data quality. A slow, confusing interface produces noisy data. We optimized for speed and low cognitive overhead: keyboard shortcuts for all actions (1-4 to select quality, S to skip, arrow keys to navigate), minimal visual clutter, large stimulus presentation, and immediate feedback.
Multi-Modal Coverage
The platform supports annotation across modalities—images, video, text, code, websites, audio, documents. Each modality has its own rendering pipeline and interaction patterns. Video uses frame extraction with configurable sampling rates. Audio presents waveform visualization with playback controls. Code uses syntax-highlighted side-by-side diffs. Each modality probes different dimensions of the underlying experiential geometry.
Scale
A single taste profile requires ~140 total responses (10 questions + 30 labels + 100 comparisons). A professional annotator completes a profile in about 12 minutes. The marketplace currently supports concurrent annotation by multiple annotators against the same stimulus set, with inter-annotator agreement tracked per calibration item.
Every profile is a partial observation of one person's preference geometry. At scale, partial observations constrain the underlying structure. A million comparisons across a thousand people reveal the empirical distribution of human preference—and that distribution has structure. It clusters in ways that reflect shared human architecture.