Engineering · February 2025

Scaling Preference Elicitation

The Engineering Problem

Collecting human preference data sounds simple: show people two things, ask which they prefer, record the answer. In practice, every decision in the elicitation pipeline affects data quality and interpretability. Presentation order, stimulus pairing strategy, response timing, fatigue effects, anchoring bias—all of these introduce systematic distortion that the downstream model will absorb unless controlled.

This post covers the engineering decisions behind CommandAGI's elicitation pipeline: three modalities, quality control for subjective data, version control for evolving preferences, and the interface design that keeps the signal clean.

Three Modalities, One Profile

We collect preferences through three complementary modalities, each probing different aspects of the preference landscape at different information-per-response rates.

[Screenshot: the labeling interface, with frame_0042.jpg under review (item 42/56)]
Questions (~10 per profile)

High-level constraints that partition preference space coarsely. Binary or categorical: "Do you prefer warm or cool tones?" Each question bisects the space, ruling out large regions. They're cheap to answer (~5s each) and provide scaffolding for the more precise modalities.
Labels (~30 per profile)

The annotator sees a single stimulus and marks it on a quality scale (1-4). Labels give absolute anchoring: where "good enough" lives for this person. They're more informative per item than comparisons, but susceptible to scale bias and anchoring effects.
Comparisons (~100 per profile)

Forced choices between pairs. Each comparison yields one bit of ordinal information about local preference topology. Pairs are selected adaptively: the system picks the pair where the current model has maximum uncertainty (maximum expected information gain).
Definition
Adaptive pair selection: given a current estimate of the latent utility surface and its uncertainty, select the pair (i, j) that maximizes the expected reduction in posterior entropy. This is roughly equivalent to choosing the pair where |u_i - u_j| is smallest under the current model, i.e. the pair whose outcome the model is least sure about. In practice, we maintain a Bayesian posterior over Bradley-Terry parameters and select pairs by Thompson sampling.
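A minimal sketch of the selection step, assuming an independent-Gaussian approximation to the per-item Bradley-Terry utilities (the function name and the approximation are illustrative, not the production implementation):

```python
import numpy as np

def select_pair_thompson(mu, sigma, rng=None):
    """Pick the next comparison pair via Thompson sampling.

    mu, sigma: per-item posterior means and standard deviations
    (an independent-Gaussian approximation to the Bradley-Terry
    posterior). Draw one utility sample per item, then return the
    pair whose sampled utilities are closest -- the pair the
    sampled model is least sure about.
    """
    rng = rng or np.random.default_rng()
    u = rng.normal(mu, sigma)   # one joint posterior sample
    order = np.argsort(u)       # closest utilities are adjacent when sorted
    gaps = np.diff(u[order])
    k = int(np.argmin(gaps))    # smallest sampled utility gap
    return order[k], order[k + 1]
```

Sampling before picking the minimum-gap pair is what makes this Thompson sampling rather than a greedy rule: uncertain items get explored even when their posterior means are far apart.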

Quality Control for Subjective Data

The failure mode specific to subjective data is inconsistency. If someone prefers A to B, B to C, and C to A, the Bradley-Terry model can't fit it cleanly. Some cyclicity is natural—preferences are genuinely intransitive in certain domains. But excessive cyclicity signals inattention.

Three quality metrics run in real time:

Metric 1: Transitivity Ratio

Fraction of comparison triples that satisfy transitivity. For random responses: ~0.75. For attentive annotators: >0.90. Below 0.85 triggers a quality flag. Computed on a rolling window of the last 50 comparisons.
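The ratio itself is simple to compute; here is a self-contained sketch (in production this would run over the rolling 50-comparison window):

```python
from itertools import combinations

def transitivity_ratio(comparisons):
    """Fraction of fully-observed triples that are transitive.

    comparisons: iterable of (winner, loser) pairs; if a pair was
    judged more than once, the most recent judgment wins.
    """
    beats = {}
    items = set()
    for w, l in comparisons:
        beats[frozenset((w, l))] = w
        items.update((w, l))

    total = transitive = 0
    for a, b, c in combinations(sorted(items), 3):
        pairs = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
        if not all(p in beats for p in pairs):
            continue  # only score triples where all three comparisons exist
        total += 1
        winners = [beats[p] for p in pairs]
        # a cycle is exactly the case where each item wins once;
        # a transitive triple has some item winning twice
        if len(set(winners)) < 3:
            transitive += 1
    return transitive / total if total else None
```

The ~0.75 baseline for random responders falls out directly: of the 8 possible outcomes for a triple, 2 are cyclic, so 6/8 satisfy transitivity by chance.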
Metric 2: Test-Retest Reliability

A small fraction of comparisons (~5%) are repeated later in the session. The agreement rate between first and second presentation measures temporal consistency. Below 0.70 suggests the annotator isn't engaging with the stimuli.
Metric 3: Inter-Annotator Agreement

Calibration stimuli are shown to multiple annotators. Krippendorff's alpha on these shared items separates universal structure from idiosyncratic noise. Also identifies systematic annotator biases (position preference, recency bias).

Low-quality annotations get flagged in real time. Annotators with persistently low scores get retrained or removed. There is no batch review step where bad data accumulates undetected—the quality pipeline runs continuously against the same posterior that drives pair selection.

Version Control for Evolving Preferences

Preferences change. What you found compelling last year isn't necessarily what compels you now. A static model misses this—and misses the fact that preference evolution is itself informative data. The trajectory of taste through preference space reveals which aesthetic attractors are stable and which transitions are common.
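One way to make that trajectory queryable is an append-only log of profile snapshots, so earlier states are never overwritten and any point in the history can be replayed. This is a hypothetical sketch of the idea, not CommandAGI's actual storage layer:

```python
import time
from dataclasses import dataclass, field

@dataclass
class ProfileHistory:
    """Append-only log of (timestamp, parameters) profile snapshots.

    Snapshots are assumed to be committed in time order. Old entries
    are never mutated, so a profile's trajectory through preference
    space can be diffed or replayed later.
    """
    snapshots: list = field(default_factory=list)

    def commit(self, params, ts=None):
        self.snapshots.append((ts if ts is not None else time.time(), dict(params)))

    def at(self, ts):
        """Latest snapshot at or before ts (None if none was committed yet)."""
        best = None
        for t, params in self.snapshots:
            if t <= ts:
                best = params
        return best
```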

The Interface

Annotation interfaces affect data quality. A slow, confusing interface produces noisy data. We optimized for speed and low cognitive overhead: keyboard shortcuts for all actions (1-4 to select quality, S to skip, arrow keys to navigate), minimal visual clutter, large stimulus presentation, and immediate feedback.

Insight
The signal we want is pre-reflective. The preference you report after deliberation is contaminated by your theory of what you should prefer. The preference reported in 800ms is closer to the raw geometry. The interface is designed to capture that fast response: large stimuli, instant transitions, no confirmation dialogs. Stimulus exposure is timed, and response latency is recorded as a data quality signal (unusually fast or slow responses are down-weighted in the posterior update).
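The latency down-weighting might look like the following sketch, where responses inside a plausible band get full weight and the weight decays with log-distance outside it (the band edges and decay rate here are illustrative assumptions, not the production values):

```python
import math

def latency_weight(latency_ms, lo=400.0, hi=3000.0, sharpness=1.5):
    """Down-weight responses with implausible latencies.

    Full weight inside [lo, hi] ms; outside that band the weight
    decays smoothly with log-distance from the nearest edge, so a
    10x-too-fast response is penalized as much as a 10x-too-slow one.
    """
    if lo <= latency_ms <= hi:
        return 1.0
    edge = lo if latency_ms < lo else hi
    d = abs(math.log(latency_ms) - math.log(edge))
    return math.exp(-sharpness * d)
```

The resulting weight would scale each response's contribution to the posterior update rather than discarding it outright.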

Multi-Modal Coverage

The platform supports annotation across modalities—images, video, text, code, websites, audio, documents. Each modality has its own rendering pipeline and interaction patterns. Video uses frame extraction with configurable sampling rates. Audio presents waveform visualization with playback controls. Code uses syntax-highlighted side-by-side diffs. Each modality probes different dimensions of the underlying experiential geometry.

Scale

A single taste profile requires ~140 total responses (10 questions + 30 labels + 100 comparisons). A professional annotator completes a profile in about 12 minutes. The marketplace currently supports concurrent annotation by multiple annotators against the same stimulus set, with inter-annotator agreement tracked per calibration item.

Every profile is a partial observation of one person's preference geometry. At scale, partial observations constrain the underlying structure. A million comparisons across a thousand people reveal the empirical distribution of human preference, and that distribution has structure: it clusters in ways that reflect shared human architecture.

Key Takeaway
The pipeline is designed around one principle: minimize the distance between the annotator's raw preference response and the data that enters the model. Every interface decision, every quality check, every adaptive pair selection serves that goal. The less contamination between experience and data, the better the calibration.