ML Training

Reward Model Training

Train reward models directly from human preference data collected on the platform. Configure training runs, evaluate model quality against held-out sets, and deploy models for RLHF alignment pipelines.

Key Capabilities

Preference Data Collection

Collect pairwise comparison data optimized for reward model training. The platform structures comparisons to maximize signal, handles tie-breaking, and ensures balanced coverage across your content space.
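As a sketch of what one submitted comparison might look like, the request below posts a single pairwise result. The `/v1/comparisons` path and every field name in the payload are illustrative assumptions, not a documented schema:

```shell
# Hypothetical example: submit one pairwise comparison result.
# The endpoint path and field names are assumptions for illustration.
curl -X POST https://api.commandagi.com/v1/comparisons \
  -H "Authorization: Bearer $COMMANDAGI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "profile_id": "prof_main",
    "item_a": "item_f3q9",
    "item_b": "item_c7x2",
    "winner": "item_a",
    "tie": false,
    "annotator_id": "ann_042"
  }'
```

Recording ties explicitly (rather than discarding them) is what lets the platform handle tie-breaking and keep coverage balanced.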

Model Training

Launch reward model training runs with configurable base models, hyperparameters, and data filters. Training executes on managed A100 GPUs with automatic checkpointing and early stopping.

Evaluation

Every trained model is evaluated against held-out gold sets and cross-validated with inter-annotator agreement data. View accuracy curves, calibration plots, and per-dimension breakdowns in the dashboard.
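If you want those metrics programmatically rather than in the dashboard, a retrieval call might look like the following. The `/eval` path and the response fields shown are assumptions for illustration, not a documented response shape:

```shell
# Hypothetical example: fetch evaluation results for a finished run.
# The path suffix and response fields are assumptions for illustration.
curl https://api.commandagi.com/v1/training/reward-model/run_rm_8k2x/eval \
  -H "Authorization: Bearer $COMMANDAGI_API_KEY"

# Illustrative response shape (field names are assumptions):
# {
#   "gold_accuracy": 0.87,
#   "annotator_agreement_corr": 0.91,
#   "calibration_ece": 0.03
# }
```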

Deployment

Deploy trained reward models to a managed inference endpoint with sub-50ms latency. Use the scoring API to evaluate new content in real time or batch-score datasets for downstream RLHF training.
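A real-time scoring request against a deployed model might look like the sketch below. The model identifier, the `/score` path, and the payload field are assumptions for illustration:

```shell
# Hypothetical example: score a single piece of content in real time.
# The model ID, path, and payload field are assumptions for illustration.
curl -X POST https://api.commandagi.com/v1/models/rm_prod/score \
  -H "Authorization: Bearer $COMMANDAGI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "text": "Draft response to evaluate..." }'
```

For batch scoring of a full dataset, the same model would typically be invoked through an asynchronous job rather than one request per item.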

Usage

# Configure and launch a reward model training run
curl -X POST https://api.commandagi.com/v1/training/reward-model \
  -H "Authorization: Bearer $COMMANDAGI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "profile_id": "prof_main",
    "base_model": "meta-llama/Llama-3-8B",
    "data": {
      "comparison_min": 5000,
      "val_split": 0.1,
      "gold_filter": true
    },
    "hyperparams": {
      "learning_rate": 1e-5,
      "epochs": 3,
      "batch_size": 32
    },
    "compute": { "gpu": "A100-40GB", "max_hours": 4 }
  }'

# Response
# {
#   "run_id": "run_rm_8k2x",
#   "status": "queued",
#   "estimated_start": "2025-01-15T10:30:00Z"
# }
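Since the launch request returns `"status": "queued"`, you would poll the run until training completes before evaluating or deploying it. The GET path below mirrors the POST endpoint above and is an assumption for illustration, as are the status values:

```shell
# Hypothetical example: poll the training run until it completes.
# The GET path and the status values (queued -> running -> completed)
# are assumptions for illustration.
curl https://api.commandagi.com/v1/training/reward-model/run_rm_8k2x \
  -H "Authorization: Bearer $COMMANDAGI_API_KEY"
```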

Ready to get started?

Create your first preference profile in minutes.