ML Training
Reward Model Training
Train reward models directly from human preference data collected on the platform. Configure training runs, evaluate model quality against held-out sets, and deploy models for RLHF alignment pipelines.
Key Capabilities
Preference Data Collection
Collect pairwise comparison data optimized for reward model training. The platform structures comparisons to maximize the preference signal gained per annotation, handles tie-breaking, and ensures balanced coverage across your content space.
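Each collected comparison ultimately reduces to a record pairing two responses with a preference label, and ties carry no pairwise ranking signal. A minimal sketch of such a record and its conversion into a training pair (the field names here are illustrative, not the platform's actual export schema):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Comparison:
    """One pairwise preference judgment (illustrative schema, not the real export format)."""
    prompt: str
    response_a: str
    response_b: str
    preferred: Optional[str]  # "a", "b", or None for a tie

def to_training_pair(c: Comparison) -> Optional[Tuple[str, str]]:
    """Convert a comparison into a (chosen, rejected) pair; ties are dropped."""
    if c.preferred == "a":
        return (c.response_a, c.response_b)
    if c.preferred == "b":
        return (c.response_b, c.response_a)
    return None  # tie: contributes no ranking signal

# A tie contributes nothing to the pairwise training set.
pairs = [
    to_training_pair(Comparison("Q", "good", "bad", "a")),
    to_training_pair(Comparison("Q", "meh", "meh", None)),
]
usable = [p for p in pairs if p is not None]
```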
Model Training
Launch reward model training runs with configurable base models, hyperparameters, and data filters. Training executes on managed A100 GPUs with automatic checkpointing and early stopping.
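Reward models of this kind are commonly trained with a pairwise Bradley-Terry objective: the model should score the chosen response above the rejected one. A sketch of that loss in pure Python (this illustrates the standard technique, not the platform's actual training code):

```python
import math

def pairwise_loss(chosen_score: float, rejected_score: float) -> float:
    """Bradley-Terry loss: -log(sigmoid(chosen - rejected))."""
    margin = chosen_score - rejected_score
    return math.log(1.0 + math.exp(-margin))

# A confidently correct ranking gives a small loss; an indifferent model
# (equal scores) gives log(2) ≈ 0.693, and losses grow as rankings invert.
```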
Evaluation
Every trained model is evaluated against held-out gold sets and cross-validated with inter-annotator agreement data. View accuracy curves, calibration plots, and per-dimension breakdowns in the dashboard.
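The headline number behind those accuracy curves is pairwise accuracy on held-out comparisons: how often the model scores the human-preferred response above the rejected one. A sketch of the computation (the scores below are hypothetical):

```python
def pairwise_accuracy(pairs):
    """Fraction of (chosen_score, rejected_score) pairs ranked correctly.

    A pair counts as correct only if the chosen response strictly
    outscores the rejected one; exact ties count as incorrect.
    """
    if not pairs:
        return 0.0
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)

held_out = [(0.9, 0.2), (0.4, 0.7), (1.3, 1.1), (0.5, 0.5)]
acc = pairwise_accuracy(held_out)  # 2 of 4 pairs ranked correctly → 0.5
```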
Deployment
Deploy trained reward models to a managed inference endpoint with sub-50ms latency. Use the scoring API to evaluate new content in real time or batch-score datasets for downstream RLHF training.
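When batch-scoring a large dataset, it is usually worth chunking items client-side and sending one request per batch. A sketch of the chunking step (the batch size of 64 is an assumption for illustration, not a documented API limit):

```python
def chunks(items, size=64):
    """Yield consecutive batches of at most `size` items for the scoring API."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# 200 items at a batch size of 64 → 4 requests: 64 + 64 + 64 + 8.
batches = list(chunks(list(range(200)), size=64))
```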
Usage
curl -X POST https://api.commandagi.com/v1/training/reward-model \
  -H "Authorization: Bearer $COMMANDAGI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "profile_id": "prof_main",
    "base_model": "meta-llama/Llama-3-8B",
    "data": {
      "comparison_min": 5000,
      "val_split": 0.1,
      "gold_filter": true
    },
    "hyperparams": {
      "learning_rate": 1e-5,
      "epochs": 3,
      "batch_size": 32
    },
    "compute": { "gpu": "A100-40GB", "max_hours": 4 }
  }'
# Response
# {
# "run_id": "run_rm_8k2x",
# "status": "queued",
# "estimated_start": "2025-01-15T10:30:00Z"
# }