End-to-end model training

ML Training Pipeline

Train reward models; run SFT, DPO, and PPO pipelines; and evaluate the results -- all from preference data collected on the platform. Managed GPU infrastructure means no ops overhead.

Key Capabilities

Reward Model Training

Train reward models directly from pairwise comparison data. The platform handles data preprocessing, train/val splits, checkpoint selection, and evaluation against held-out gold sets.

SFT, DPO & PPO

Run supervised fine-tuning, direct preference optimization, or proximal policy optimization pipelines. Configure hyperparameters through the API or dashboard and monitor loss curves in real time.
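As a sketch, starting a DPO run through the API might mirror the reward-model example shown under Usage below. The `/v1/training/dpo` endpoint and the `hyperparameters` fields here are assumptions for illustration, not confirmed API:

```shell
# Hypothetical endpoint and hyperparameter fields, mirroring the
# reward-model training request shape used elsewhere in these docs
curl -X POST https://api.commandagi.com/v1/training/dpo \
  -H "Authorization: Bearer $COMMANDAGI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "profile_id": "prof_main",
    "base_model": "meta-llama/Llama-3-8B",
    "hyperparameters": {
      "beta": 0.1,
      "learning_rate": 5e-7,
      "epochs": 1
    },
    "compute": {
      "gpu": "H100-80GB",
      "max_hours": 8
    }
  }'
```

Loss curves for the run would then be visible in the dashboard as described above.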

Managed GPU Infrastructure

Training runs execute on A100 and H100 GPUs provisioned on demand. No cloud accounts to manage. Jobs queue automatically, and GPUs are released when runs complete to minimize cost.

Evaluation & Versioning

Every training run produces versioned model artifacts with evaluation metrics. Compare runs side-by-side, promote the best model to production, and roll back if alignment regresses.
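A promotion call might look like the following sketch. The `/v1/models/{id}/promote` endpoint, the model ID, and the request fields are all assumptions for illustration; check the API reference for the real shape:

```shell
# Hypothetical promotion request: endpoint, model ID, and body fields
# are assumed, not taken from these docs
curl -X POST https://api.commandagi.com/v1/models/rm_v7/promote \
  -H "Authorization: Bearer $COMMANDAGI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"environment": "production"}'
```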

Usage

Start a reward model training run
curl -X POST https://api.commandagi.com/v1/training/reward-model \
  -H "Authorization: Bearer $COMMANDAGI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "profile_id": "prof_main",
    "base_model": "meta-llama/Llama-3-8B",
    "data": {
      "comparison_min": 5000,
      "val_split": 0.1
    },
    "compute": {
      "gpu": "A100-40GB",
      "max_hours": 4
    },
    "callbacks": {
      "on_complete": "https://your-app.com/hooks/training-done"
    }
  }'
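When the run finishes, the `on_complete` callback URL receives a JSON payload. The payload schema is not shown in these docs; the sketch below assumes a plausible shape (`run_id`, `status`, `artifact.model_id`) purely to illustrate what a webhook handler would parse:

```shell
# Example payload with an ASSUMED shape; parse the real schema from the API docs
payload='{"run_id": "run_123", "status": "succeeded", "artifact": {"model_id": "rm_v7"}}'

# Extract the fields a handler would typically act on
# (python3 is used here for portable JSON parsing)
status=$(printf '%s' "$payload" | python3 -c 'import sys, json; print(json.load(sys.stdin)["status"])')
model_id=$(printf '%s' "$payload" | python3 -c 'import sys, json; print(json.load(sys.stdin)["artifact"]["model_id"])')
echo "$status $model_id"
```

A handler would branch on `status` and, on success, fetch or promote the model identified by `model_id`.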

Ready to get started?

Create your first preference profile in minutes.