ClearML vs. SageMaker: What's the Difference?

AWS SageMaker is a managed MLOps platform that abstracts infrastructure but locks you into AWS APIs and per-instance billing. ClearML is an open-source alternative you self-host on any cloud or on-premises hardware. Self-hosting removes SageMaker's per-instance charges (you pay only for the hardware you run on) and the dependency on AWS APIs. Once a training script calls Task.init, ClearML automatically captures metrics, hyperparameters, and artifacts without further changes to the code.

ClearML maps directly to SageMaker components:

AWS SageMaker               ClearML Equivalent
SageMaker Studio            ClearML Web UI
SageMaker Experiments       ClearML Experiment Manager
SageMaker Training Jobs     ClearML Agent + Tasks
SageMaker Pipelines         ClearML Pipelines
SageMaker Model Registry    ClearML Model Repository
SageMaker Endpoints         ClearML Serving (Triton)
CloudWatch Metrics          ClearML Scalars/Plots

Prerequisites

You need a Linux server (Ubuntu recommended) with sudo access, Docker and Docker Compose installed, and DNS A records pointing to your server's IP for three subdomains: app.clearml.example.com, api.clearml.example.com, and files.clearml.example.com. GPU workloads require the NVIDIA Container Toolkit on agent machines.
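
Before deploying, it can help to confirm that the three A records actually resolve to the server. The following is a minimal sketch rather than part of the ClearML tooling; the function name find_dns_mismatches and the example IP 203.0.113.10 are assumptions made for illustration.

```python
import socket
from typing import Callable, Dict

# The three subdomains named in the prerequisites.
SUBDOMAINS = ("app", "api", "files")


def find_dns_mismatches(
    base_domain: str,
    expected_ip: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, str]:
    """Return {hostname: problem} for each subdomain that does not resolve to expected_ip."""
    problems: Dict[str, str] = {}
    for sub in SUBDOMAINS:
        host = f"{sub}.{base_domain}"
        try:
            actual = resolve(host)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            problems[host] = f"lookup failed: {exc}"
            continue
        if actual != expected_ip:
            problems[host] = f"resolves to {actual}"
    return problems


# Example call (hypothetical IP; substitute your server's address):
#   find_dns_mismatches("clearml.example.com", "203.0.113.10")
# An empty dict means all three records point at the server.
```

The resolver is passed in as a parameter so the check can be exercised without network access; by default it uses the system resolver.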

Deploy ClearML Server with Docker Compose

ClearML Server runs as a multi-container stack with Elasticsearch, MongoDB, and Redis. First, increase the virtual memory limit for Elasticsearch:

echo "vm.max_map_count=524288" | sudo tee /etc/sysctl.d/99-clearml.conf
sudo sysctl --system
sudo systemctl restart docker
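
To confirm the kernel picked up the new limit, a short check along these lines may be useful. It is a sketch that assumes a Linux host exposing the setting at /proc/sys/vm/max_map_count; the function name max_map_count_ok is made up for this example.

```python
from pathlib import Path

REQUIRED_MAX_MAP_COUNT = 524288  # the value written by the sysctl step


def max_map_count_ok(
    proc_path: str = "/proc/sys/vm/max_map_count",
    required: int = REQUIRED_MAX_MAP_COUNT,
) -> bool:
    """Return True if the kernel's current vm.max_map_count is at least `required`."""
    current = int(Path(proc_path).read_text().strip())
    return current >= required


# Example call on the server:
#   max_map_count_ok()
```

If this returns False after running sysctl --system, the configuration file was probably not loaded.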

Create persistent storage directories:

sudo mkdir -p /opt/clearml/{data/elastic_7,data/mongo_4/db,data/mongo_4/configdb,data/redis,data/fileserver,logs,config}
sudo chown -R 1000:1000 /opt/clearml

Download the official Docker Compose file:

curl -fsSL https://raw.githubusercontent.com/clearml/clearml-server/master/docker/docker-compose.yml -o docker-compose.yml

Edit docker-compose.yml to comment out direct port mappings for apiserver, webserver, and fileserver (Traefik will handle routing). Update the networks block to use named bridge networks:

networks:
  backend:
    name: clearml_backend
    driver: bridge
  frontend:
    name: clearml_frontend
    driver: bridge

Create a .env file next to docker-compose.yml for the service URLs (replace clearml.example.com with your domain); Docker Compose reads it automatically:

CLEARML_WEB_HOST=https://app.clearml.example.com
CLEARML_API_HOST=https://api.clearml.example.com
CLEARML_FILES_HOST=https://files.clearml.example.com
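
Since the three URLs differ only in their subdomain, the environment file can be generated from the base domain to avoid typos. This is an illustrative sketch; render_clearml_env is a hypothetical helper, not a ClearML utility.

```python
def render_clearml_env(base_domain: str) -> str:
    """Build the environment file contents for the app/api/files subdomains."""
    hosts = {
        "CLEARML_WEB_HOST": "app",
        "CLEARML_API_HOST": "api",
        "CLEARML_FILES_HOST": "files",
    }
    lines = [f"{var}=https://{sub}.{base_domain}" for var, sub in hosts.items()]
    return "\n".join(lines) + "\n"


# Example call:
#   print(render_clearml_env("clearml.example.com"), end="")
```

Writing the returned string to the environment file reproduces the three lines shown above for whatever domain you pass in.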

Start the services:

docker compose up -d

Verify that all containers are running (for example, with docker ps): clearml-webserver, clearml-apiserver, clearml-fileserver, clearml-mongo, clearml-elastic, and clearml-redis.
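
The container check can also be scripted. The sketch below assumes the Docker CLI is on PATH and that the container names match the list above; both helper function names are made up for this example.

```python
import subprocess
from typing import Iterable, List

EXPECTED_CONTAINERS = (
    "clearml-webserver",
    "clearml-apiserver",
    "clearml-fileserver",
    "clearml-mongo",
    "clearml-elastic",
    "clearml-redis",
)


def missing_containers(running: Iterable[str]) -> List[str]:
    """Return the expected ClearML containers that are absent from `running`."""
    running_set = set(running)
    return [name for name in EXPECTED_CONTAINERS if name not in running_set]


def list_running_containers() -> List[str]:
    """Ask the Docker CLI for the names of running containers."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


# Example call on the server:
#   missing_containers(list_running_containers())
```

An empty list means every expected container is up; any names returned are worth inspecting with docker logs.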

Configure Traefik Reverse Proxy

Traefik routes HTTPS traffic to ClearML services using subdomain-based routing with automatic Let's Encrypt certificates.

Create the Traefik directory and set up Let's Encrypt storage:

mkdir -p ~/clearml/traefik/letsencrypt
cd ~/clearml/traefik
touch letsencrypt/acme.json
chmod 600 letsencrypt/acme.json

Create a .env file with your email:

LETSENCRYPT_EMAIL=admin@example.com

Create docker-compose.yml for Traefik:

services:
  traefik:
    image: traefik:v3.6
    container_name: traefik
    command:
      - "--log.level=INFO"
      - "--providers.file.filename=/etc/traefik/dynamic_conf.yml"
      - "--entryPoints.web.address=:80"
      - "--entryPoints.websecure.address=:443"
      - "--entryPoints.web.http.redirections.entrypoint.to=websecure"
      - "--certificatesResolvers.le.acme.httpChallenge.entryPoint=web"
      - "--certificatesResolvers.le.acme.email=${LETSENCRYPT_EMAIL}"
      - "--certificatesResolvers.le.acme.storage=/letsencrypt/acme.json"
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - "./letsencrypt:/letsencrypt"
      - "./dynamic_conf.yml:/etc/traefik/dynamic_conf.yml:ro"
    networks:
      - clearml-frontend
    restart: unless-stopped

networks:
  clearml-frontend:
    name: clearml_frontend
    external: true

Create dynamic_conf.yml with routing rules (replace clearml.example.com):

http:
  routers:
    clearml-web:
      rule: "Host(`app.clearml.example.com`)"
      entryPoints:
        - websecure
      service: clearml-web
      tls:
        certResolver: le
    clearml-api:
      rule: "Host(`api.clearml.example.com`)"
      entryPoints:
        - websecure
      service: clearml-api
      tls:
        certResolver: le
    clearml-files:
      rule: "Host(`files.clearml.example.com`)"
      entryPoints:
        - websecure
      service: clearml-files
      tls:
        certResolver: le
  services:
    clearml-web:
      loadBalancer:
        servers:
          - url: "http://clearml-webserver:80"
    clearml-api:
      loadBalancer:
        servers:
          - url: "http://clearml-apiserver:8008"
    clearml-files:
      loadBalancer:
        servers:
          - url: "http://clearml-fileserver:8081"

Start Traefik:

docker compose up -d

Verify certificates:

docker logs traefik 2>&1 | grep -i certificate
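
Log output shows that a certificate was requested, but not what each subdomain is actually serving. The following sketch uses only Python's standard ssl and socket modules to read a certificate's expiry date; the function names are assumptions for illustration, and fetch_not_after needs network access to the host.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days remaining before a certificate's notAfter timestamp.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a verified TLS connection and return the certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Example call once Traefik is serving traffic:
#   days_until_expiry(fetch_not_after("app.clearml.example.com"))
```

Because the default context verifies the chain and hostname, a handshake that succeeds already indicates that a trusted certificate is in place for that subdomain.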

Configure ClearML Server

Open https://app.clearml.example.com in a browser. Create the admin account (username, company name). Navigate to Settings > Workspace > Create new credentials to generate API credentials. Save the credentials block:

api {
  web_server: https://app.clearml.example.com
  api_server: https://api.clearml.example.com
  files_server: https://files.clearml.example.com
  credentials {
    "access_key" = "YOUR_ACCESS_KEY"
    "secret_key" = "YOUR_SECRET_KEY"
  }
}
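
On machines where pasting a configuration block is inconvenient (CI runners, containers), the same settings can be supplied through ClearML's environment variables. The sketch below builds that mapping; the helper name clearml_env is hypothetical, and the variables must be set before the SDK is initialized.

```python
from typing import Dict


def clearml_env(base_domain: str, access_key: str, secret_key: str) -> Dict[str, str]:
    """Map the credentials block onto ClearML's environment variables."""
    return {
        "CLEARML_WEB_HOST": f"https://app.{base_domain}",
        "CLEARML_API_HOST": f"https://api.{base_domain}",
        "CLEARML_FILES_HOST": f"https://files.{base_domain}",
        "CLEARML_API_ACCESS_KEY": access_key,
        "CLEARML_API_SECRET_KEY": secret_key,
    }


# Example use (placeholder keys, as in the block above):
#   import os
#   os.environ.update(
#       clearml_env("clearml.example.com", "YOUR_ACCESS_KEY", "YOUR_SECRET_KEY")
#   )
```

Keeping the keys in environment variables or a secrets manager also avoids committing them to a repository.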

Deploy ClearML Agents

ClearML Agent turns any machine into a remote worker. Install on the same server or a dedicated GPU instance:

mkdir -p ~/clearml-agent && cd ~/clearml-agent
sudo apt install -y python3-venv
python3 -m venv clearml_venv
source clearml_venv/bin/activate
pip install clearml-agent

Initialize the agent with clearml-agent init and paste the credentials block. Start the agent on the default queue:

clearml-agent daemon --queue default --detached

For GPU workloads, specify GPU indices:

clearml-agent daemon --gpus 0,1 --queue default --detached

Verify the agent appears in the ClearML Web UI under Workers & Queues.
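
Once a worker is listening on a queue, a training script can hand itself over to that worker instead of running locally. The following is a minimal sketch of that handoff, assuming task is the object returned by Task.init in the ClearML SDK; send_to_queue is a hypothetical wrapper name.

```python
from typing import Any


def send_to_queue(task: Any, queue_name: str = "default") -> None:
    """Stop local execution and enqueue the task for a ClearML Agent.

    `task` is assumed to be the object returned by clearml.Task.init().
    """
    # exit_process=True ends the local run once the task is enqueued,
    # so the rest of the script executes on the agent instead.
    task.execute_remotely(queue_name=queue_name, exit_process=True)


# Example use inside a training script:
#   from clearml import Task
#   task = Task.init(project_name="Tutorial", task_name="Remote run")
#   send_to_queue(task)  # code after this line runs on the agent
```

This is roughly the counterpart of submitting a SageMaker Training Job: the agent pulls the task from the queue, recreates the environment, and runs the script.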

Install ClearML SDK and Run an Experiment

In the same virtual environment, install the SDK:

pip install clearml scikit-learn joblib pandas

Configure the SDK with clearml-init (paste the credentials block when prompted). Create an experiment script named 01_first_experiment.py:

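import joblib
from clearml import Task
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Declare packages for the agent to install; must be called before Task.init.
Task.add_requirements('scikit-learn')
Task.add_requirements('joblib')

# Register this run with the ClearML server.
task = Task.init(project_name='Tutorial', task_name='Random Forest Iris')

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Log hyperparameters; use the returned dict so values edited in the UI apply on remote runs.
params = {'n_estimators': 100, 'max_depth': 3}
params = task.connect(params)

clf = RandomForestClassifier(**params)
clf.fit(X_train, y_train)

# Evaluate on the held-out split; the printed value appears in the task's console log.
accuracy = clf.score(X_test, y_test)
print(f'Accuracy: {accuracy}')

# Save the model locally and attach the file to the task as an artifact.
joblib.dump(clf, 'model.pkl')
task.upload_artifact('model', 'model.pkl')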

Run the script:

python 01_first_experiment.py

View the experiment in the ClearML Web UI under Projects > Tutorial.
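
The script prints the accuracy, so the value lands in the console log rather than among the task's scalars. One way to record it as a metric is sketched below, assuming task is the object returned by Task.init; report_accuracy is a hypothetical helper name, while get_logger and report_single_value are methods of the ClearML SDK.

```python
from typing import Any


def report_accuracy(task: Any, accuracy: float) -> None:
    """Record accuracy as a single-value scalar on the given ClearML task.

    `task` is assumed to be the object returned by clearml.Task.init().
    """
    task.get_logger().report_single_value(name="accuracy", value=float(accuracy))


# Example use at the end of the experiment script:
#   report_accuracy(task, accuracy)
```

Reported this way, the value can be compared across runs in the Web UI, which is the role CloudWatch Metrics plays on the SageMaker side.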

Why It Matters

ClearML gives you a self-hosted MLOps stack that covers the SageMaker components mapped above, without vendor lock-in or SageMaker's per-instance billing. It runs on any cloud or on-premises, and it captures experiment metadata from existing code once Task.init is added. This tutorial walked through a complete deployment, from server setup to running your first tracked experiment.