Workspace Layout

A workspace is a directory where experimaestro stores all experiment data, job outputs, and metadata. This page describes the directory structure and explains the purpose of each component.

Directory Structure

WORKSPACE_DIR/
├── .__experimaestro__              # Marker file indicating this is a workspace
├── experiments/                     # Experiment run directories
│   └── {experiment-id}/            # One directory per experiment
│       ├── lock                    # Lock file (prevents concurrent runs)
│       └── {run-id}/               # One directory per run
│           ├── environment.json    # Python environment and git info
│           ├── status.json         # Experiment metadata and services state
│           ├── jobs.jsonl          # Lightweight job information (one per line)
│           ├── jobs/               # Symlinks to job directories
│           ├── results/            # Saved experiment results
│           └── data/               # Serialized configurations
├── jobs/                           # Actual job output directories
│   └── {task-type-id}/
│       └── {job-hash}/
│           ├── params.json         # Job parameters
│           ├── {scriptname}.pid    # Process info (PID, type) while running
│           ├── {scriptname}.done   # Marker file when job succeeds
│           ├── {scriptname}.failed # Marker file when job fails (with reason)
│           ├── {scriptname}.out    # Standard output
│           ├── {scriptname}.err    # Standard error
│           ├── locks.json          # Dynamic dependency locks (tokens)
│           └── .experimaestro/
│               ├── status.json     # Job state, timestamps, progress
│               ├── {scriptname}.lock  # Job lock file
│               ├── task-outputs.jsonl # Dynamic task output events
│               └── events/         # Permanent event storage (after archival)
│                   └── event-{count}.jsonl
├── .events/                        # Temporary event files (watched by scheduler)
│   ├── experiments/
│   │   └── {experiment-id}/
│   │       ├── current             # Symlink to current run directory
│   │       └── events-{count}.jsonl
│   └── jobs/
│       └── {task-type-id}/
│           └── event-{job-id}-{count}.jsonl
├── partials/                       # Shared partial directories
│   └── {task-type-id}/
│       └── {partial-name}/
│           └── {partial-hash}/
├── config/                         # Configuration cache
└── .experimaestro/
    └── experiments/
        ├── events-{count}@{experiment-id}.jsonl  # Event log (active experiments)
        └── {experiment-id}                       # Symlink to current run
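
As a minimal sketch of how this layout can be navigated, the helpers below check for the marker file and list the run directories of one experiment. The names `is_workspace` and `list_runs` are hypothetical and are not part of the experimaestro API; only the paths come from the tree above.

```python
from pathlib import Path

MARKER = ".__experimaestro__"  # marker file name, taken from the tree above


def is_workspace(path: Path) -> bool:
    """Return True if `path` contains the workspace marker file."""
    return (Path(path) / MARKER).exists()


def list_runs(workspace: Path, experiment_id: str) -> list:
    """Return the run IDs of an experiment, sorted by name.

    Only sub-directories are returned, so the `lock` file that sits
    next to the run directories is skipped.
    """
    experiment_dir = Path(workspace) / "experiments" / experiment_id
    return sorted(p.name for p in experiment_dir.iterdir() if p.is_dir())
```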

Run ID Format

Each experiment run is identified by a run ID based on the timestamp when the experiment started:

  • Format: YYYYMMDD_HHMMSS (e.g., 20250108_143022)

  • If multiple runs start within the same second, a suffix is added: 20250108_143022.1, 20250108_143022.2, etc.

This format ensures:

  • Runs are naturally sorted chronologically

  • Each run has a unique identifier

  • Run IDs are human-readable
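
The naming rule can be sketched as follows. This is an illustration of the format described above, not experimaestro's own code; the function name `make_run_id` and the `existing` argument (the run IDs already present for the experiment) are assumptions.

```python
from datetime import datetime


def make_run_id(started: datetime, existing: set) -> str:
    """Build a run ID of the form YYYYMMDD_HHMMSS, adding .1, .2, ...
    when a run with the same timestamp already exists."""
    base = started.strftime("%Y%m%d_%H%M%S")
    if base not in existing:
        return base
    suffix = 1
    while f"{base}.{suffix}" in existing:
        suffix += 1
    return f"{base}.{suffix}"


start = datetime(2025, 1, 8, 14, 30, 22)
print(make_run_id(start, set()))                # 20250108_143022
print(make_run_id(start, {"20250108_143022"}))  # 20250108_143022.1
```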

Experiment Lock

The lock file at experiments/{experiment-id}/lock prevents multiple instances of the same experiment from running simultaneously. When you start an experiment:

  1. Experimaestro acquires an exclusive lock on this file

  2. If another process holds the lock, a warning is displayed with the hostname of the holder (if available)

  3. The process waits until the lock is released

Because only one process at a time can run a given experiment, two runs cannot write to the experiment's state concurrently.
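
The sketch below illustrates the general pattern of an exclusive file lock with a non-blocking first attempt. It uses `fcntl.flock` from the Python standard library (POSIX only) purely as an assumption for illustration; this page does not say which locking mechanism experimaestro uses, and the helper names are hypothetical.

```python
import fcntl
import os


def try_exclusive_lock(path: str):
    """Try to take an exclusive lock on `path` without blocking.

    Returns the open file descriptor on success, or None if another
    holder already has the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def release_lock(fd: int) -> None:
    """Release the lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

A caller following the steps above would warn when `try_exclusive_lock` returns None and then wait, for example by calling `fcntl.flock(fd, fcntl.LOCK_EX)` without `LOCK_NB`, which blocks until the lock is released.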

Environment Information

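The environment.json file records the runtime environment of the run: the Python version, installed package versions, editable packages with their git state, and information about the run itself: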

{
  "python_version": "3.10.12",
  "packages": {
    "experimaestro": "2.0.0",
    "torch": "2.1.0",
    ...
  },
  "editable_packages": {
    "my-project": {
      "version": "0.1.0",
      "path": "/home/user/my-project",
      "git": {
        "branch": "main",
        "commit": "abc123...",
        "dirty": false
      }
    }
  },
  "projects": [...],
  "run": {
    "hostname": "compute-node-01",
    "started_at": "2025-01-08T14:30:22",
    "ended_at": "2025-01-08T15:45:10",
    "status": "completed"
  }
}

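This information supports reproducibility: the recorded package versions and git commits let you rebuild the environment used for any experiment run. Note that an editable package marked dirty had uncommitted changes, so its commit alone does not identify the code that ran.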
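
As an example of using this file, the sketch below lists editable packages whose git checkout was dirty when the run started. The function name is hypothetical; the field names (`editable_packages`, `git`, `dirty`) are the ones shown in the example above, and the sample dictionary is a reduced copy of it.

```python
def dirty_editable_packages(environment: dict) -> list:
    """Return the names of editable packages recorded with "dirty": true."""
    dirty = []
    for name, info in environment.get("editable_packages", {}).items():
        git = info.get("git") or {}  # tolerate packages without git info
        if git.get("dirty", False):
            dirty.append(name)
    return dirty


# Reduced sample with only the fields used here (values from the example above).
# For a real run, load the dictionary with json.load() from environment.json.
sample = {
    "editable_packages": {
        "my-project": {
            "version": "0.1.0",
            "path": "/home/user/my-project",
            "git": {"branch": "main", "commit": "abc123...", "dirty": False},
        }
    }
}
print(dirty_editable_packages(sample))  # prints []
```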

History Cleanup

Experimaestro automatically prunes old experiment runs to limit disk usage. The cleanup behavior is controlled by the history settings:

# In ~/.config/experimaestro/settings.yaml

# Global defaults
history:
  max_done: 5      # Keep last 5 successful runs per experiment
  max_failed: 1    # Keep last 1 failed run per experiment

workspaces:
  - id: my-workspace
    path: ~/experiments
    # Override for this workspace
    history:
      max_done: 10
      max_failed: 2

Cleanup rules:

  • When an experiment succeeds, all previous failed runs are removed

  • Only the most recent max_done successful runs are kept

  • Only the most recent max_failed failed runs are kept

  • Runs with unknown status are never automatically deleted
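
One possible reading of these rules is sketched below. It is not experimaestro's implementation: the function name is hypothetical, the status string "failed" is assumed ("completed" is the value shown in the JSON examples on this page), and the first rule is interpreted as "failed runs older than the most recent successful run are removed".

```python
def runs_to_delete(runs, max_done=5, max_failed=1):
    """Select runs to delete.

    `runs` is a list of (run_id, status) pairs, oldest first.
    Returns the run IDs to delete, oldest first.
    """
    done = [rid for rid, status in runs if status == "completed"]
    # Position of the most recent successful run (-1 if there is none)
    last_done = max(
        (i for i, (_, status) in enumerate(runs) if status == "completed"),
        default=-1,
    )
    # Failed runs that precede a success are removed outright
    stale_failed = [
        rid for i, (rid, s) in enumerate(runs) if s == "failed" and i < last_done
    ]
    # Failed runs after the last success are capped at max_failed
    recent_failed = [
        rid for i, (rid, s) in enumerate(runs) if s == "failed" and i > last_done
    ]

    delete = set(done[: max(0, len(done) - max_done)])
    delete.update(stale_failed)
    delete.update(recent_failed[: max(0, len(recent_failed) - max_failed)])
    # Runs with any other status are never selected
    return [rid for rid, _ in runs if rid in delete]
```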

v1 Experiment Layout

Experimaestro v2 can read experiments created with v1 (the xp/ directory layout). The state provider automatically detects and handles both layouts:

v1 layout (legacy):

WORKSPACE_DIR/
└── xp/
    └── {experiment-id}/
        ├── jobs/           # Symlinks to job directories (current run)
        └── jobs.bak/       # Symlinks to job directories (previous run)

v2 layout (current):

WORKSPACE_DIR/
└── experiments/
    └── {experiment-id}/
        └── {run-id}/
            ├── status.json
            └── jobs.jsonl

Migration (Optional)

If you want to migrate v1 experiments to the v2 layout:

# Preview what will be migrated
experimaestro migrate v1-to-v2 /path/to/workspace --dry-run

# Perform the migration
experimaestro migrate v1-to-v2 /path/to/workspace

# Keep remaining files (renamed to xp_MIGRATED_TO_V2)
experimaestro migrate v1-to-v2 /path/to/workspace --keep-old

The migration:

  • Moves each experiment from xp/{exp-id}/ to experiments/{exp-id}/{run-id}/

  • Generates a run ID based on the directory’s modification time

  • Removes the empty xp/ directory

  • Creates a broken symlink xp -> /experimaestro_v2_migrated_workspace_do_not_use_v1

Note

Migration is optional. The TUI, web UI, and CLI commands work with both layouts. However, new experiments always use the v2 layout.

Job Execution

When the scheduler starts a job, several files are created and used to coordinate execution and track state. The {scriptname} is derived from the task identifier: it is the component after the last . (e.g., MyTask from my.module.MyTask).
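
The derivation of {scriptname} and the resulting file names can be sketched as below. The helper names are hypothetical; the file names are the ones listed in the directory structure at the top of this page.

```python
from pathlib import Path


def script_name(task_id: str) -> str:
    """Return the component after the last '.' of a task identifier."""
    return task_id.rsplit(".", 1)[-1]


def job_files(job_dir: Path, task_id: str) -> dict:
    """Map a short label to each per-job file described on this page."""
    name = script_name(task_id)
    return {
        "pid": job_dir / f"{name}.pid",
        "done": job_dir / f"{name}.done",
        "failed": job_dir / f"{name}.failed",
        "stdout": job_dir / f"{name}.out",
        "stderr": job_dir / f"{name}.err",
        "lock": job_dir / ".experimaestro" / f"{name}.lock",
    }


print(script_name("my.module.MyTask"))  # MyTask
```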

Locking

The lock file at jobs/{task-id}/{job-hash}/.experimaestro/{scriptname}.lock ensures exclusive access to a job. Both the scheduler and the job process use this lock at different phases:

  1. Scheduler lock phase: The scheduler acquires the lock before setting up the job directory, writing status.json, launching the process, and writing the PID file. The lock is released after the process is launched.

  2. Process lock phase: The job process acquires the same lock when it starts executing the task. It holds the lock until the task completes and the terminal marker (.done/.failed) is written.

There is a brief gap between these two phases where the lock is not held but the job is still active.

PID File

The file {scriptname}.pid is written by the scheduler (inside aio_run()) while it still holds the job lock. It contains a JSON object describing the process:

{"type": "local", "pid": 12345}

The type field identifies the process handler (e.g., local, ssh, slurm). Process liveness is checked using the launcher-independent Process abstraction (Process.fromDefinition()), not directly via psutil, so that it works across different launchers.
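
For inspection purposes, a PID file can be read with plain JSON parsing, as in the sketch below. The function name is hypothetical, and the sketch deliberately stops at parsing: checking liveness should go through the Process abstraction mentioned above, whose exact signature is not documented on this page.

```python
import json
from pathlib import Path


def read_pid_file(path: Path):
    """Return the process definition stored in a PID file.

    Returns None if the file does not exist or does not contain valid JSON
    (for instance because it is still being written).
    """
    try:
        return json.loads(Path(path).read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return None
```

For the example above, `read_pid_file(...)` would return a dictionary whose "type" is "local" and whose "pid" is 12345.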

Terminal Markers

When a job finishes, the job process writes one of:

  • {scriptname}.done — job succeeded

  • {scriptname}.failed — job failed, contains a JSON object with failure details (e.g., {"reason": "FAILED"})

The job process also writes a final status.json with updated timestamps.

If the job is killed externally (e.g., SLURM scancel, OOM killer), these markers are not written. In that case, cleanup is handled by the scheduler (if still running) or by a later experimaestro process.
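
A minimal sketch of how the terminal markers could be interpreted when inspecting a job directory by hand is shown below. The function name and the returned labels are assumptions made for this example; only the file names and the "reason" key come from the description above.

```python
import json
from pathlib import Path


def terminal_state(job_dir: Path, name: str):
    """Return (state, reason) for a job directory.

    state is "done", "failed", or None when no terminal marker exists
    (job still running, or killed before it could write a marker).
    """
    job_dir = Path(job_dir)
    if (job_dir / f"{name}.done").exists():
        return ("done", None)
    failed = job_dir / f"{name}.failed"
    if failed.exists():
        try:
            details = json.loads(failed.read_text())
        except json.JSONDecodeError:
            details = {}
        reason = details.get("reason") if isinstance(details, dict) else None
        return ("failed", reason)
    return (None, None)
```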

Event Files

While a job is running, state change events are written to temporary event files in .events/jobs/{task-id}/event-{job-id}-{count}.jsonl. The scheduler watches this directory to track job progress in real time.

When a job completes, these temporary event files are archived to the permanent location at jobs/{task-id}/{job-hash}/.experimaestro/events/ and then deleted from .events/.

Important

The cleanup process that consolidates orphaned event files checks that a job is not active before deleting its event files. A job is considered active if:

  • Its lock is held, OR

  • Its PID file references a running process, OR

  • No terminal marker (.done/.failed) exists
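
The three conditions combine as a simple disjunction, sketched below with the individual checks passed in as booleans (how each check is performed is left out, and the function name is hypothetical). The second condition covers the brief gap between the scheduler and process lock phases, when the lock is free but the process is already running.

```python
def is_job_active(lock_held: bool, process_running: bool, has_terminal_marker: bool) -> bool:
    """A job is treated as active if any of the three conditions holds."""
    return lock_held or process_running or not has_terminal_marker


# Gap between the two lock phases: lock free, process running, no marker yet
print(is_job_active(False, True, False))   # True
# Finished job: lock free, process gone, marker written
print(is_job_active(False, False, True))   # False
```

Under this rule, event files are only eligible for deletion when the lock is free, no process is running, and a terminal marker exists.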

State Tracking

Experimaestro uses a filesystem-based state tracking system instead of a database. This approach is more robust on network filesystems (NFS) and easier to inspect.

Status File (status.json)

Each experiment run has a status.json file containing the experiment metadata and service state:

{
  "version": 1,
  "experiment_id": "my-experiment",
  "run_id": "20250108_143022",
  "events_count": 42,
  "hostname": "compute-node-01",
  "started_at": "2025-01-08T14:30:22.123456",
  "ended_at": "2025-01-08T15:45:10.654321",
  "status": "completed",
  "finished_jobs": 10,
  "failed_jobs": 1,
  "services": {
    "<service_id>": {
      "service_id": "...",
      "description": "...",
      "class": "mypackage.services.MyService",
      "state_dict": {}
    }
  }
}

Note

Job details are stored separately in jobs.jsonl rather than in status.json. This reduces memory usage and allows for efficient streaming of job information.

Jobs File (jobs.jsonl)

Lightweight job information is stored in a separate JSONL file (one JSON object per line):

{"job_id": "abc123", "task_id": "my.task.Train", "tags": {"experiment": "v1"}, "timestamp": 1736343025.0}
{"job_id": "def456", "task_id": "my.task.Evaluate", "tags": {}, "timestamp": 1736343030.5}

Each record contains:

  • job_id: Unique job identifier

  • task_id: Task type identifier

  • tags: Dictionary of job tags

  • timestamp: When the job was submitted (Unix timestamp)
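
Because each line is an independent JSON object, the file can be processed one record at a time without loading it whole. The sketch below shows this with a hypothetical `iter_jobs` helper; the sample lines are the two records shown above.

```python
import json
from collections import Counter


def iter_jobs(lines):
    """Yield one job record per non-empty line of a jobs.jsonl stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


sample_lines = [
    '{"job_id": "abc123", "task_id": "my.task.Train", "tags": {"experiment": "v1"}, "timestamp": 1736343025.0}',
    '{"job_id": "def456", "task_id": "my.task.Evaluate", "tags": {}, "timestamp": 1736343030.5}',
]

# Count jobs per task type; with a real file, pass the open file object instead
per_task = Counter(job["task_id"] for job in iter_jobs(sample_lines))
print(per_task["my.task.Train"])  # 1
```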

Event Log

While an experiment is running, events are streamed to a JSONL file at .events/experiments/events-{count}@{experiment-id}.jsonl:

{"type": "job_submitted", "job_id": "abc123", "task_id": "my.task", "timestamp": 1736343025.0}
{"type": "job_state_changed", "job_id": "abc123", "state": "running", "timestamp": 1736343030.5}
{"type": "service_added", "service_id": "tensorboard", "description": "TensorBoard", "timestamp": 1736343035.0}

When the experiment completes, events are consolidated into status.json and the event log is cleaned up.
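
To illustrate how such a log can be folded into a current state, the sketch below replays the three event types shown above. It is an assumption-laden example, not the consolidation code used by experimaestro: the function name and the shape of the returned dictionaries are made up for this page, and other event types are simply ignored.

```python
import json


def replay(lines):
    """Fold event lines into (jobs, services) dictionaries."""
    jobs = {}      # job_id -> {"task_id": ..., "state": ...}
    services = {}  # service_id -> description
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        kind = event.get("type")
        if kind == "job_submitted":
            jobs[event["job_id"]] = {"task_id": event["task_id"], "state": None}
        elif kind == "job_state_changed":
            job = jobs.setdefault(event["job_id"], {"task_id": None, "state": None})
            job["state"] = event["state"]
        elif kind == "service_added":
            services[event["service_id"]] = event["description"]
    return jobs, services
```

Replaying the three sample lines above would yield one job, abc123, in state "running", and one service, tensorboard.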