Workspace Layout
A workspace is a directory where experimaestro stores all experiment data, job outputs, and metadata. This page describes the directory structure and explains the purpose of each component.
Directory Structure
WORKSPACE_DIR/
├── .__experimaestro__ # Marker file indicating this is a workspace
├── experiments/ # Experiment run directories
│ └── {experiment-id}/ # One directory per experiment
│ ├── lock # Lock file (prevents concurrent runs)
│ └── {run-id}/ # One directory per run
│ ├── environment.json # Python environment and git info
│ ├── status.json # Experiment metadata and services state
│ ├── jobs.jsonl # Lightweight job information (one per line)
│ ├── jobs/ # Symlinks to job directories
│ ├── results/ # Saved experiment results
│ └── data/ # Serialized configurations
├── jobs/ # Actual job output directories
│ └── {task-type-id}/
│ └── {job-hash}/
│ ├── params.json # Job parameters
│ ├── {scriptname}.pid # Process info (PID, type) while running
│ ├── {scriptname}.done # Marker file when job succeeds
│ ├── {scriptname}.failed # Marker file when job fails (with reason)
│ ├── {scriptname}.out # Standard output
│ ├── {scriptname}.err # Standard error
│ ├── locks.json # Dynamic dependency locks (tokens)
│ └── .experimaestro/
│ ├── status.json # Job state, timestamps, progress
│ ├── {scriptname}.lock # Job lock file
│ ├── task-outputs.jsonl # Dynamic task output events
│ └── events/ # Permanent event storage (after archival)
│ └── event-{count}.jsonl
├── .events/ # Temporary event files (watched by scheduler)
│ ├── experiments/
│ │ └── {experiment-id}/
│ │ ├── current # Symlink to current run directory
│ │ └── events-{count}.jsonl
│ └── jobs/
│ └── {task-type-id}/
│ └── event-{job-id}-{count}.jsonl
├── partials/ # Shared partial directories
│ └── {task-type-id}/
│ └── {partial-name}/
│ └── {partial-hash}/
├── config/ # Configuration cache
└── .experimaestro/
└── experiments/
├── events-{count}@{experiment-id}.jsonl # Event log (active experiments)
└── {experiment-id} # Symlink to current run
Run ID Format
Each experiment run is identified by a run ID based on the timestamp when the experiment started:
Format:
YYYYMMDD_HHMMSS(e.g.,20250108_143022)If multiple runs start within the same second, a suffix is added:
20250108_143022.1,20250108_143022.2, etc.
This format ensures:
Runs are naturally sorted chronologically
Each run has a unique identifier
Run IDs are human-readable
Experiment Lock
The lock file at experiments/{experiment-id}/lock prevents multiple instances
of the same experiment from running simultaneously. When you start an experiment:
Experimaestro acquires an exclusive lock on this file
If another process holds the lock, a warning is displayed with the hostname of the holder (if available)
The process waits until the lock is released
This ensures data integrity and prevents race conditions.
Environment Information
The environment.json file captures the complete runtime environment when the
experiment starts:
{
"python_version": "3.10.12",
"packages": {
"experimaestro": "2.0.0",
"torch": "2.1.0",
...
},
"editable_packages": {
"my-project": {
"version": "0.1.0",
"path": "/home/user/my-project",
"git": {
"branch": "main",
"commit": "abc123...",
"dirty": false
}
}
},
"projects": [...],
"run": {
"hostname": "compute-node-01",
"started_at": "2025-01-08T14:30:22",
"ended_at": "2025-01-08T15:45:10",
"status": "completed"
}
}
This information is essential for reproducibility - you can recreate the exact environment that was used for any experiment run.
History Cleanup
Experimaestro automatically manages experiment history to prevent disk space accumulation. The cleanup behavior is controlled by settings:
# In ~/.config/experimaestro/settings.yaml
# Global defaults
history:
max_done: 5 # Keep last 5 successful runs per experiment
max_failed: 1 # Keep last 1 failed run per experiment
workspaces:
- id: my-workspace
path: ~/experiments
# Override for this workspace
history:
max_done: 10
max_failed: 2
Cleanup rules:
When an experiment succeeds, all previous failed runs are removed
Only the most recent
max_donesuccessful runs are keptOnly the most recent
max_failedfailed runs are keptRuns with unknown status are never automatically deleted
v1 Experiment Layout
Experimaestro v2 can read experiments created with v1 (the xp/ directory layout).
The state provider automatically detects and handles both layouts:
v1 layout (legacy):
WORKSPACE_DIR/
└── xp/
└── {experiment-id}/
├── jobs/ # Symlinks to job directories (current run)
└── jobs.bak/ # Symlinks to job directories (previous run)
v2 layout (current):
WORKSPACE_DIR/
└── experiments/
└── {experiment-id}/
└── {run-id}/
├── status.json
└── jobs.jsonl
Migration (Optional)
If you want to migrate v1 experiments to v2 layout:
# Preview what will be migrated
experimaestro migrate v1-to-v2 /path/to/workspace --dry-run
# Perform the migration
experimaestro migrate v1-to-v2 /path/to/workspace
# Keep remaining files (renamed to xp_MIGRATED_TO_V2)
experimaestro migrate v1-to-v2 /path/to/workspace --keep-old
The migration:
Moves each experiment from
xp/{exp-id}/toexperiments/{exp-id}/{run-id}/Generates a run ID based on the directory’s modification time
Removes the empty
xp/directoryCreates a broken symlink
xp -> /experimaestro_v2_migrated_workspace_do_not_use_v1
Note
Migration is optional. The TUI, web UI, and CLI commands work with both layouts. However, new experiments always use the v2 layout.
Job Execution
When a job is started by the scheduler, several files are created and used
to coordinate execution and track state. The {scriptname} is derived from
the task identifier (last component after the last ., e.g., MyTask from
my.module.MyTask).
Locking
The lock file at jobs/{task-id}/{job-hash}/.experimaestro/{scriptname}.lock
ensures exclusive access to a job. Both the scheduler and the job process use
this lock at different phases:
Scheduler lock phase: The scheduler acquires the lock before setting up the job directory, writing
status.json, launching the process, and writing the PID file. The lock is released after the process is launched.Process lock phase: The job process acquires the same lock when it starts executing the task. It holds the lock until the task completes and the terminal marker (
.done/.failed) is written.
There is a brief gap between these two phases where the lock is not held but the job is still active.
PID File
The file {scriptname}.pid is written by the scheduler (inside aio_run())
while it still holds the job lock. It contains a JSON object describing the
process:
{"type": "local", "pid": 12345}
The type field identifies the process handler (e.g., local, ssh, slurm).
Process liveness is checked using the launcher-independent Process abstraction
(Process.fromDefinition()), not directly via psutil, so that it works
across different launchers.
Terminal Markers
When a job finishes, the job process writes one of:
{scriptname}.done— job succeeded{scriptname}.failed— job failed, contains a JSON object with failure details (e.g.,{"reason": "FAILED"})
The job process also writes a final status.json with updated timestamps.
If the job is killed externally (e.g., SLURM scancel, OOM killer), these
markers are not written. In that case, cleanup is handled by the scheduler
(if still running) or by a later experimaestro process.
Event Files
While a job is running, state change events are written to temporary event
files in .events/jobs/{task-id}/event-{job-id}-{count}.jsonl. The scheduler
watches this directory to track job progress in real time.
When a job completes, these temporary event files are archived to the permanent
location at jobs/{task-id}/{job-hash}/.experimaestro/events/ and then deleted
from .events/.
Important
The cleanup process that consolidates orphaned event files checks that a job is not active before deleting its event files. A job is considered active if:
Its lock is held, OR
Its PID file references a running process, OR
No terminal marker (
.done/.failed) exists
State Tracking
Experimaestro uses a filesystem-based state tracking system instead of a database. This approach is more robust on network filesystems (NFS) and easier to inspect.
Status File (status.json)
Each experiment run has a status.json file containing the experiment metadata and service state:
{
"version": 1,
"experiment_id": "my-experiment",
"run_id": "20250108_143022",
"events_count": 42,
"hostname": "compute-node-01",
"started_at": "2025-01-08T14:30:22.123456",
"ended_at": "2025-01-08T15:45:10.654321",
"status": "completed",
"finished_jobs": 10,
"failed_jobs": 1,
"services": {
"<service_id>": {
"service_id": "...",
"description": "...",
"class": "mypackage.services.MyService",
"state_dict": {}
}
}
}
Note
Job details are stored separately in jobs.jsonl rather than in status.json.
This reduces memory usage and allows for efficient streaming of job information.
Jobs File (jobs.jsonl)
Lightweight job information is stored in a separate JSONL file (one JSON object per line):
{"job_id": "abc123", "task_id": "my.task.Train", "tags": {"experiment": "v1"}, "timestamp": 1736343025.0}
{"job_id": "def456", "task_id": "my.task.Evaluate", "tags": {}, "timestamp": 1736343030.5}
Each record contains:
job_id: Unique job identifiertask_id: Task type identifiertags: Dictionary of job tagstimestamp: When the job was submitted (Unix timestamp)
Event Log
While an experiment is running, events are streamed to a JSONL file at
.events/experiments/events-{count}@{experiment-id}.jsonl:
{"type": "job_submitted", "job_id": "abc123", "task_id": "my.task", "timestamp": 1736343025.0}
{"type": "job_state_changed", "job_id": "abc123", "state": "running", "timestamp": 1736343030.5}
{"type": "service_added", "service_id": "tensorboard", "description": "TensorBoard", "timestamp": 1736343035.0}
When the experiment completes, events are consolidated into status.json and
the event log is cleaned up.
Current Run Symlink
A symlink at .events/experiments/{experiment-id}/current points to the current
(or most recent) run directory. This allows quick access to the active run
without scanning all run directories