# Workspace Layout

A workspace is a directory where experimaestro stores all experiment data, job
outputs, and metadata. This page describes the directory structure and explains
the purpose of each component.

## Directory Structure

```
WORKSPACE_DIR/
├── .__experimaestro__              # Marker file indicating this is a workspace
├── experiments/                     # Experiment run directories
│   └── {experiment-id}/            # One directory per experiment
│       ├── lock                    # Lock file (prevents concurrent runs)
│       └── {run-id}/               # One directory per run
│           ├── environment.json    # Python environment and git info
│           ├── status.json         # Experiment metadata and services state
│           ├── jobs.jsonl          # Lightweight job information (one per line)
│           ├── jobs/               # Symlinks to job directories
│           ├── results/            # Saved experiment results
│           └── data/               # Serialized configurations
├── jobs/                           # Actual job output directories
│   └── {task-type-id}/
│       └── {job-hash}/
│           ├── params.json         # Job parameters
│           ├── {scriptname}.pid    # Process info (PID, type) while running
│           ├── {scriptname}.done   # Marker file when job succeeds
│           ├── {scriptname}.failed # Marker file when job fails (with reason)
│           ├── {scriptname}.out    # Standard output
│           ├── {scriptname}.err    # Standard error
│           ├── locks.json          # Dynamic dependency locks (tokens)
│           └── .experimaestro/
│               ├── status.json     # Job state, timestamps, progress
│               ├── {scriptname}.lock  # Job lock file
│               ├── task-outputs.jsonl # Dynamic task output events
│               └── events/         # Permanent event storage (after archival)
│                   └── event-{count}.jsonl
├── .events/                        # Temporary event files (watched by scheduler)
│   ├── experiments/
│   │   └── {experiment-id}/
│   │       ├── current             # Symlink to current run directory
│   │       └── events-{count}.jsonl
│   └── jobs/
│       └── {task-type-id}/
│           └── event-{job-id}-{count}.jsonl
├── partials/                       # Shared partial directories
│   └── {task-type-id}/
│       └── {partial-name}/
│           └── {partial-hash}/
├── config/                         # Configuration cache
└── .experimaestro/
    └── experiments/
        ├── events-{count}@{experiment-id}.jsonl  # Event log (active experiments)
        └── {experiment-id}                       # Symlink to current run
```

## Run ID Format

Each experiment run is identified by a **run ID** based on the timestamp when
the experiment started:

- Format: `YYYYMMDD_HHMMSS` (e.g., `20250108_143022`)
- If multiple runs start within the same second, a suffix is added:
  `20250108_143022.1`, `20250108_143022.2`, etc.

This format ensures:
- Runs are naturally sorted chronologically
- Each run has a unique identifier
- Run IDs are human-readable

## Experiment Lock

The lock file at `experiments/{experiment-id}/lock` prevents multiple instances
of the same experiment from running simultaneously. When you start an experiment:

1. Experimaestro acquires an exclusive lock on this file
2. If another process holds the lock, a warning is displayed with the hostname
   of the holder (if available)
3. The process waits until the lock is released

This ensures data integrity and prevents race conditions.

## Environment Information

The `environment.json` file captures the complete runtime environment when the
experiment starts:

```
{
  "python_version": "3.10.12",
  "packages": {
    "experimaestro": "2.0.0",
    "torch": "2.1.0",
    ...
  },
  "editable_packages": {
    "my-project": {
      "version": "0.1.0",
      "path": "/home/user/my-project",
      "git": {
        "branch": "main",
        "commit": "abc123...",
        "dirty": false
      }
    }
  },
  "projects": [...],
  "run": {
    "hostname": "compute-node-01",
    "started_at": "2025-01-08T14:30:22",
    "ended_at": "2025-01-08T15:45:10",
    "status": "completed"
  }
}
```

This information is essential for **reproducibility** - you can recreate the
exact environment that was used for any experiment run.

## History Cleanup

Experimaestro automatically manages experiment history to prevent disk space
accumulation. The cleanup behavior is controlled by settings:

```yaml
# In ~/.config/experimaestro/settings.yaml

# Global defaults
history:
  max_done: 5      # Keep last 5 successful runs per experiment
  max_failed: 1    # Keep last 1 failed run per experiment

workspaces:
  - id: my-workspace
    path: ~/experiments
    # Override for this workspace
    history:
      max_done: 10
      max_failed: 2
```

**Cleanup rules:**
- When an experiment **succeeds**, all previous failed runs are removed
- Only the most recent `max_done` successful runs are kept
- Only the most recent `max_failed` failed runs are kept
- Runs with unknown status are never automatically deleted

## v1 Experiment Layout

Experimaestro v2 can read experiments created with v1 (the `xp/` directory layout).
The state provider automatically detects and handles both layouts:

**v1 layout** (legacy):
```
WORKSPACE_DIR/
└── xp/
    └── {experiment-id}/
        ├── jobs/           # Symlinks to job directories (current run)
        └── jobs.bak/       # Symlinks to job directories (previous run)
```

**v2 layout** (current):
```
WORKSPACE_DIR/
└── experiments/
    └── {experiment-id}/
        └── {run-id}/
            ├── status.json
            └── jobs.jsonl
```

### Migration (Optional)

If you want to migrate v1 experiments to v2 layout:

```bash
# Preview what will be migrated
experimaestro migrate v1-to-v2 /path/to/workspace --dry-run

# Perform the migration
experimaestro migrate v1-to-v2 /path/to/workspace

# Keep remaining files (renamed to xp_MIGRATED_TO_V2)
experimaestro migrate v1-to-v2 /path/to/workspace --keep-old
```

The migration:
- Moves each experiment from `xp/{exp-id}/` to `experiments/{exp-id}/{run-id}/`
- Generates a run ID based on the directory's modification time
- Removes the empty `xp/` directory
- Creates a broken symlink `xp -> /experimaestro_v2_migrated_workspace_do_not_use_v1`

:::{note}
Migration is optional. The TUI, web UI, and CLI commands work with both layouts.
However, new experiments always use the v2 layout.
:::

## Job Execution

When a job is started by the scheduler, several files are created and used
to coordinate execution and track state. The `{scriptname}` is derived from
the task identifier (last component after the last `.`, e.g., `MyTask` from
`my.module.MyTask`).

### Locking

The lock file at `jobs/{task-id}/{job-hash}/.experimaestro/{scriptname}.lock`
ensures exclusive access to a job. Both the scheduler and the job process use
this lock at different phases:

1. **Scheduler lock phase**: The scheduler acquires the lock before setting up
   the job directory, writing `status.json`, launching the process, and writing
   the PID file. The lock is released after the process is launched.
2. **Process lock phase**: The job process acquires the same lock when it starts
   executing the task. It holds the lock until the task completes and the
   terminal marker (`.done`/`.failed`) is written.

There is a brief gap between these two phases where the lock is not held but
the job is still active.

### PID File

The file `{scriptname}.pid` is written by the scheduler (inside `aio_run()`)
while it still holds the job lock. It contains a JSON object describing the
process:

```json
{"type": "local", "pid": 12345}
```

The `type` field identifies the process handler (e.g., `local`, `ssh`, `slurm`).
Process liveness is checked using the launcher-independent `Process` abstraction
(`Process.fromDefinition()`), not directly via `psutil`, so that it works
across different launchers.

### Terminal Markers

When a job finishes, the job process writes one of:
- `{scriptname}.done` — job succeeded
- `{scriptname}.failed` — job failed, contains a JSON object with failure details
  (e.g., `{"reason": "FAILED"}`)

The job process also writes a final `status.json` with updated timestamps.

If the job is killed externally (e.g., SLURM `scancel`, OOM killer), these
markers are **not** written. In that case, cleanup is handled by the scheduler
(if still running) or by a later experimaestro process.

### Event Files

While a job is running, state change events are written to temporary event
files in `.events/jobs/{task-id}/event-{job-id}-{count}.jsonl`. The scheduler
watches this directory to track job progress in real time.

When a job completes, these temporary event files are archived to the permanent
location at `jobs/{task-id}/{job-hash}/.experimaestro/events/` and then deleted
from `.events/`.

:::{important}
The cleanup process that consolidates orphaned event files checks that a job is
not active before deleting its event files. A job is considered active if:
- Its lock is held, OR
- Its PID file references a running process, OR
- No terminal marker (`.done`/`.failed`) exists
:::

## State Tracking

Experimaestro uses a filesystem-based state tracking system instead of a database.
This approach is more robust on network filesystems (NFS) and easier to inspect.

### Status File (`status.json`)

Each experiment run has a `status.json` file containing the experiment metadata and service state:

```json
{
  "version": 1,
  "experiment_id": "my-experiment",
  "run_id": "20250108_143022",
  "events_count": 42,
  "hostname": "compute-node-01",
  "started_at": "2025-01-08T14:30:22.123456",
  "ended_at": "2025-01-08T15:45:10.654321",
  "status": "completed",
  "finished_jobs": 10,
  "failed_jobs": 1,
  "services": {
    "<service_id>": {
      "service_id": "...",
      "description": "...",
      "class": "mypackage.services.MyService",
      "state_dict": {}
    }
  }
}
```

:::{note}
Job details are stored separately in `jobs.jsonl` rather than in `status.json`.
This reduces memory usage and allows for efficient streaming of job information.
:::

### Jobs File (`jobs.jsonl`)

Lightweight job information is stored in a separate JSONL file (one JSON object per line):

```json
{"job_id": "abc123", "task_id": "my.task.Train", "tags": {"experiment": "v1"}, "timestamp": 1736343025.0}
{"job_id": "def456", "task_id": "my.task.Evaluate", "tags": {}, "timestamp": 1736343030.5}
```

Each record contains:
- `job_id`: Unique job identifier
- `task_id`: Task type identifier
- `tags`: Dictionary of job tags
- `timestamp`: When the job was submitted (Unix timestamp)

### Event Log

While an experiment is running, events are streamed to a JSONL file at
`.events/experiments/events-{count}@{experiment-id}.jsonl`:

```json
{"type": "job_submitted", "job_id": "abc123", "task_id": "my.task", "timestamp": 1736343025.0}
{"type": "job_state_changed", "job_id": "abc123", "state": "running", "timestamp": 1736343030.5}
{"type": "service_added", "service_id": "tensorboard", "description": "TensorBoard", "timestamp": 1736343035.0}
```

When the experiment completes, events are consolidated into `status.json` and
the event log is cleaned up.

### Current Run Symlink

A symlink at `.events/experiments/{experiment-id}/current` points to the current
(or most recent) run directory. This allows quick access to the active run
without scanning all run directories

## Related Commands

```bash
# List experiments in a workspace
experimaestro experiments --workdir /path/to/workspace list

# Monitor experiments (TUI)
experimaestro experiments --workdir /path/to/workspace monitor --console

# Monitor experiments (Web UI)
experimaestro experiments --workdir /path/to/workspace monitor --port 12345

# Check for orphan jobs
experimaestro orphans /path/to/workspace
```