Launchers
Types
direct (default)
By default, jobs are launched directly by the scheduler using Python scripts.
DirectLauncher
Bases: Launcher
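For instance, submitting a task without specifying any launcher uses the direct launcher. A minimal sketch, assuming mytask is an already defined experimaestro task and that the working directory and experiment name below are placeholders:

from experimaestro import experiment

with experiment("workdir", "my-experiment"):
    # No launcher specified: the task is launched directly
    mytask().submit()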
Slurm (since 0.8.7)
The Slurm workload manager launcher is supported.
It is possible to use different settings for different jobs by using the config method of the launcher:
from experimaestro import experiment
from experimaestro.launchers.slurm import SlurmLauncher

launcher = SlurmLauncher(nodes=1)
gpulauncher = launcher.config(gpus_per_node=1)

with experiment(launcher=launcher):
    # Default
    mytask().submit()

    # If needed, options can be used
    mytask().submit(launcher=gpulauncher)
To use launcher configuration files, one can use an automatic conversion tool:
scontrol show nodes | experimaestro launchers slurm convert
SlurmOptions
dataclass

account: Optional[str] = None
    The account for launching the job

constraint: Optional[str] = None
    Logic expression on node features (as defined by the administrator)

cpus_per_task: Optional[str] = None
    Number of CPUs requested per task

exclude: Optional[str] = None
    List of hosts to exclude

gpus: Optional[int] = None
    Number of GPUs

gpus_per_node: Optional[int] = None
    Number of GPUs per node

mem: Optional[str] = None
    Requested memory on the node (in megabytes by default)

mem_per_gpu: Optional[str] = None
    Requested memory per allocated GPU (size with units: K, M, G, or T)

nodelist: Optional[str] = None
    Request a specific list of hosts

nodes: Optional[int] = 1
    Number of requested nodes

partition: Optional[str] = None
    The requested partition

qos: Optional[str] = None
    The requested Quality of Service

time: Optional[str] = None
    Requested time
args()
Returns the corresponding options
format_time(duration_s)
staticmethod
Formats a duration (in seconds) for the SLURM time option
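As an illustration, these options can be set when creating a launcher. A minimal sketch, where the partition name and resource values are placeholder assumptions:

from experimaestro.launchers.slurm import SlurmLauncher, SlurmOptions

launcher = SlurmLauncher(
    options=SlurmOptions(
        partition="gpu",   # placeholder partition name
        time="10:00:00",   # requested time
        gpus_per_node=2,   # two GPUs per node
        mem="32G",         # requested memory on the node
    )
)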
SlurmLauncher
Bases: Launcher
Slurm workload manager launcher
https://slurm.schedmd.com/documentation.html
__init__(*, connector=None, options=None, interval=60, main=None, launcherenv=None, binpath='/usr/bin')
Arguments:
    main: Main Slurm launcher, to avoid launching too many polling jobs
    interval: Seconds between polling job statuses
config(**kwargs)
Returns a new Slurm launcher with the given configuration
key()
Returns a dictionary characterizing this launcher when calling sacct, etc.
processbuilder()
Returns the process builder for this launcher
By default, returns the associated connector builder
scriptbuilder()
Returns the script builder
We assume a *nix environment; this should be changed to PythonScriptBuilder when that builder is working
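When several Slurm launchers are used in the same experiment, the main argument can share a single polling loop between them. A minimal sketch, assuming the arguments behave as documented above:

from experimaestro.launchers.slurm import SlurmLauncher, SlurmOptions

# This launcher polls the job statuses every 30 seconds...
main_launcher = SlurmLauncher(interval=30)

# ...and this one relies on it instead of polling on its own
gpu_launcher = SlurmLauncher(
    main=main_launcher,
    options=SlurmOptions(gpus_per_node=1),
)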
Launcher file (since 1.2.4)
The most flexible way to define potential launchers is to use a launchers.py file within the configuration directory:
from typing import Set

from experimaestro.launcherfinder import (
    HostRequirement,
    HostSpecification,
    CudaSpecification,
    CPUSpecification,
)
from experimaestro.launchers.slurm import SlurmLauncher, SlurmOptions
from experimaestro.connectors.local import LocalConnector


def find_launcher(requirements: HostRequirement, tags: Set[str] = set()):
    """Find a launcher"""
    if match := requirements.match(HostSpecification(cuda=[])):
        # No GPU: run directly
        return LocalConnector.instance()

    if match := requirements.match(
        HostSpecification(
            max_duration=100 * 3600,
            cpu=CPUSpecification(cores=32, memory=129 * (1024**3)),
            cuda=[CudaSpecification(memory=24 * (1024**3)) for _ in range(8)],
        )
    ):
        if len(match.requirement.cuda_gpus) > 0:
            return SlurmLauncher(
                connector=LocalConnector.instance(),
                options=SlurmOptions(gpus_per_node=len(match.requirement.cuda_gpus)),
            )

    # Could not find a host
    return None
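With this file in place, experiment code can request a launcher through the launcher finder. A minimal sketch, assuming that the module-level find_launcher dispatches to the find_launcher function defined in launchers.py, and where the requirement string is only an example matching the specification above:

from experimaestro.launcherfinder import find_launcher

# Served by the Slurm launcher defined in launchers.py
# (at most 100 hours, up to 8 GPUs with 24G of memory each)
launcher = find_launcher("duration=50h & cuda(mem=20G) * 4")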
Launcher configuration file (since 0.11)
This option is deprecated since it is less flexible than the previous one, and the added complexity is not worth it.
In order to automate the process of choosing the right launcher, a launchers.yaml
configuration file can be written.
# Finds a launcher so that we get 2 CUDA GPUs with 14G of memory (at least) on each
from experimaestro.launcherfinder import cuda_gpu, find_launcher
gpulauncher = find_launcher(cuda_gpu(mem="14G") * 2)
Simple strings can also be parsed (for configuration files):
from experimaestro.launcherfinder import find_launcher
find_launcher("""duration=4 days & cuda(mem=4G) * 2 & cpu(mem=400M, cores=4)""")
Tags
Tags can be used to filter out some launchers:
from experimaestro.launcherfinder import find_launcher
find_launcher("""duration=4 days & cuda(mem=4G) * 2 & cpu(mem=400M, cores=4)""", tags=["slurm"])
This request only matches launchers tagged with slurm (see the example below).
Search process
Launcher groups are sorted by decreasing weight and filtered by tags before the search. Then, within each launcher group, experimaestro returns the first matching launcher (the details are type-specific). For instance, in the configuration below, the local launcher with weight 5 is tried before the one with weight 4.
Example of a configuration
This configuration defines four launchers in two groups (two local, two through Slurm).
# --- Local launchers

local:
  - # Standard launcher for small tasks
    connector: local
    weight: 5

    # Describes the available CPUs
    cpu: { cores: 40, memory: 1G }

  - # Intensive launcher with more memory and a GPU
    connector: local
    weight: 4

    # Use a token to avoid running too many tasks
    tokens:
      localtoken: 1

    cpu: { cores: 40, memory: 8G }
    gpu:
      - model: GTX1080
        count: 1
        memory: 8116MiB

# --- Slurm launchers

slurm:
  # We can use a fully manual SLURM configuration
  - # ID for this launcher configuration
    id: manual

    # Slurm clients are on the local machine
    connector: local

    # Tags for filtering the launcher configurations
    tags: [slurm]

    # Weight used to select a launcher configuration (higher is better)
    weight: 3

    # Describes the GPU features and links them to the two
    # possible properties (memory and number of GPUs)
    features_regex:
      # GPU3 means "3 GPUs" on the node
      - GPU(?P<cuda_count>\d+)
      # GPUM32G means "32G" of GPU memory
      - GPUM(?P<cuda_memory>\d+G)

    # Set to false if memory constraints cannot be
    # used (uses mem_per_cpu in that case to reserve the
    # appropriate number of cores)
    use_memory_contraint: true

    # Quality of service
    qos:
      qos_gpu-t3:
        # Jobs have at most 20 hours to complete
        max_duration: 20h
        # We need to reserve at least one GPU
        min_gpu: 1
        # Priority increase for this QoS
        priority: 1
      qos_gpu-t4:
        max_duration: 100h
        min_gpu: 1

    configuration:
      cpu:
        # Memory allocated for one core
        mem_per_cpu: 2048M
      gpu:
        # At least 70% of the GPU memory should be requested
        # (from version 0.11.8). For instance, if the GPU has 64G,
        # we won't target it if we request less than 44.8G (= 70% of 64G)
        min_mem_ratio: 0.7

    partitions:
      # Partition "big GPUs"...
      biggpus:
        # ...has two types of nodes
        nodes:
          - # Nodes yep/yop
            hosts: [yop, yep]
            # Associated features
            features: [GPU3, GPUM48G]
          - hosts: [yip, yup, yap]
            features: [GPU2, GPUM24G]

      # Partition "small GPUs"
      smallgpus:
        nodes:
          - hosts: [alpha, beta, gamma, delta]
            features: [GPU2, GPUM24G]

      gpu_p4:
        # QoS that must be used with this partition
        qos: [qos_gpu-t3, qos_gpu-t4]
        # Accounts that must be used for this partition
        accounts: [iea@a100]
        # Default node configuration
        configuration:
          gpu:
            count: 8
            model: A100
            memory: 40GiB
        nodes:
          - count: 0
            features:
              - Tesla
              - a100
        priority: 1

  # We can also use SLURM for a semi-automatic configuration
  - id: auto
    connector: local
    tags: [slurm]

    # Describes the GPU features and links them to the two
    # possible properties (memory and number of GPUs)
    features_regex:
      - GPU(?P<cuda_count>\d+)
      - GPUM(?P<cuda_memory>\d+G)

    partitions:
      # Disable the "heavy" partition
      heavy: { disabled: true }

    # Use `sinfo` to query partition/node details (e.g. name and features)
    query_slurm: true
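As an illustration, with the configuration above, the following request can be served by the manual launcher. A minimal sketch, where the requirement string is only an assumption:

from experimaestro.launcherfinder import find_launcher

# Matches the "manual" configuration: it is tagged with slurm, and the
# biggpus partition has nodes with 3 GPUs (GPU3) of 48G each (GPUM48G)
launcher = find_launcher("duration=10h & cuda(mem=40G) * 2", tags=["slurm"])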