Configurations
The most important concept in Experimaestro is the configuration. A configuration object specifies, in a structured way, the parameters and settings needed to execute a task or a series of tasks. In general, a configuration object covers the following:
- Parameter Definition: The configuration object defines the parameters needed for a task or experiment. These parameters can include data file paths, numerical values, strings, and other types of inputs that are essential for the execution of the task.
- Parameter Types and Validation: Each parameter in the configuration object can have a specific type, such as integer, float, string, or more complex data types. The configuration object can include validation rules to ensure that the parameters provided are in the correct format and within expected ranges.
- Default Values: For some parameters, the configuration object can specify default values. This is useful in cases where a parameter is optional or where a common default value is typically used.
- Documentation: The configuration object can include documentation for each parameter, explaining its purpose and how it should be used. This documentation is crucial for making the configuration user-friendly, especially in complex experiments.
- Hierarchy and Nesting: In complex tasks, the configuration object can be hierarchical or nested. This means that a configuration object can contain other configuration objects, allowing for the organization of parameters in a structured manner.
- Linking to Tasks: The configuration object is typically linked to specific tasks or experiments. When a task is executed, it retrieves the necessary parameters from the associated configuration object.
- Flexibility and Extensibility: Configuration objects are designed to be flexible and extensible, allowing users to add new parameters or modify existing ones as the requirements of the task evolve.
- Serialization: Configuration objects are often serializable, meaning they can be saved to a file and loaded back. This is important for reproducibility and for sharing configurations between users or for future use.
In practical use, a configuration object acts as a bridge between the user (or another system) and the execution environment, ensuring that all the necessary inputs are provided and validated before the task is run. This structured approach aids in automating and scaling experiments, as well as in ensuring their reproducibility.
Configuration identifiers
A configuration identifier is a unique identifier associated with a specific configuration object. It is used to manage and reference configurations, especially in complex systems where multiple configurations are used:
- Uniqueness: A configuration identifier is unique for each configuration instance. This uniqueness ensures that each configuration can be distinctly identified and referenced, avoiding confusion or overlap with other configurations.
- MD5 Hashes: Experimaestro utilizes MD5 hashes as configuration identifiers. These hashes are unique to each configuration, ensuring a distinct and consistent identifier for every set of parameters.
- Run-Once Guarantee: The unique MD5 hash identifiers ensure that each task associated with a specific configuration is executed only once. This is particularly important in avoiding redundant computations and ensuring the efficiency of the workflow.
How is the identifier computed?
The principle is the following. Any value can be associated with a unique byte string: the byte string is obtained by outputting the type of the value (e.g. string, ir.adhoc.dataset) and the value itself as a binary string. Configurations and tasks (objects) are handled specially: their keys are sorted in ascending lexicographic order, which ensures the uniqueness of the representation.
Moreover:
- Default values are removed (e.g. k1 when set to 0.9). This handles the situation where a new experimental parameter is added (e.g. a new loss component): giving the new parameter a default value adds it without invalidating all the previously run experiments.
- Ignored values are removed (e.g. the number of threads when indexing, the path where the index is stored).
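The rules above can be sketched in a few lines of plain Python. This is a minimal illustration, not Experimaestro's actual encoding: the `identifier` function, its arguments, and the use of JSON as a stand-in for the binary representation are all assumptions made for the example.

```python
import hashlib
import json


def identifier(type_name, values, defaults=None, ignored=()):
    """Return an MD5 hex digest for a set of parameters (illustrative only)."""
    defaults = defaults or {}
    kept = {
        key: value
        for key, value in values.items()
        # Drop ignored values and values equal to their default
        if key not in ignored and not (key in defaults and defaults[key] == value)
    }
    # Sorting keys gives one canonical byte string for the same content
    payload = json.dumps([type_name, kept], sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


old = identifier("my.model", {"gamma": 0.5})
# A new parameter left at its default does not change the identifier
new = identifier("my.model", {"gamma": 0.5, "k1": 0.9}, defaults={"k1": 0.9})
# An ignored value does not change it either
threads = identifier("my.model", {"gamma": 0.5, "threads": 8}, ignored={"threads"})
# A non-default value does change it
changed = identifier("my.model", {"gamma": 0.5, "k1": 1.2}, defaults={"k1": 0.9})
```

Here `old`, `new` and `threads` are equal, while `changed` differs, which is the behaviour that lets old experiments be reused after a parameter is added.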
Defining a configuration
A configuration is defined whenever a class derives from Config.
When an identifier is not given, it is computed as __module__.__qualname__. In that case, it is possible to shorten the definition using the Config class as a base class.
Example
from experimaestro import Param, Config

class MyModel(Config):
    __xpmid__ = "my.model"

    gamma: Param[float]
defines a configuration with name my.model and one argument gamma that has the type float.
__xpmid__ can also be a class method to generate dynamic ids for all descendant configurations.
When __xpmid__ is missing, the qualified name is used.
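The resolution order described above can be sketched as follows. The helper `config_identifier` is hypothetical and uses plain classes rather than Config; in particular, the choice not to inherit a string `__xpmid__` from a parent class is an assumption of this sketch, not documented behaviour.

```python
def config_identifier(cls):
    """Resolve the identifier of a class (illustrative only)."""
    own = cls.__dict__.get("__xpmid__")
    if isinstance(own, str):
        # Explicit identifier given on the class itself
        return own
    inherited = getattr(cls, "__xpmid__", None)
    if callable(inherited):
        # Class method generating an identifier for each descendant
        return inherited()
    # Fallback: the qualified name
    return f"{cls.__module__}.{cls.__qualname__}"


class MyModel:
    __xpmid__ = "my.model"


class Plain:
    pass


class Base:
    @classmethod
    def __xpmid__(cls):
        return f"dynamic.{cls.__name__.lower()}"


class Child(Base):
    pass


explicit_id = config_identifier(MyModel)
default_id = config_identifier(Plain)
dynamic_id = config_identifier(Child)
```

`explicit_id` is the string given on the class, `default_id` is built from the module and qualified name, and `dynamic_id` comes from the class method defined on the parent.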
Object hierarchy
When deriving A from Config, experimaestro creates two auxiliary types:
- A configuration object A.Config deriving from TypeConfig and A
- A value object A.Value deriving from A
For a class B deriving from A, B.Value derives from B and A.Value.
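The inheritance pattern of the value types can be reproduced with plain classes and `type()`. This is only a sketch of the resulting hierarchy: the helper `add_value_type` is hypothetical and says nothing about how Experimaestro builds these types internally.

```python
class A:
    pass


class B(A):
    pass


def add_value_type(cls):
    """Attach a Value type deriving from cls and from its parents' Value types."""
    parent_values = tuple(
        base.__dict__["Value"] for base in cls.__bases__ if "Value" in base.__dict__
    )
    cls.Value = type("Value", (cls, *parent_values), {})
    return cls.Value


add_value_type(A)  # A.Value derives from A
add_value_type(B)  # B.Value derives from B and A.Value

# Method resolution order of B.Value, as a list of classes
mro = list(B.Value.__mro__)
```

With this layout, an instance of `B.Value` is also an instance of `B`, `A.Value` and `A`.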
Deprecating a configuration or attributes
When a configuration is moved (or equivalently its __xpmid__ changed), its signature changes, and thus the same tasks can be run twice. To avoid this, use the @deprecate annotation.
Example
from experimaestro import Param, Config, deprecate

class NewConfiguration(Config):
    pass

@deprecate
class OldConfiguration(NewConfiguration):
    # Only pass is allowed here
    pass
It is possible to deprecate a parameter or option:
Example
from typing import List
from experimaestro import Param, Config, deprecate

class Learning(Config):
    # Loss is assumed to be a configuration defined elsewhere
    losses: Param[List[Loss]] = []

    @deprecate
    def loss(self, value):
        # Checking that the new param is not used
        assert len(self.losses) == 0
        # We allow several losses to be defined now
        self.losses.append(value)
Warning: the signature will change when deprecating attributes.
To fix the identifiers, one can use the deprecated command. This will create symbolic links so that old jobs are preserved and re-used.
experimaestro deprecated list WORKDIR
Object life cycle
Initialisation
During task execution, the objects are constructed following these steps:
- The object is constructed using self.__init__()
- The attributes are set (e.g. gamma in the example above)
- self.__post_init__() is called (if the method exists)
- Pre-tasks are run (if any, see below)
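The four steps above can be mimicked with a plain class. The `build` helper and the `Model` class below are hypothetical and only illustrate the order of operations; pre-tasks are represented as simple callables.

```python
class Model:
    def __init__(self):
        self.events = ["__init__"]

    def __post_init__(self):
        # Attributes are already set when this method runs
        self.events.append(f"__post_init__ (gamma={self.gamma})")


def build(cls, params, pre_tasks=()):
    obj = cls()  # 1. construct the object
    for name, value in params.items():  # 2. set the attributes
        setattr(obj, name, value)
    post_init = getattr(obj, "__post_init__", None)
    if post_init is not None:  # 3. call __post_init__ if it exists
        post_init()
    for task in pre_tasks:  # 4. run pre-tasks, if any
        task(obj)
    return obj


model = build(
    Model,
    {"gamma": 0.5},
    pre_tasks=[lambda m: m.events.append("pre-task")],
)
```

After the call, `model.events` records the three stages in the order they ran, and `model.gamma` holds the value that was set between construction and `__post_init__`.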
Sometimes, it is necessary to postpone a part of the initialization of a configuration object because it depends on an external processing. In this case, the initializer decorator can be used:
from experimaestro import Config, initializer

class MyConfig(Config):
    # The decorator ensures the initializer can only be called once
    @initializer
    def initialize(self, *args, **kwargs):
        # Do whatever is needed with the arguments provided by the caller
        pass
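To show what a run-once guarantee can look like, here is a small stand-alone decorator. It is a hypothetical `call_once`, not the library's initializer; the text does not say whether a second call is ignored or rejected, and this sketch assumes it is silently ignored.

```python
import functools


def call_once(method):
    """Run the decorated method at most once per instance (illustrative only)."""
    flag = f"_{method.__name__}_done"

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        if getattr(self, flag, False):
            # Already initialized: ignore later calls
            return None
        setattr(self, flag, True)
        return method(self, *args, **kwargs)

    return wrapper


class Vocabulary:
    def __init__(self):
        self.loads = 0

    @call_once
    def initialize(self, words):
        self.loads += 1
        self.words = list(words)


vocab = Vocabulary()
vocab.initialize(["a", "b"])
vocab.initialize(["c"])  # ignored: the initializer already ran
```

The second call leaves `vocab.words` untouched, so code that shares the object can call `initialize` defensively without repeating expensive work.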
Initialization tasks
Sometimes, it is necessary to restore an object state from disk, and we want to separate the loading mechanism from the configuration logic; in that case, LightweightTask (a Config which must be subclassed) can be used.
Initialization tasks can only be used when submitting a task. Unlike pre-tasks, they are not associated with any configuration or task, and as such their use is more explicit (and leads to fewer errors and bugs).
To take the example of a model learner task, it would return a model loader only:
class ModelLearner(Task):
    model: Param[Model]

    def task_outputs(self, dep):
        # Return only a loader for the learned model
        return dep(ModelLoader(model=self.model))
When using the model:
model_loader = learner.submit()
Evaluate(model=model).submit(init_tasks=[model_loader])
Pre-tasks (deprecated)
Pre-tasks can be associated with a Configuration and run automatically when the configuration is loaded. Given their implicit nature, they are now deprecated in favor of initialization tasks.
from experimaestro import Config, LightweightTask

class Model(Config):
    ...

class ModelLoader(LightweightTask):
    model: Param[Model]

    def execute(self):
        # Access the configuration through self.model
        self.model.initialized = True
Lightweight tasks are executed automatically by using the add_pretasks method of a configuration object.
class ModelLearner(Task):
    model: Param[Model]

    def task_outputs(self, dep):
        # Copy the configuration so that the pre-task is attached to the copy
        model = copyconfig(self.model)
        return model.add_pretasks(dep(ModelLoader(model=model)))
To initialize a single Config, SerializationLWTask, a child class of LightweightTask, can be used: it has a parameter value (of type Config).
The typical use case is when the state can be recovered from disk. In that case, PathSerializationLWTask can be used: it is a lightweight task configuration object with two fields (value and path).
import torch
from experimaestro import Config, PathSerializationLWTask

class Model(Config):
    ...

class SerializedModel(PathSerializationLWTask):
    def execute(self):
        # Loads the model from disk
        data = torch.load(self.path)
        # The configuration to restore is held in the value field
        self.value.load_state_dict(data)
It is possible to copy pre-tasks from one configuration to another by using add_pretasks_from. For instance
config2.add_pretasks_from(config1)
copies the pre-tasks of config1 to config2.
Types
Possible types are:
- basic Python types (str, int, float, bool) and paths (pathlib.Path)
- lists, using typing.List[T]
- enumerations, using Enum from the enum package
- dictionaries (support for basic types in keys only) with typing.Dict[U, V]
- Other configurations
Parameters
class MyConfig(Config):
    """My configuration

    Long description of the configuration.

    Attributes:
        x: The parameter x
        y: The parameter y
    """

    # With default value
    x: Param[type] = value

    # Alternative syntax, useful to avoid class properties
    x: Annotated[type, default(value)]

    # Without default value
    y: Param[type]

    # Using a docstring
    z: Param[int]
    """Most important parameter of the model"""
- name defines the name of the argument, which can be retrieved from the instance self (class) or passed as an argument (function)
- type is the type of the argument (see Types above)
- value is the default value of the argument (if any). If the value is equal to the default, the argument is not included in the signature computation. This makes it possible to add new parameters without changing the signature of past experiments (provided the configuration with the default value is equivalent to the old one; otherwise, do not use a default value!).
Constants
Constants are special parameters that cannot be modified. They are useful to note that the behavior of a configuration/task has changed, and thus that the signature should not be the same (as the result of the processing will differ).
class MyConfig(Config):
    # Constant
    version: Constant[str] = "2.1"
Metadata
Metadata are parameters which are ignored during the signature computation. For instance, the human-readable name of a model would be a metadata. They are declared as parameters, but using the Meta type hint:
class MyConfig(Config):
    """
    Attributes:
        count: The number of documents in the collection
    """

    count: Meta[type]
It is also possible to dynamically change whether an argument is a meta-parameter using the setmeta function:
from experimaestro import setmeta

# Forces the parameter to be a meta-parameter
a = setmeta(A(), True)

# Forces the parameter not to be a meta-parameter
a = setmeta(A(), False)
Path option
It is possible to define special options that will be set to paths relative to the task directory. For instance,
class MyConfig(Config):
    output: Annotated[Path, pathgenerator("output.txt")]
defines the instance variable output as a path .../output.txt within the task directory. To ensure there are no conflicts, paths are defined by following the config/task path, i.e. if the executed task has a parameter model, model has a parameter optimization, and optimization has a path parameter generating loss.txt, then the file will be ./out/model/optimization/loss.txt.
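The naming rule can be illustrated with pathlib. The function `generated_path` is hypothetical and only shows how nesting parameter names avoids collisions; the `out` directory is taken from the example above.

```python
from pathlib import PurePosixPath


def generated_path(task_dir, parameter_chain, filename):
    """Build a path by following the chain of parameter names (illustrative only)."""
    return PurePosixPath(task_dir).joinpath(*parameter_chain, filename)


# The task has a parameter model, which has a parameter optimization
loss_path = generated_path("./out", ["model", "optimization"], "loss.txt")

# A path declared directly on the task has an empty chain
output_path = generated_path("./out", [], "output.txt")
```

Two configurations that both generate a file named loss.txt end up in different directories as long as they are reached through different parameter names.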
Validation
If a configuration has a __validate__ method, it is called to validate the values before a task is submitted. This makes it possible to fail fast when parameters are not valid.
class ModelLearn(Config):
    batch_size: Param[int] = 100
    micro_batch_size: Param[int] = 100
    parameters: Annotated[Path, pathgenerator("parameters.pth")]

    def __validate__(self):
        assert self.batch_size % self.micro_batch_size == 0
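The fail-fast behaviour can be sketched without the library. In the example below, `ModelLearn` is a plain class and `submit` is a hypothetical stand-in for task submission; the only point is that validation runs before any work is scheduled.

```python
class ModelLearn:
    def __init__(self, batch_size=100, micro_batch_size=100):
        self.batch_size = batch_size
        self.micro_batch_size = micro_batch_size

    def __validate__(self):
        assert self.batch_size % self.micro_batch_size == 0, (
            "batch_size must be a multiple of micro_batch_size"
        )


def submit(config):
    """Validate the configuration, then pretend to schedule it."""
    validate = getattr(config, "__validate__", None)
    if validate is not None:
        validate()  # raises before anything is scheduled
    return "submitted"


status = submit(ModelLearn(batch_size=128, micro_batch_size=32))

try:
    submit(ModelLearn(batch_size=100, micro_batch_size=30))
    rejected = False
except AssertionError:
    rejected = True
```

The second configuration is rejected at submission time, before any computation could start with an invalid batch size.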