About computational performance
The main performance bottleneck of physo is free constant optimization. In non-parallel execution mode, run time therefore scales almost linearly with the number of free constant optimization steps and with the number of trial expressions per epoch (i.e. the batch size).
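As a rough illustration of this near-linear scaling, the sketch below extrapolates a time per epoch from a reference measurement. The helper name `estimate_epoch_time` and the purely proportional model are assumptions made for illustration and are not part of physo. The model ignores fixed overheads and the number of free constants, so the result is at best an order-of-magnitude estimate.

```python
def estimate_epoch_time(ref_time_s, ref_batch_size, ref_opti_steps,
                        batch_size, opti_steps):
    """Extrapolate a non-parallel time per epoch (hypothetical helper).

    Assumes time is proportional to batch_size * opti_steps, which is
    only approximately true ("almost linear").
    """
    scale = (batch_size / ref_batch_size) * (opti_steps / ref_opti_steps)
    return ref_time_s * scale


# Illustrative reference value (10 s per epoch), not a measurement.
ref = estimate_epoch_time(10.0, 1_000, 15, 1_000, 15)
# Doubling the batch size at a fixed number of optimization steps
# doubles the estimated time per epoch under this model.
doubled = estimate_epoch_time(10.0, 1_000, 15, 2_000, 15)
print(ref, doubled)
```

In practice, measuring one or two epochs on the target machine gives a more reliable reference than any extrapolation.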
In addition, generating monitoring plots takes a flat ~3s, so when the time per epoch is low we suggest generating them no more often than every 10 epochs.
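To see why the plotting interval matters for short epochs, the following sketch computes the fraction of run time spent on monitoring plots. The function name and the example epoch time are assumptions chosen for illustration; only the ~3s plot cost comes from the text above.

```python
PLOT_TIME_S = 3.0  # approximate cost of one monitoring plot (from the text)


def plot_overhead_fraction(epoch_time_s, plot_every_n_epochs,
                           plot_time_s=PLOT_TIME_S):
    """Fraction of total run time spent generating monitoring plots."""
    compute = epoch_time_s * plot_every_n_epochs
    return plot_time_s / (compute + plot_time_s)


# Assumed ~5 s epochs: plotting every epoch vs. every 10 epochs.
every_epoch = plot_overhead_fraction(5.0, 1)   # 3 / (5 + 3)
every_ten = plot_overhead_fraction(5.0, 10)    # 3 / (50 + 3)
print(f"{every_epoch:.1%} vs {every_ten:.1%}")
```

With long epochs (hundreds of seconds) the overhead is negligible at any interval, so this only matters for fast configurations.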
Please note that physo typically runs faster on a CPU than on a GPU.
Expected perfs (SR)
Summary of expected performances with physo (in parallel mode):
| Time / epoch | Device | Config | Batch size | Free const. opti. steps | Example | # free const. |
|---|---|---|---|---|---|---|
| ~100s | CPU: Intel W-2155 10c/20t <br>RAM: 128 GB | config1 | 10k | 20 | eg: demo_damped_harmonic_oscillator | 3 |
| ~70s | CPU: Mac M1 <br>RAM: 16 GB | config1 | 10k | 20 | eg: demo_damped_harmonic_oscillator | 3 |
| ~300s | CPU: Intel i7 4770 <br>RAM: 16 GB | config1 | 10k | 20 | eg: demo_damped_harmonic_oscillator | 3 |
| ~400s | GPU: Nvidia GV100 <br>VRAM: 32 GB | config1 | 10k | 20 | eg: demo_damped_harmonic_oscillator | 3 |
| ~5s (1s wop) | CPU: Intel W-2155 10c/20t <br>RAM: 128 GB | config0 | 1k | 15 | eg: sr_quick_start | 2 |
| ~5s (1s wop) | CPU: Mac M1 <br>RAM: 16 GB | config0 | 1k | 15 | eg: sr_quick_start | 2 |
| ~30s | CPU: Intel i7 4770 <br>RAM: 16 GB | config0 | 1k | 15 | eg: sr_quick_start | 2 |
| ~5s | GPU: Nvidia GV100 <br>VRAM: 32 GB | config0 | 1k | 15 | eg: sr_quick_start | 2 |
*wop = without parallelization
Expected perfs (Class SR)
Summary of expected performances with physo (in parallel mode):
| Time / epoch | Device | Config | Batch size | Free const. opti. steps | Example | # free const. |
|---|---|---|---|---|---|---|
| ~1000s wop | CPU: Intel W-2155 10c/20t <br>RAM: 128 GB | config1b | 10k | 60 | eg: MW_streams_run | 100 |
| ~ | CPU: Mac M1 <br>RAM: 16 GB | config1b | 10k | 60 | eg: MW_streams_run | 100 |
| ~ | GPU: Nvidia GV100 <br>VRAM: 32 GB | config1b | 10k | 60 | eg: MW_streams_run | 100 |
| ~100s wop | CPU: Intel W-2155 10c/20t <br>RAM: 128 GB | config0b | 1k | 30 | eg: class_sr_quick_start | 10 |
| ~20s (40s wop) | CPU: Mac M1 <br>RAM: 16 GB | config0b | 1k | 30 | eg: class_sr_quick_start | 10 |
| ~ | GPU: Nvidia GV100 <br>VRAM: 32 GB | config0b | 1k | 30 | eg: class_sr_quick_start | 10 |
*wop = without parallelization
In Class SR mode, the number of free constants is typically much higher than in SR mode, so parallelization is generally not worth it.
Parallel mode
Parallel free constant optimization
Parallel constant optimization is enabled if and only if:
- The system is compatible (checked by physo.physym.batch_execute.ParallelExeAvailability).
- parallel_mode = True in the reward computation configuration.
- physo.physym.reward.USE_PARALLEL_OPTI_CONST = True.
By default, both of these settings are True, as parallel mode is typically faster for this task. However, with a batch size <10k, communication overhead can outweigh the gain, so it might be worth disabling parallelization for this task via:
physo.physym.reward.USE_PARALLEL_OPTI_CONST = False
or by simply disabling parallel mode when calling physo.SR or physo.ClassSR:
physo.SR(
    ...
    parallel_mode = False
    ...
)
Parallel reward computation
Parallel reward computation is enabled if and only if:
- The system is compatible (checked by physo.physym.batch_execute.ParallelExeAvailability).
- parallel_mode = True in the reward computation configuration.
- physo.physym.reward.USE_PARALLEL_EXE = True.
By default, physo.physym.reward.USE_PARALLEL_EXE = False, i.e. parallelization is not used for this task, because communication overhead typically makes it slower for such individually inexpensive tasks.
However, with \(>10^6\) data points parallel execution tends to be faster, so we recommend enabling it by setting:
physo.physym.reward.USE_PARALLEL_EXE = True
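The enabling conditions and rules of thumb from the two sections above can be summarized in a short sketch. Both helper functions are hypothetical and not part of physo; they only restate the documented conditions, defaults and thresholds (batch size 10k, \(10^6\) data points), which are guidelines rather than hard limits.

```python
def effective_parallel_state(system_compatible, parallel_mode,
                             use_parallel_opti_const=True,
                             use_parallel_exe=False):
    """Combine the three conditions listed for each task (hypothetical helper).

    Keyword defaults mirror the documented defaults of
    USE_PARALLEL_OPTI_CONST (True) and USE_PARALLEL_EXE (False).
    """
    base = system_compatible and parallel_mode
    return {
        "opti_const": base and use_parallel_opti_const,
        "reward_exe": base and use_parallel_exe,
    }


def suggested_flags(batch_size, n_data_points):
    """Rule-of-thumb flag values from the thresholds quoted in the text."""
    return {
        # Below ~10k expressions per batch, overhead may dominate.
        "USE_PARALLEL_OPTI_CONST": batch_size >= 10_000,
        # Above ~1e6 data points, parallel execution tends to pay off.
        "USE_PARALLEL_EXE": n_data_points > 10**6,
    }


print(effective_parallel_state(system_compatible=True, parallel_mode=True))
print(suggested_flags(batch_size=1_000, n_data_points=1_000))
```

The suggested values would then be assigned to the corresponding attributes of physo.physym.reward before starting a run. Timing a few epochs with each setting remains the most reliable way to choose.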
Miscellaneous
- Efficiency curves (nb. of CPUs vs individual task time) are produced by batch_execute_UnitParallelTest.py in a realistic toy case with batch size = 10k and \(10^3\) data points.
- Parallel mode is not available from Jupyter notebooks on any system (Mac/Linux/Windows); run .py scripts to use it.
- The use of parallel_mode can be managed in the reward configuration, which can itself be managed through a hyperparameter config file (see the config folder). This is handy for running a benchmark on an HPC with a predetermined number of CPUs.
- Disabling parallel mode entirely via USE_PARALLEL_EXE=False and USE_PARALLEL_OPTI_CONST=False is recommended before running physo in a debugger.
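The debugger recommendation above could be automated along the following lines. This is a minimal sketch: `debugger_attached` and `disable_parallel` are hypothetical helpers, and the check via `sys.gettrace()` is best-effort only (debuggers built on `sys.monitoring` in recent Python versions may not be detected). A `SimpleNamespace` stands in for physo.physym.reward so that the example runs without physo installed.

```python
import sys
from types import SimpleNamespace


def debugger_attached():
    """Best-effort check: many Python debuggers install a trace function."""
    return sys.gettrace() is not None


def disable_parallel(reward_module):
    """Turn off both parallel flags on a module-like object.

    For real use, pass physo.physym.reward; the flag names are the ones
    given in the text. Any object with these attributes works.
    """
    reward_module.USE_PARALLEL_EXE = False
    reward_module.USE_PARALLEL_OPTI_CONST = False
    return reward_module


# Stand-in for physo.physym.reward (assumption, for demonstration only).
fake_reward = SimpleNamespace(USE_PARALLEL_EXE=True,
                              USE_PARALLEL_OPTI_CONST=True)
disable_parallel(fake_reward)
print(fake_reward.USE_PARALLEL_EXE, fake_reward.USE_PARALLEL_OPTI_CONST)
```

In a real script, one would typically guard the call with `if debugger_attached():` so that normal runs keep their configured parallel settings.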
Efficiency curve in a realistic case
Computational time for optimizing free constants \(\{a, b\}\) in \(y = a \sin (b x) + e^{-x}\) over 20 iterations using \(10^3\) data points, when running this task \(10\,000\) times in parallel on an Apple M1 CPU (typically fast single-core performance) and an Intel Xeon W-2155 CPU (typically high core count).
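To relate such curves to the timings quoted in the tables above, the sketch below computes speedup and parallel efficiency using their usual definitions. These may differ from the exact quantity plotted by batch_execute_UnitParallelTest.py, and the number of CPUs passed to `efficiency` is an assumed value, since the tables do not state how many processes were used.

```python
def speedup(serial_time_s, parallel_time_s):
    """Speedup > 1 means parallel mode is faster than serial mode."""
    return serial_time_s / parallel_time_s


def efficiency(serial_time_s, parallel_time_s, n_cpus):
    """Parallel efficiency: speedup divided by the number of CPUs used."""
    return speedup(serial_time_s, parallel_time_s) / n_cpus


# Time / epoch values quoted in the tables (wop = without parallelization).
# class_sr_quick_start on Mac M1: ~20s parallel, 40s wop.
m1_class_sr = speedup(serial_time_s=40.0, parallel_time_s=20.0)
# sr_quick_start on Intel W-2155: ~5s parallel, 1s wop.
w2155_sr = speedup(serial_time_s=1.0, parallel_time_s=5.0)
print(m1_class_sr, w2155_sr)

# n_cpus=4 is an assumption made purely for illustration.
print(efficiency(40.0, 20.0, n_cpus=4))
```

A speedup below 1, as in the second case, is the situation where communication overhead makes disabling parallel mode worthwhile.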