The ensicompute cluster
The cluster consists of 13 Dell R740 servers.
Hardware configuration
All 13 servers share the same base configuration. They differ only in the number and model of the GPUs they carry.
| Nodes | CPU model | #CPUs x #CoresPerCPU x #ThreadsPerCore | RAM | #GPUs | GPU model | VRAM |
|---|---|---|---|---|---|---|
| tesla | Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz | 2 x 10 x 2 = 40 | 128 GB | 1 | Tesla V100 | 32 GB |
| turing-1 ... turing-11 | Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz | 2 x 10 x 2 = 40 | 128 GB | 3 | Quadro RTX 6000 | 24 GB |
| ampere | Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz | 2 x 10 x 2 = 40 | 128 GB | 3 | A40 | 48 GB |
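
The per-node arithmetic in the table (sockets x cores per socket x threads per core) can be tallied across the whole cluster. The following is a minimal sketch that transcribes the table by hand; the `NodeGroup` class and the `totals` helper are hypothetical names introduced for illustration and are not part of any cluster tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeGroup:
    """One row of the hardware table (hypothetical helper type)."""
    name: str
    nodes: int
    sockets: int
    cores_per_socket: int
    threads_per_core: int
    gpus_per_node: int
    gpu_model: str

    @property
    def logical_cpus_per_node(self) -> int:
        # #CPUs x #CoresPerCPU x #ThreadsPerCore, as in the table
        return self.sockets * self.cores_per_socket * self.threads_per_core


# Values transcribed from the table above.
GROUPS = [
    NodeGroup("tesla", 1, 2, 10, 2, 1, "Tesla V100"),
    NodeGroup("turing", 11, 2, 10, 2, 3, "Quadro RTX 6000"),
    NodeGroup("ampere", 1, 2, 10, 2, 3, "A40"),
]


def totals(groups):
    """Return (node count, logical CPU count, GPU count per model)."""
    nodes = sum(g.nodes for g in groups)
    logical_cpus = sum(g.nodes * g.logical_cpus_per_node for g in groups)
    gpus = {g.gpu_model: g.nodes * g.gpus_per_node for g in groups}
    return nodes, logical_cpus, gpus


if __name__ == "__main__":
    nodes, logical_cpus, gpus = totals(GROUPS)
    print(f"nodes={nodes} logical_cpus={logical_cpus}")
    for model, count in gpus.items():
        print(f"{model}: {count}")
```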
Focus on the GPUs
The platform has one NVIDIA V100 card, 3 NVIDIA A40 cards and 33 NVIDIA RTX 6000 cards (11 nodes with 3 cards each). The sections below detail their characteristics and how they are connected at the hardware level (GPU / core affinity).
NVIDIA RTX 6000
The turing-[1..11] compute nodes each have 3 NVIDIA RTX 6000 GPUs.
root@turing-11:~# nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro RTX 6000                Off |   00000000:3B:00.0 Off |                    0 |
| N/A   30C    P8             13W /  250W |       0MiB /  23040MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Quadro RTX 6000                Off |   00000000:AF:00.0 Off |                    0 |
| N/A   30C    P8             13W /  250W |       0MiB /  23040MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Quadro RTX 6000                Off |   00000000:D8:00.0 Off |                    0 |
| N/A   30C    P8             13W /  250W |       0MiB /  23040MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
root@turing-11:~# nvidia-smi topo -m
        GPU0    GPU1    GPU2    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     SYS     0,2,4,6,8,10    0               N/A
GPU1    SYS      X      SYS     1,3,5,7,9,11    1               N/A
GPU2    SYS     SYS      X      1,3,5,7,9,11    1               N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
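
The CPU affinity column shows which cores sit closest to each GPU. One way to use it, sketched below under assumptions, is to restrict a process to the cores listed for the GPU it drives. The affinity strings are copied from the listing above; `parse_affinity` and `pin_to_gpu` are hypothetical helper names. `os.sched_getaffinity` and `os.sched_setaffinity` are Linux-only, and a job scheduler may already constrain the cores available to a job, so the sketch only keeps cores the process is allowed to use.

```python
import os


def parse_affinity(spec: str) -> set[int]:
    """Turn a core list such as '0,2,4' or '0-3,8' into a set of core ids."""
    cores: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cores.update(range(int(first), int(last) + 1))
        else:
            cores.add(int(part))
    return cores


# CPU affinity per GPU index, copied from `nvidia-smi topo -m` above.
GPU_CPU_AFFINITY = {
    0: "0,2,4,6,8,10",
    1: "1,3,5,7,9,11",
    2: "1,3,5,7,9,11",
}


def pin_to_gpu(gpu_index: int) -> set[int]:
    """Pin the current process to the cores listed for gpu_index.

    Only cores the process may already run on are kept. If none of them
    is available, the affinity is left unchanged and an empty set is returned.
    """
    wanted = parse_affinity(GPU_CPU_AFFINITY[gpu_index])
    allowed = os.sched_getaffinity(0)  # Linux only
    usable = wanted & allowed
    if usable:
        os.sched_setaffinity(0, usable)
    return usable
```

A job that uses GPU 1 would call `pin_to_gpu(1)` once at start-up, before spawning worker threads or processes, so that they inherit the restricted core set.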
NVIDIA V100
The tesla compute node has one NVIDIA V100 card.
root@tesla:~# nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-PCIE-32GB           Off |   00000000:3B:00.0 Off |                    0 |
| N/A   34C    P0             25W /  250W |       0MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@tesla:~# nvidia-smi topo -m
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0,2,4,6,8,10    0               N/A

Legend:
  X    = Self
NVIDIA A40
The ampere compute node has 3 NVIDIA A40 cards.
root@ampere:~# nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A40                     Off |   00000000:3B:00.0 Off |                    0 |
|  0%   31C    P8             22W /  300W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A40                     Off |   00000000:AF:00.0 Off |                    0 |
|  0%   32C    P8             21W /  300W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A40                     Off |   00000000:D8:00.0 Off |                    0 |
|  0%   31C    P8             12W /  300W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
root@ampere:~# nvidia-smi topo -m
        GPU0    GPU1    GPU2    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     SYS     0,2,4,6,8,10    0               N/A
GPU1    SYS      X      SYS     1,3,5,7,9,11    1               N/A
GPU2    SYS     SYS      X      1,3,5,7,9,11    1               N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
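
The GPU model and memory figures shown in the listings can also be read programmatically. The sketch below assumes `nvidia-smi` is on the PATH of the node where it runs and uses its `--query-gpu` CSV mode; `parse_gpu_csv` and `list_gpus` are hypothetical helper names, not part of any cluster tooling.

```python
import subprocess

# CSV query: one line per GPU, no header, no unit suffix on the memory field.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text: str) -> list[dict]:
    """Parse lines of the form 'index, name, memory_in_MiB'."""
    gpus = []
    for line in text.strip().splitlines():
        index, name, memory = (field.strip() for field in line.split(","))
        gpus.append(
            {"index": int(index), "name": name, "memory_mib": int(memory)}
        )
    return gpus


def list_gpus() -> list[dict]:
    """Run nvidia-smi on the local node and return its GPUs."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    # Must be run on a node with the NVIDIA driver installed.
    for gpu in list_gpus():
        print(f"GPU {gpu['index']}: {gpu['name']} ({gpu['memory_mib']} MiB)")
```

Keeping the parsing separate from the `subprocess` call makes it possible to check the parser against captured output on a machine without a GPU.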