You are managing a deep learning workload on a Slurm cluster with multiple GPU nodes, but you notice that jobs requesting multiple GPUs are waiting for long periods even though there are available resources on some nodes.
How would you optimize job scheduling for multi-GPU workloads?
You are an administrator managing a large-scale Kubernetes-based GPU cluster using Run:AI.
To automate repetitive administrative tasks and efficiently manage resources across multiple nodes, which of the following is essential when using the Run:AI Administrator CLI for environments where automation or scripting is required?
A system administrator needs to collect the information below:
GPU behavior monitoring
GPU configuration management
GPU policy oversight
GPU health and diagnostics
GPU accounting and process statistics
NVSwitch configuration and monitoring
What single tool should be used?
You are deploying AI applications at the edge and want to ensure they continue running even if one of the servers at an edge location fails.
How can you configure NVIDIA Fleet Command to achieve this?
You are managing a high-availability (HA) cluster that hosts mission-critical applications. One of the nodes in the cluster has failed, but the application remains available to users.
What mechanism is responsible for ensuring that the workload continues to run without interruption?
An administrator is troubleshooting issues with NVIDIA GPUDirect Storage and must ensure optimal data transfer performance.
What step should be taken first?
After completing the installation of a Kubernetes cluster on your NVIDIA DGX systems using BCM, how can you verify that all worker nodes are properly registered and ready?
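One common way to check registration is to run `kubectl get nodes` from a host with cluster credentials and confirm that every worker reports a STATUS of Ready. The following is a minimal Python sketch, assuming kubectl's default table output. The helper name `not_ready_nodes` and the sample node names and versions are hypothetical and serve only as illustration.

```python
def not_ready_nodes(kubectl_output: str) -> list[str]:
    """Return names of nodes whose STATUS column is not exactly 'Ready'.

    Expects the default table printed by `kubectl get nodes`
    (header row: NAME STATUS ROLES AGE VERSION).
    """
    lines = [ln for ln in kubectl_output.strip().splitlines() if ln.strip()]
    bad = []
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        name, status = fields[0], fields[1]
        # STATUS can also be e.g. "NotReady" or "Ready,SchedulingDisabled";
        # anything other than plain "Ready" is reported.
        if status != "Ready":
            bad.append(name)
    return bad


# Illustrative sample only; node names and versions are made up.
SAMPLE = """\
NAME     STATUS     ROLES           AGE   VERSION
dgx-01   Ready      <none>          12d   v1.27.4
dgx-02   NotReady   <none>          12d   v1.27.4
head-01  Ready      control-plane   12d   v1.27.4
"""

if __name__ == "__main__":
    print(not_ready_nodes(SAMPLE))
```

In practice the text would come from the real command output rather than a hard-coded sample; an empty result suggests all listed nodes are Ready.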
Which of the following correctly identifies the key components of a Kubernetes cluster and their roles?
A system administrator is troubleshooting a Docker container that crashes unexpectedly due to a segmentation fault. They want to generate and analyze core dumps to identify the root cause of the crash.
Why would generating core dumps be a critical step in troubleshooting this issue?
In which two (2) ways does the pre-configured GPU Operator in the NVIDIA Enterprise Catalog differ from the GPU Operator in the public NGC catalog? (Choose two.)
A system administrator wants to run these two commands in Base Command Manager:
main showprofile
device status apc01
What command should the system administrator use from the management node system shell?
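As a hedged illustration, `cmsh` can run commands non-interactively through its `-c` option, with multiple commands separated by semicolons inside one quoted string. The sketch below assumes the two commands are `main showprofile` and `device status apc01`; the helper name `cmsh_oneliner` is hypothetical.

```python
import shlex


def cmsh_oneliner(*commands: str) -> list[str]:
    """Build an argv list that runs several cmsh commands in one call."""
    # cmsh -c takes a single string; individual commands are
    # separated by semicolons within that string.
    return ["cmsh", "-c", "; ".join(commands)]


argv = cmsh_oneliner("main showprofile", "device status apc01")

if __name__ == "__main__":
    # Print a shell-quoted form suitable for pasting into a terminal.
    print(shlex.join(argv))
```

The argv list could be passed to `subprocess.run` on the management node; this sketch only builds and prints the command.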
An organization only needs basic network monitoring and validation tools.
Which UFM platform should they use?
A Slurm user needs to display real-time information about the running processes and resource usage of a Slurm job.
Which command should be used?
You have successfully pulled a TensorFlow container from NGC and now need to run it on your stand-alone GPU-enabled server.
Which command should you use to ensure that the container has access to all available GPUs?
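For reference, Docker exposes GPUs to a container through the `--gpus` flag (available with the NVIDIA Container Toolkit installed), and `--gpus all` requests every GPU on the host. The following is a minimal sketch that assembles such a command. The helper name `docker_run_all_gpus` is hypothetical, and the image tag is left as a placeholder because the source does not specify one.

```python
def docker_run_all_gpus(image: str, *container_args: str) -> list[str]:
    """Build a `docker run` argv that exposes all host GPUs to the container."""
    return [
        "docker", "run",
        "--gpus", "all",  # request every GPU visible on the host
        "--rm",           # remove the container when it exits
        "-it",            # interactive session with a TTY
        image,
        *container_args,
    ]


# "<tag>" is a placeholder; substitute the tag that was actually pulled.
argv = docker_run_all_gpus("nvcr.io/nvidia/tensorflow:<tag>")

if __name__ == "__main__":
    print(" ".join(argv))
```

Running `nvidia-smi` inside the container is one way to confirm which GPUs are visible.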