
NCP-AIO Questions and Answers

Question # 6

You are managing a deep learning workload on a Slurm cluster with multiple GPU nodes, but you notice that jobs requesting multiple GPUs are waiting for long periods even though there are available resources on some nodes.

How would you optimize job scheduling for multi-GPU workloads?

A.

Reduce memory allocation per job so more jobs can run concurrently, freeing up resources faster for multi-GPU workloads.

B.

Ensure that job scripts use --gres=gpu: and configure Slurm’s backfill scheduler to prioritize multi-GPU jobs efficiently.

C.

Set up separate partitions for single-GPU and multi-GPU jobs to avoid resource conflicts between them.

D.

Increase time limits for smaller jobs so they don’t interfere with multi-GPU job scheduling.
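Option B's approach can be sketched as a job script. The directives below are illustrative (job name, GPU count, and time limit are placeholders), and backfill scheduling additionally requires `SchedulerType=sched/backfill` in slurm.conf:

```shell
#!/bin/bash
# Illustrative multi-GPU Slurm job script; all values are placeholders.
#SBATCH --job-name=dl-train
#SBATCH --nodes=1
#SBATCH --gres=gpu:4          # request 4 GPUs explicitly via GRES
#SBATCH --time=04:00:00       # an accurate limit lets backfill plan around the job

srun python train.py          # train.py is a placeholder training script
```

Backfill works best when all jobs set realistic --time limits, since the scheduler uses them to slot smaller jobs into scheduling gaps without delaying the waiting multi-GPU job.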

Question # 7

You are an administrator managing a large-scale Kubernetes-based GPU cluster using Run:AI.

To automate repetitive administrative tasks and efficiently manage resources across multiple nodes, which of the following is essential when using the Run:AI Administrator CLI for environments where automation or scripting is required?

A.

Use the runai-adm command to directly update Kubernetes nodes without requiring kubectl.

B.

Use the CLI to manually allocate specific GPUs to individual jobs for better resource management.

C.

Ensure that the Kubernetes configuration file is set up with cluster administrative rights before using the CLI.

D.

Install the CLI on Windows machines to take advantage of its scripting capabilities.

Question # 8

A system administrator needs to collect the information below:

    GPU behavior monitoring

    GPU configuration management

    GPU policy oversight

    GPU health and diagnostics

    GPU accounting and process statistics

    NVSwitch configuration and monitoring

What single tool should be used?

A.

nvidia-smi

B.

CUDA Toolkit

C.

DCGM

D.

Nsight Systems
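DCGM's command-line front end, `dcgmi`, covers each of the listed areas. A minimal sketch (the group and run-level arguments are illustrative, and the block is a no-op on machines without DCGM installed):

```shell
# Sketch of DCGM's CLI covering monitoring, health, diagnostics, and
# accounting; guarded so it degrades gracefully without DCGM.
if command -v dcgmi >/dev/null 2>&1; then
  dcgmi discovery -l      # enumerate GPUs and NVSwitches DCGM can see
  dcgmi health -g 0 -c    # health checks for GPU group 0
  dcgmi diag -r 1         # short diagnostic run
  dcgmi stats -g 0 -e     # enable accounting/process statistics for group 0
else
  echo "dcgmi not installed; commands shown for reference"
fi
```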

Question # 9

You are deploying AI applications at the edge and want to ensure they continue running even if one of the servers at an edge location fails.

How can you configure NVIDIA Fleet Command to achieve this?

A.

Use Secure NFS support for data redundancy.

B.

Set up over-the-air updates to automatically restart failed applications.

C.

Enable high availability for edge clusters.

D.

Configure Fleet Command's multi-instance GPU (MIG) to handle failover.

Question # 10

You are managing a high availability (HA) cluster that hosts mission-critical applications. One of the nodes in the cluster has failed, but the application remains available to users.

What mechanism is responsible for ensuring that the workload continues to run without interruption?

A.

Load balancing across all nodes in the cluster.

B.

Manual intervention by the system administrator to restart services.

C.

The failover mechanism that automatically transfers workloads to a standby node.

D.

Data replication between nodes to ensure data integrity.

Question # 11

An administrator is troubleshooting issues with NVIDIA GPUDirect Storage and must ensure optimal data transfer performance.

What step should be taken first?

A.

Increase the GPU's core clock frequency.

B.

Upgrade the CPU to a higher clock speed.

C.

Check for compatible RDMA-capable network hardware and configurations.

D.

Install additional GPU memory (VRAM).
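The first-pass check in option C usually means confirming RDMA-capable NICs and the nvidia-fs kernel module before tuning anything else. A minimal sketch, assuming a Linux host (it prints advisory messages where tools are absent):

```shell
# Check RDMA hardware/driver state first.
if command -v ibstat >/dev/null 2>&1; then
  ibstat                              # RDMA adapter and link state
else
  echo "ibstat not found; install rdma-core or MLNX_OFED"
fi
# GPUDirect Storage also needs the nvidia-fs kernel module:
lsmod 2>/dev/null | grep -q nvidia_fs \
  && echo "nvidia-fs module loaded" \
  || echo "nvidia-fs module not loaded"
```

GDS also ships its own support-matrix tool, gdscheck (its install path varies by CUDA version), which reports driver, filesystem, and NIC compatibility in one pass.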

Question # 12

After completing the installation of a Kubernetes cluster on your NVIDIA DGX systems using BCM, how can you verify that all worker nodes are properly registered and ready?

A.

Run kubectl get nodes to verify that all worker nodes show a status of “Ready”.

B.

Run kubectl get pods to check if all worker pods are running as expected.

C.

Check each node manually by logging in via SSH and verifying system status with systemctl.
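Option A in practice, guarded in case kubectl or a cluster is unavailable:

```shell
if command -v kubectl >/dev/null 2>&1; then
  kubectl get nodes -o wide            # every worker should report Ready
  # Quick scan for any node that is not Ready:
  kubectl get nodes --no-headers 2>/dev/null \
    | awk '$2 != "Ready" {print $1 " is " $2}'
else
  echo "kubectl not found"
fi
```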

Question # 13

Which of the following correctly identifies the key components of a Kubernetes cluster and their roles?

A.

The control plane consists of the kube-apiserver, etcd, kube-scheduler, and kube-controller-manager, while worker nodes run kubelet and kube-proxy.

B.

Worker nodes manage the kube-apiserver and etcd, while the control plane handles all container runtimes.

C.

The control plane is responsible for running all application containers, while worker nodes manage network traffic through etcd.

D.

The control plane includes the kubelet and kube-proxy, and worker nodes are responsible for running etcd and the scheduler.
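In a kubeadm-style cluster, the control-plane components named in option A run as pods in the kube-system namespace, while kubelet runs as a host service and kube-proxy runs on every node. A guarded check (assumes kubectl and cluster access):

```shell
if command -v kubectl >/dev/null 2>&1; then
  # Control-plane pods: kube-apiserver, etcd, scheduler, controller-manager.
  kubectl get pods -n kube-system -o wide 2>/dev/null \
    | grep -E 'kube-apiserver|etcd|kube-scheduler|kube-controller-manager' \
    || echo "no control-plane pods listed (managed control plane?)"
else
  echo "kubectl not found"
fi
```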

Question # 14

A system administrator is troubleshooting a Docker container that crashes unexpectedly due to a segmentation fault. They want to generate and analyze core dumps to identify the root cause of the crash.

Why would generating core dumps be a critical step in troubleshooting this issue?

A.

Core dumps prevent future crashes by stopping any further execution of the faulty process.

B.

Core dumps provide real-time logs that can be used to monitor ongoing application performance.

C.

Core dumps restore the process to its previous state, often fixing the error-causing crash.

D.

Core dumps capture the memory state of the process at the time of the crash.
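Option D is the point of a core dump: a post-mortem snapshot of the process's memory at the moment of the segfault. A sketch of enabling and inspecting one for a containerized process (image and binary names are placeholders; containers share the host kernel's core_pattern):

```shell
# Allow core files of any size in this shell; containers need the same
# via `docker run --ulimit core=-1`.
ulimit -c unlimited 2>/dev/null || true
# Where the kernel writes cores (shared by host and containers):
cat /proc/sys/kernel/core_pattern 2>/dev/null || echo "non-Linux host"
# Typical flow, shown as comments (names are placeholders):
#   docker run --ulimit core=-1 my-crashing-image
#   gdb /path/to/binary /path/to/corefile    # then 'bt' prints the crashing stack
```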

Question # 15

In which two (2) ways does the pre-configured GPU Operator in the NVIDIA Enterprise Catalog differ from the GPU Operator in the public NGC catalog? (Choose two.)

A.

It is configured to use a prebuilt vGPU driver image.

B.

It supports Mixed Strategies for Kubernetes deployments.

C.

It automatically installs the NVIDIA Datacenter driver.

D.

It is configured to use the NVIDIA License System (NLS).

E.

It additionally installs Network Operator.

Question # 16

A system administrator wants to run these two commands in Base Command Manager.

main showprofile

device status apc01

What command should the system administrator use from the management node system shell?

A.

cmsh -c "main showprofile; device status apc01"

B.

cmsh -p "main showprofile; device status apc01"

C.

system -c "main showprofile; device status apc01"

D.

cmsh-system -c "main showprofile; device status apc01"
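For reference, cmsh's -c flag takes a semicolon-separated command string, so both commands run in a single non-interactive invocation (guarded, since cmsh exists only on a BCM head node):

```shell
if command -v cmsh >/dev/null 2>&1; then
  cmsh -c "main showprofile; device status apc01"
else
  echo "cmsh is only present on a Base Command Manager head node"
fi
```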

Question # 17

An organization only needs basic network monitoring and validation tools.

Which UFM platform should they use?

A.

UFM Enterprise

B.

UFM Telemetry

C.

UFM Cyber-AI

D.

UFM Pro

Question # 18

A Slurm user needs to display real-time information about the running processes and resource usage of a Slurm job.

Which command should be used?

A.

smap -j

B.

scontrol show job

C.

sstat -j

D.

sinfo -j
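sstat reports live resource usage for a running job's steps. A sketch, where 12345 is a placeholder job ID and the format fields are one common selection:

```shell
if command -v sstat >/dev/null 2>&1; then
  # Live per-step usage for running job 12345 (placeholder ID):
  sstat -j 12345 --format=JobID,AveCPU,AveRSS,MaxRSS,MaxVMSize
else
  echo "sstat is available only on a Slurm cluster"
fi
```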

Question # 19

You have successfully pulled a TensorFlow container from NGC and now need to run it on your stand-alone GPU-enabled server.

Which command should you use to ensure that the container has access to all available GPUs?

A.

kubectl create pod --gpu=all nvcr.io/nvidia/tensorflow:

B.

docker run nvcr.io/nvidia/tensorflow:

C.

docker start nvcr.io/nvidia/tensorflow:

D.

docker run --gpus all nvcr.io/nvidia/tensorflow:
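The full invocation behind option D looks like the sketch below; the image tag is a placeholder (pick a current one from NGC), and `--gpus all` requires the NVIDIA Container Toolkit on the host.

```shell
# Compose and show the command rather than pulling the (large) image here;
# the tag 24.03-tf2-py3 is a placeholder.
cmd='docker run --rm --gpus all nvcr.io/nvidia/tensorflow:24.03-tf2-py3'
echo "$cmd"
# Inside the container, nvidia-smi should then list every host GPU.
```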
