
NCP-AII Questions and Answers

Question # 6

A system administrator needs to install a GPU/DPU in a server. The server has a free PCI-e slot, there are enough free PCI-e lanes, and there is enough room for the card. Which procedure should be followed?

A.

Ensure the server has enough power. Verify compatibility of cables with server's platform. Make sure the server is down to remove cables safely. Do not wear an ESD bracelet.

B.

Ensure the server has enough power. Make sure the server is down to remove cables safely. Wear an ESD bracelet.

C.

Ensure the server has enough power. Make sure the server is up and running with attached cables. Wear an ESD bracelet.

D.

Ensure the server has enough power. Verify compatibility of cables with server's platform. Make sure the server is down to remove cables safely. Wear an ESD bracelet.

Question # 7

An engineer needs to completely remove NVIDIA GPU drivers from an Ubuntu 22.04 system to troubleshoot conflicts. Which command sequence ensures all driver components are purged?

A.

sudo ubuntu-drivers uninstall

B.

sudo rm -rf /usr/lib/nvidia

C.

sudo apt-get remove nvidia-driver-550

D.

sudo apt-get purge nvidia-* && sudo apt-get autoremove
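The purge-and-autoremove sequence in option D can be illustrated with a quick post-purge check. This is a sketch: the canned package listing stands in for live `dpkg -l` output, and the file name is hypothetical; on a real Ubuntu system you would pipe `dpkg -l` directly.

```shell
# Illustrative check that no NVIDIA driver packages remain after a purge.
# A canned listing (pkglist.txt) stands in for real `dpkg -l` output.
cat > pkglist.txt <<'EOF'
ii  bash       5.1-6   amd64  GNU Bourne Again SHell
ii  coreutils  8.32-4  amd64  GNU core utilities
EOF
# On a live Ubuntu system the sequence would be:
#   sudo apt-get purge 'nvidia-*' && sudo apt-get autoremove
# after which `dpkg -l | grep '^ii.*nvidia'` should print nothing.
grep '^ii.*nvidia' pkglist.txt || echo "no NVIDIA driver packages remain"
rm -f pkglist.txt
```

Note the quoting of `'nvidia-*'`: unquoted, the shell may expand the glob against files in the current directory instead of passing the pattern to apt.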

Question # 8

During cluster deployment, the UFM Cable Validation Tool reports "Wrong-neighbor" errors on multiple InfiniBand links. What is the most efficient way to resolve this issue?

A.

Reboot all leaf switches to force LLDP rediscovery.

B.

Replace all affected cables with higher-grade OM5 fiber optics.

C.

Verify LLDP data against topology files and remediate.

D.

Disable FEC on all switches to bypass neighbor validation.
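The topology-comparison approach described in option C can be sketched as a diff between planned and discovered neighbor tables. The file names and the "switch:port peer:port" line format below are hypothetical; real data would come from UFM or LLDP exports.

```shell
# Illustrative wrong-neighbor check: diff discovered links against the
# planned topology. File names and record format are assumptions.
cat > expected.txt <<'EOF'
leaf01:p1 spine01:p7
leaf01:p2 spine02:p7
EOF
cat > discovered.txt <<'EOF'
leaf01:p1 spine01:p7
leaf01:p2 spine03:p7
EOF
# Any differing lines are miscabled links to remediate against the plan.
diff expected.txt discovered.txt || echo "mismatch: recable per topology file"
rm -f expected.txt discovered.txt
```

Here the second link is plugged into spine03 instead of spine02, so the diff pinpoints exactly which cable to move rather than forcing reboots or cable replacements.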

Question # 9

After a recent OS upgrade, you need to reinstall NVIDIA GPU and DOCA drivers to support both AI training and accelerated networking. What best practice ensures successful installation and full hardware capability?

A.

Download and install only the specific versions of GPU and DOCA drivers listed as compatible with the current OS and hardware.

B.

Apply legacy drivers for hardware released within the last two years to maintain maximum compatibility across versions.

C.

Install the latest available drivers directly from the NVIDIA website.

D.

Use the default drivers provided by the Linux distribution, unless an installation fails during system boot.

Question # 10

An administrator needs to add additional GPUs to an existing server. What are the server requirements to check before installing new GPUs?

A.

Sufficient networking, water-cooled racks, adequate rack power, sufficient storage, and rack space.

B.

Sufficient storage, sufficient networking, adequate rack power, and compatible hardware.

C.

Sufficient CPU capacity, PCIe slot allocation, sufficient cooling in the data center, and rack space.

D.

Sufficient cooling in the data center, adequate rack power, compatible hardware, and PCIe slot allocation.

Question # 11

You are standing up an NVIDIA DGX system for enterprise production. Stakeholder teams require system reliability, performance consistency under load, and proper escalation processes before release. A recent system in another cluster experienced intermittent GPU failures attributed to missed early-stage validation. Which deployment and validation sequence best addresses production readiness and mitigates the risk of avoidable downtime or performance loss?

A.

Install latest OS images and drivers, confirm OS and container functionality, invite users for a monitored production trial, and collect workload feedback to plan any further diagnostics or updates.

B.

Complete hardware and cabling, power on the system, update firmware and drivers, run full hardware health checks and stress diagnostics using NVSM, verify all GPU and system sensor logs, and validate GPU accessibility.

C.

Update network topology, assign static IPs and DNS entries, register the system with NVIDIA, then conduct basic OS-level checks and enable user access after login testing is successful.

D.

Power on the system, install all AI frameworks, configure the CUDA and library stack, set up user environments, then plan stress tests and diagnostics as part of ongoing routine operations.

Question # 12

A user wants to restrict a Docker container to use only GPUs 0 and 2. Which command achieves this?

A.

docker run --gpus '"device=0,2"' nvidia/cuda:12.1-base nvidia-smi

B.

docker run -e NVIDIA_VISIBLE_DEVICES=0,2 nvidia/cuda:12.1-base nvidia-smi

C.

docker run --gpus all nvidia/cuda:12.1-base nvidia-smi -id=0,2

D.

docker run --device /dev/nvidia0,/dev/nvidia2 nvidia/cuda:12.1-base nvidia-smi

Question # 13

You are preparing a Spectrum-based NVIDIA switch for integration into a production AI cluster. To confirm that all modules are running approved firmware versions, you must use the appropriate command from the switch CLI. Which step most accurately meets best practices for ensuring firmware version consistency and cluster compliance?

A.

Use the show version command to check the overall system version and confirm all modules are updated if the system version matches the documentation.

B.

Use the show interfaces status command to verify all ports are up, and proceed with integration if no interface errors are shown.

C.

Use the show asic-version command to review firmware versions for all modules, then compare these against the documented approved versions.

D.

Use the show inventory command to display component details and serial numbers before proceeding, as this output will include all firmware versions for review.

Question # 14

ClusterKit's NCCL bandwidth test shows 350 GB/s on a 400G InfiniBand fabric. How should this result be interpreted?

A.

Optimal performance, indicating healthy fabric and GPUDirect RDMA.

B.

Suboptimal performance; requires FEC tuning to reach 380+ GB/s.

C.

Critical failure; expected is >390 GB/s for HDR InfiniBand.

D.

Inconclusive; rerun with --stress=cpu to validate.

Question # 15

A customer is designing an AI Factory for enterprise-scale deployments and wants to ensure redundancy and load balancing for the management and storage networks. Which feature should be implemented on the Ethernet switches?

A.

Implement redundant switches with spanning tree protocol.

B.

MLAG for bonded interfaces across redundant switches.

C.

Use only one switch for all management and storage traffic.

D.

Disable VLANs and use unmanaged switches.

Question # 16

An administrator is configuring node categories in BCM for a DGX BasePOD cluster. They need to group all NVIDIA DGX H200 nodes under a dedicated category for GPU-accelerated workloads. Which approach aligns with NVIDIA's recommended BCM practices?

A.

Assign nodes to the "login" category to simplify Slurm integration.

B.

Create a new "dgx-h200" category and assign all DGX H200 nodes to it.

C.

Use the existing "dgxnodes" category without modification, as it is preconfigured for all DGX systems.

D.

Avoid categories and configure each DGX node individually via CLI.

Question # 17

You are tasked with setting up High Availability (HA) for NVIDIA Base Command Manager (BCM) in a new GPU cluster. The cluster consists of a primary head node, a secondary head node, and several compute nodes. The requirements are automatic failover of BCM services, minimal disruption to workloads, and proper cluster health monitoring during and after installation. During your BCM HA installation and configuration process, which two of the following actions are mandatory for ensuring a robust and verified HA cluster configuration?

Pick the 2 correct responses below.

A.

Assign a floating Virtual IP address that can automatically migrate between the primary and secondary head nodes during failover.

B.

Compute nodes must be powered on and performing work to initiate synchronization of the head nodes.

C.

After configuration is complete, simulate a failover by stopping BCM services on the active head node to verify that all services are running on the secondary node with no interruption.

D.

Configure both head nodes to use independent static IP addresses for BCM services instead of relying on a shared virtual IP address.

E.

During configuration, explicitly synchronize both the configuration and state data directories from the primary to the secondary head node to ensure consistency.

Question # 18

You are training a deep neural network using NCCL to coordinate communication across four GPUs in a single node. During early performance testing, you notice inconsistent scaling and longer-than-expected training times, even though all GPUs are being used. Which strategy would most effectively improve NCCL efficiency and collective operation performance in this setting?

A.

Adjust the batch size so that each GPU receives an equal-sized portion of the batch, ensuring all GPUs process similar workloads and communication is evenly distributed.

B.

Assign the largest possible workload to the first GPU to maximize its utilization, and allow the remaining GPUs to process smaller or variable batch sizes as needed.

C.

Disable automatic load balancing so that the deep learning framework can dynamically assign samples to any GPU available during each iteration.

D.

Increase the communication frequency between GPUs while allowing workloads to be unevenly split, so synchronization is more frequent and model updates happen faster.

Question # 19

After initial setup and health checks, the DGX H100 system administrator wants to verify that containers can access GPUs before running production workloads. Which method is recommended for this validation?

A.

sudo docker run --gpus all --rm nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04 systemctl

B.

sudo docker run --gpus all --rm nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04 ls -la

C.

sudo docker run --rm nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi

D.

sudo docker run --gpus all --rm nvcr.io/nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi

Question # 20

A systems engineer is updating firmware across a large DGX cluster using automation. What is the best practice for minimizing risk and ensuring cluster health during and after the process?

A.

Drain nodes from the scheduler, run pre-update diagnostics, update firmware in batches, and verify health post-update before scaling to the next batch.

B.

To save time, simultaneously update all nodes in the cluster without draining or diagnostics.

C.

Update nodes that have reported faults, leaving others on older firmware.

D.

Drain nodes from the scheduler, update firmware in batches, skip diagnostics and verify health post-update before scaling to the next batch.
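The drain/diagnose/batch/verify flow described in option A can be sketched as a rollout loop. The node names, batch size, and the commented-out steps below are placeholders for site-specific tooling, not actual DGX automation commands.

```shell
# Skeleton of a batched firmware rollout: process a few nodes at a time,
# verifying health before moving on. All names and sizes are placeholders.
NODES="dgx01 dgx02 dgx03 dgx04"
BATCH_SIZE=2
set -- $NODES
while [ "$#" -gt 0 ]; do
  batch=""
  n=0
  while [ "$n" -lt "$BATCH_SIZE" ] && [ "$#" -gt 0 ]; do
    batch="$batch $1"; shift; n=$((n + 1))
  done
  echo "updating batch:$batch"
  # here: drain from the scheduler, run pre-update diagnostics,
  # flash firmware, and verify health before starting the next batch
done
```

Batching this way bounds the blast radius: if a firmware revision misbehaves, only one batch of nodes is affected and the rollout can be halted before it reaches the rest of the cluster.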

Question # 21

During server maintenance, a system administrator wants to ensure that the NVIDIA DGX server has sufficient disk space for operational activities. The administrator is scripting an alert system that will notify the team if disk space falls below a threshold. Which command could be included in the maintenance script to check the available disk space on the server?

A.

nvidia-smi --query-disk-space

B.

du -sh /home/*

C.

df -h | grep '/var'

D.

lsof +L1
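The `df`-based check in option C extends naturally into the alert script the administrator describes. This is a minimal sketch; the 90% threshold and the root mount point are illustrative choices, not NVIDIA defaults.

```shell
# Sketch of a disk-space alert built on `df`; threshold and mount point
# are assumptions for illustration.
THRESHOLD=90
# df -P guarantees one POSIX-format line per filesystem; column 5 is "Use%".
USAGE=$(df -P / | awk 'NR==2 {gsub("%", "", $5); print $5}')
if [ "$USAGE" -ge "$THRESHOLD" ]; then
  echo "ALERT: / at ${USAGE}% capacity"
else
  echo "OK: / at ${USAGE}% capacity"
fi
```

In a maintenance script, the ALERT branch would hand off to the team's notification channel (mail, webhook, or monitoring agent) instead of echoing.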

Question # 22

A single-node stress test fails during the PCIe bandwidth validation phase. Which troubleshooting step is recommended first?

A.

Reduce PCIe Gen4 speed to Gen3 speed in BIOS settings.

B.

Reseat the GPU, then rerun the test.

C.

Disable NVLink in BIOS to isolate PCIe performance.

D.

Reinstall NVIDIA drivers using apt-get install nvidia-driver-550.

Question # 23

An enterprise IT team has completed the physical installation of an AI Factory with a Spectrum-X Ethernet network connected to all GPU servers. They now need to ensure the environment is ready for scalable AI workload deployment. What is the recommended sequence of validation steps?

A.

Set up Active Directory and LDAP, configure role-based access controls and security settings first, install users, and skip network or hardware performance validation.

B.

Perform application benchmarking first, use performance logs to identify bottlenecks, update switch and server firmware afterward, and then tune the network using performance tests.

C.

Validate the software stack, test link connectivity and port health, run network benchmarks, run OSPF, ensure neighbors are exchanging route information, then stage AI workload tests.

D.

Confirm switch and server firmware configuration, test link connectivity and port health, run network benchmarks, validate the software stack, then stage AI workload tests.

Question # 24

What is the primary purpose of running an NCCL burn-in test on a new GPU cluster?

A.

To test whether GPUs are properly detected by the operating system and have the correct drivers installed.

B.

To maximize GPU utilization for machine learning workloads and automatically tune deep learning frameworks.

C.

To detect and resolve hardware or interconnect issues before production by stressing GPU communication links.

D.

To benchmark application-specific runtime performance of AI models using real user data and production training scripts.

Question # 25

An engineer wants to verify that an NVIDIA GPU is accessible inside a Docker container for running deep learning workloads. The NVIDIA Container Toolkit is installed on a machine with working NVIDIA drivers. Which command demonstrates the correct way to run a container that can access all available GPUs?

A.

docker run --rm --runtime=docker nvidia/cuda nvidia-smi

B.

docker run --rm -it ubuntu:22.04 nvidia-smi

C.

docker run --rm --gpus all nvidia/cuda:12.4.6-base-ubuntu22.04 nvidia-smi

D.

docker run --rm nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

Question # 26

What is the purpose of using NCCL in verifying East-West fabric in an NVIDIA AI Factory?

Pick the 2 correct responses below.

A.

To measure the storage network performance.

B.

To measure the latency between GPUs.

C.

To measure the power consumption of GPUs.

D.

To measure bandwidth between GPUs.

Question # 27

If two ports must be connected, but one is SFP and one is QSFP, for example, to connect a 25 GbE Host Channel Adapter to a QSFP port capable of both 100 GbE and 25 GbE, which solution would best meet this requirement?

A.

QSA adapter.

B.

SFP connectors.

C.

SFP-to-1G BASE-T RJ45 adapter.

D.

Standard QSFP-to-QSFP DAC cable.

Question # 28

Which of the following steps are essential components of a recommended DGX cluster installation procedure?

Pick the 2 correct responses below.

A.

Group nodes by function during initial setup and assign them to relevant categories in the cluster management tool.

B.

Configure networking by validating all interfaces on each node, ensuring proper InfiniBand and Ethernet connectivity prior to installing cluster software.

C.

Install Slurm on the head node and then configure the compute nodes’ default OS images.

D.

Complete application containerization, run distributed jobs, and skip validation of node health or storage availability.

Question # 29

During a 48-hour NeMo question-answering model burn-in test, GPU memory errors occur when processing large datasets. Which configuration strategy prevents Out-of-Memory (OOM) errors while maintaining processing efficiency?

A.

Set blocksize="1GB" for data loading and enable RMM asynchronous allocation.

B.

Switch from FP16 to FP32 precision for numerical stability.

C.

Disable add_filename for Parquet files to reduce metadata.

D.

Increase files_per_partition to 1000 for larger batch processing.

Question # 30

You are evaluating the integration of NVIDIA BlueField DPUs into your data center's storage architecture to optimize AI workloads. The storage solution chosen has incorporated BlueField DPUs to enhance performance and efficiency. Which of the following benefits directly results from this integration?

A.

Unlimited scalability by adding more DPUs without architectural changes.

B.

Elimination of latency issues in data processing tasks.

C.

Reduced CPU load by offloading data processing tasks to DPUs.

D.

Enhanced I/O performance with NVMe storage access speeds.

Question # 31

A DGX server reports degraded performance and storage alerts. How would you use NVSM and nvidia-smi to troubleshoot both system and GPU issues?

A.

Use nvsm show health for a system health summary, nvsm show storage for storage issues, and nvidia-smi -q to get detailed GPU information.

B.

Run nvsm collect-stats to gather logs, use lsblk to understand if there are storage problems, and nvidia-smi -q to get detailed GPU information.

C.

Start by issuing nvidia-smi -L to list GPUs, followed by nvsm --refresh to clear all alerts, and nvidia-smi -q to get detailed GPU information.

D.

Run nvsm reset to restore system health, then use nvidia-smi --fix for automatic GPU repairs and status recovery.

Question # 32

You must validate all physical cabling as part of the network bring-up phase in a new NVIDIA GPU cluster deployment. The design requires you to confirm that each cable matches the intended topology, all links are functional, and future troubleshooting and scalability are supported. Which two steps are essential to an effective recommended cabling validation process during cluster deployment?

Pick the 2 correct responses below.

A.

Focus on validating the highest bandwidth links first, deferring non-critical cable mislabeling until after initial workloads are deployed and tested.

B.

Run link tests only after the entire network is built and powered on to avoid redundant troubleshooting during bring-up.

C.

Run the cable validation process incrementally during deployment, section by section, to catch and resolve errors as early as possible.

D.

Compare every cable’s physical connection to the planned topology diagram and validate correct ports and link paths.

Question # 33

A user needs to configure NGC CLI to access resources across multiple organizations. What is the recommended command syntax to achieve this?

A.

export NGC_CLI_ORG=org-name && ngc config set

B.

ngc config list to manually edit the JSON configuration file.

C.

ngc registry login --org org-name

D.

ngc config set --org org-name --ace ace-name

Question # 34

You are installing the operating system as part of the initial setup for a new NVIDIA Base Command Manager (BCM) cluster. Which two of the following actions are essential for a successful OS installation on the cluster's head node? (Pick the 2 correct responses below)

A.

Configure network switches for PXE boot to all compute nodes before installing the OS on the head node.

B.

Download the latest BCM ISO and verify its integrity using the provided checksum, then start the installation.

C.

Start the head node OS installation process with the system BIOS set to legacy boot mode instead of UEFI.

D.

Set the desired time zone and configure NTP synchronization during the OS installation wizard.
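The checksum verification step in option B can be sketched as follows. A tiny stand-in file replaces the real BCM ISO, and the `.sha256` companion file name is an assumption; in practice the checksum value comes from NVIDIA's download page.

```shell
# Illustrative integrity check; bcm.iso here is a stand-in file, and the
# published checksum would normally come from the vendor, not be generated
# locally as done below for demonstration.
printf 'iso contents\n' > bcm.iso
sha256sum bcm.iso > bcm.iso.sha256
sha256sum -c bcm.iso.sha256 && echo "checksum OK: proceed with installation"
rm -f bcm.iso bcm.iso.sha256
```

A failed `-c` check (nonzero exit) means the download is corrupt or tampered with and the installation should not proceed.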

Question # 35

If two ports must be connected, but one is SFP and one is QSFP, for example, to connect a 25 GbE Host Channel Adapter to a QSFP port capable of both 100 GbE and 25 GbE, which of the following solutions would best meet this requirement?

A.

SFP Connectors

B.

SFP to 1G BASE-T (RJ45) adapter

C.

QSA Adapter

Question # 36

You are a network administrator responsible for configuring an East-West (E/W) Spectrum-X fabric using SuperNIC. The BlueField-3 devices in your network should be set to NIC mode with RoCE enabled to optimize data flow between servers. You have access to the Spectrum-X management tools and the necessary documentation. You need to use specific configuration commands to achieve this setup. Which of the following steps and commands are necessary to configure the BlueField-3 devices in NIC mode for the E/W Spectrum-X fabric using SuperNIC? (Pick the 2 correct responses below)

A.

Use the command sudo mlxconfig -d /dev/mst/<device> set LINK_TYPE_P1=2 to enable Ethernet on the BlueField-3 devices.

B.

Use the command sudo mlxconfig -d /dev/mst/<device> set DISABLE_SPECTRUM_X=1 to reduce overhead.

C.

Use the command sudo mlxconfig -d /dev/mst/<device> set INTERNAL_CPU_OFFLOAD_ENGINE=1 to configure the SuperNIC to operate in NIC mode.

D.

Use the command sudo mlxconfig -d /dev/mst/<device> set DPU_MODE=1 to set up the BlueField-3 devices in DPU mode.

Full Access