Databricks-Certified-Data-Engineer-Associate Questions and Answers

Question # 6

A data engineer is attempting to drop a Spark SQL table my_table and runs the following command:

DROP TABLE IF EXISTS my_table;

After running this command, the engineer notices that the data files and metadata files have been deleted from the file system.

Which of the following describes why all of these files were deleted?

The table was managed

The table's data was smaller than 10 GB

The table's data was larger than 10 GB

The table was external

The table did not have a location

Question # 7

Which tool is used by Auto Loader to process data incrementally?

Spark Structured Streaming

Unity Catalog

Checkpointing

Databricks SQL

Question # 8

A data engineering team has noticed that their Databricks SQL queries are running too slowly when they are submitted to a non-running SQL endpoint. The data engineering team wants this issue to be resolved.

Which of the following approaches can the team use to reduce the time it takes to return results in this scenario?

They can turn on the Serverless feature for the SQL endpoint and change the Spot Instance Policy to "Reliability Optimized."

They can turn on the Auto Stop feature for the SQL endpoint.

They can increase the cluster size of the SQL endpoint.

They can turn on the Serverless feature for the SQL endpoint.

They can increase the maximum bound of the SQL endpoint's scaling range.

Question # 9

Which of the following describes the storage organization of a Delta table?

Delta tables are stored in a single file that contains data, history, metadata, and other attributes.

Delta tables store their data in a single file and all metadata in a collection of files in a separate location.

Delta tables are stored in a collection of files that contain data, history, metadata, and other attributes.

Delta tables are stored in a collection of files that contain only the data stored within the table.

Delta tables are stored in a single file that contains only the data stored within the table.

Question # 10

A data engineer wants to create a relational object by pulling data from two tables. The relational object does not need to be used by other data engineers in other sessions. In order to save on storage costs, the data engineer wants to avoid copying and storing physical data.

Which of the following relational objects should the data engineer create?

Spark SQL Table

View

Database

Temporary view

Delta Table

Question # 11

Which of the following benefits is provided by the array functions from Spark SQL?

An ability to work with data in a variety of types at once

An ability to work with data within certain partitions and windows

An ability to work with time-related data in specified intervals

An ability to work with complex, nested data ingested from JSON files

An ability to work with an array of tables for procedural automation

Question # 12

A data engineer and data analyst are working together on a data pipeline. The data engineer is working on the raw, bronze, and silver layers of the pipeline using Python, and the data analyst is working on the gold layer of the pipeline using SQL. The raw source of the pipeline is a streaming input. They now want to migrate their pipeline to use Delta Live Tables.

Which of the following changes will need to be made to the pipeline when migrating to Delta Live Tables?

None of these changes will need to be made

The pipeline will need to stop using the medallion-based multi-hop architecture

The pipeline will need to be written entirely in SQL

The pipeline will need to use a batch source in place of a streaming source

The pipeline will need to be written entirely in Python

Question # 13

In order for Structured Streaming to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing, which of the following two approaches is used by Spark to record the offset range of the data being processed in each trigger?

Checkpointing and Write-ahead Logs

Structured Streaming cannot record the offset range of the data being processed in each trigger.

Replayable Sources and Idempotent Sinks

Write-ahead Logs and Idempotent Sinks

Checkpointing and Idempotent Sinks

Question # 14

A data engineer has a single-task Job that runs each morning before they begin working. After identifying an upstream data issue, they need to set up another task to run a new notebook prior to the original task.

Which of the following approaches can the data engineer use to set up the new task?

They can clone the existing task in the existing Job and update it to run the new notebook.

They can create a new task in the existing Job and then add it as a dependency of the original task.

They can create a new task in the existing Job and then add the original task as a dependency of the new task.

They can create a new job from scratch and add both tasks to run concurrently.

They can clone the existing task to a new Job and then edit it to run the new notebook.

Question # 15

Identify how the count_if function and the count where x is null can be used

Consider a table random_values with below data.

What would be the output of below query?

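select count_if(col1 > 1) as count_a, count(*) as count_b, count(col1) as count_c from random_values

[Table data not recoverable: random_values has a single column, col1, which contains at least one NULL value.]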

3 6 5

4 6 5

3 6 6

4 6 6

Question # 16

A data engineer and data analyst are working together on a data pipeline. The data engineer is working on the raw, bronze, and silver layers of the pipeline using Python, and the data analyst is working on the gold layer of the pipeline using SQL. The raw source of the pipeline is a streaming input. They now want to migrate their pipeline to use Delta Live Tables.

Which change will need to be made to the pipeline when migrating to Delta Live Tables?

The pipeline can have different notebook sources in SQL & Python.

The pipeline will need to be written entirely in SQL.

The pipeline will need to be written entirely in Python.

The pipeline will need to use a batch source in place of a streaming source.

Question # 17

A data engineer needs access to a table new_table, but they do not have the correct permissions. They can ask the table owner for permission, but they do not know who the table owner is.

Which approach can be used to identify the owner of new_table?

There is no way to identify the owner of the table

Review the Owner field in the table's page in the cloud storage solution

Review the Permissions tab in the table's page in Data Explorer

Review the Owner field in the table’s page in Data Explorer

Question # 18

Identify the impact of ON VIOLATION DROP ROW and ON VIOLATION FAIL UPDATE for a constraint violation.

A data engineer has created an ETL pipeline using Delta Live Tables to manage their company's travel reimbursement details. They want to ensure that if the location details have not been provided by the employee, the pipeline is terminated.

How can the scenario be implemented?

CONSTRAINT valid_location EXPECT (location = NULL)

CONSTRAINT valid_location EXPECT (location != NULL) ON VIOLATION FAIL UPDATE

CONSTRAINT valid_location EXPECT (location != NULL) ON DROP ROW

CONSTRAINT valid_location EXPECT (location != NULL) ON VIOLATION FAIL

Question # 19

A Delta Live Table pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE.

The pipeline is configured to run in Development mode using the Continuous Pipeline Mode.

Assuming previously unprocessed data exists and all definitions are valid, what is the expected outcome after clicking Start to update the pipeline?

All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated.

All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist until the pipeline is shut down.

All datasets will be updated once and the pipeline will persist without any processing. The compute resources will persist but go unused.

All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing.

All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.

Question # 20

A data analyst has created a Delta table sales that is used by the entire data analysis team. They want help from the data engineering team to implement a series of tests to ensure the data is clean. However, the data engineering team uses Python for its tests rather than SQL.

Which of the following commands could the data engineering team use to access sales in PySpark?

SELECT * FROM sales

There is no way to share data between PySpark and SQL.

spark.sql("sales")

spark.delta.table("sales")

spark.table("sales")

Question # 21

A data engineer needs to create a table in Databricks using data from their organization's existing SQLite database. They run the following command:

CREATE TABLE jdbc_customer360
USING ____
OPTIONS (
  url "jdbc:sqlite:/customers.db",
  dbtable "customer360"
)

Which line of code fills in the above blank to successfully complete the task?

autoloader

org.apache.spark.sql.jdbc

sqlite

org.apache.spark.sql.sqlite

Question # 22

Which of the following describes a scenario in which a data engineer will want to use a single-node cluster?

When they are working interactively with a small amount of data

When they are running automated reports to be refreshed as quickly as possible

When they are working with SQL within Databricks SQL

When they are concerned about the ability to automatically scale with larger data

When they are manually running reports with a large amount of data

Answer:

Explanation:

A data engineer will want to use a single-node cluster when they are working interactively with a small amount of data. A single-node cluster consists of an Apache Spark driver and no Spark workers. It supports Spark jobs and all Spark data sources, including Delta Lake. It is helpful for single-node machine learning workloads that use Spark to load and save data, and for lightweight exploratory data analysis. A single-node cluster runs Spark locally, spawns one executor thread per logical core in the cluster, and saves all log output in the driver log. It can be created by selecting the Single Node button when configuring a cluster.

The other options do not call for a single-node cluster. When running automated reports to be refreshed as quickly as possible, a data engineer will want a multi-node cluster that can scale up and down automatically based on workload demand. When working with SQL within Databricks SQL, a data engineer will want a SQL endpoint, which is the compute resource that runs Databricks SQL queries. When concerned about the ability to automatically scale with larger data, a data engineer will want a multi-node cluster, which can add workers to handle large-scale data processing. When manually running reports with a large amount of data, a data engineer will want a multi-node cluster that can distribute the computation across multiple workers, and can use the Spark UI to monitor performance and troubleshoot issues.
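To make "a driver and no workers" concrete, the sketch below builds a cluster specification as a plain Python dictionary. The field names and the `singleNode` / `local[*]` settings reflect how single-node clusters are commonly specified through the Clusters API, but they are an assumption here and should be checked against the current Databricks documentation. The helper name, cluster name, runtime version, and node type are hypothetical placeholders.

```python
def single_node_cluster_spec(name, spark_version, node_type_id):
    """Build a dict describing a cluster with a driver and zero workers.

    Assumption: the keys below follow the Clusters API shape; verify them
    against the Databricks documentation before use.
    """
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        # No Spark workers: all work runs on the driver node.
        "num_workers": 0,
        "spark_conf": {
            "spark.databricks.cluster.profile": "singleNode",
            # Run Spark locally, one executor thread per logical core.
            "spark.master": "local[*]",
        },
        "custom_tags": {"ResourceClass": "SingleNode"},
    }


# Placeholder values; substitute a real runtime version and node type.
spec = single_node_cluster_spec("eda-sandbox", "<runtime-version>", "<node-type>")
print(spec["num_workers"], spec["spark_conf"]["spark.master"])
```

The dictionary only describes the cluster; sending it to a workspace would require an authenticated API call, which is omitted here.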

Question # 23

A data engineer has created a new database using the following command:

CREATE DATABASE IF NOT EXISTS customer360;

In which of the following locations will the customer360 database be located?

dbfs:/user/hive/database/customer360

dbfs:/user/hive/warehouse

dbfs:/user/hive/customer360

More information is needed to determine the correct response

Question # 24

A data engineer has been given a new record of data:

id STRING = 'a1'

rank INTEGER = 6

rating FLOAT = 9.4

Which of the following SQL commands can be used to append the new record to an existing Delta table my_table?

INSERT INTO my_table VALUES ('a1', 6, 9.4)

my_table UNION VALUES ('a1', 6, 9.4)

INSERT VALUES ( 'a1' , 6, 9.4) INTO my_table

UPDATE my_table VALUES ('a1', 6, 9.4)

UPDATE VALUES ('a1', 6, 9.4) my_table

Question # 25

The Delta transaction log for the ‘students’ table is shown using the ‘DESCRIBE HISTORY students’ command. A Data Engineer needs to query the table as it existed before the UPDATE operation listed in the log.

Which command should the Data Engineer use to achieve this? (Choose two.)

SELECT * FROM students@v4

SELECT * FROM students TIMESTAMP AS OF '2024-04-22T14:32:47.000+00:00'

SELECT * FROM students FROM HISTORY VERSION AS OF 3

SELECT * FROM students VERSION AS OF 5

SELECT * FROM students TIMESTAMP AS OF '2024-04-22T14:32:58.000+00:00'

Question # 26

Which type of workloads are compatible with Auto Loader?

Streaming workloads

Machine learning workloads

Serverless workloads

Batch workloads

Question # 27

Which two components function in the Databricks platform architecture’s control plane? (Choose two.)

Virtual Machines

Compute Orchestration

Serverless Compute

Compute

Unity Catalog

Question # 28

What is stored in a Databricks customer's cloud account?

Data

Cluster management metadata

Databricks web application

Notebooks

Question # 29

Which of the following commands can be used to write data into a Delta table while avoiding the writing of duplicate records?

DROP

IGNORE

MERGE

APPEND

INSERT

Question # 30

Which of the following SQL keywords can be used to convert a table from a long format to a wide format?

PIVOT

CONVERT

WHERE

TRANSFORM

SUM

Question # 31

In which of the following scenarios should a data engineer select a Task in the Depends On field of a new Databricks Job Task?

When another task needs to be replaced by the new task

When another task needs to fail before the new task begins

When another task has the same dependency libraries as the new task

When another task needs to use as little compute resources as possible

When another task needs to successfully complete before the new task begins

Answer:

Explanation:

A data engineer can create a multi-task job in Databricks that consists of multiple tasks that run in a specific order. Each task can have one or more dependencies, which are other tasks that must run before the current task. The Depends On field of a new Databricks Job Task allows the data engineer to specify the dependencies of the task. The data engineer should select a task in the Depends On field when they want the new task to run only after the selected task has successfully completed. This can help the data engineer to create a logical sequence of tasks that depend on each other’s outputs or results. For example, a data engineer can create a multi-task job that consists of the following tasks:

Task A: Ingest data from a source using Auto Loader

Task B: Transform the data using Spark SQL

Task C: Write the data to a Delta Lake table

Task D: Analyze the data using Spark ML

Task E: Visualize the data using Databricks SQL

In this case, the data engineer can set the dependencies of each task as follows:

Task A: No dependencies

Task B: Depends on Task A

Task C: Depends on Task B

Task D: Depends on Task C

Task E: Depends on Task D

This way, the data engineer can ensure that each task runs only after the previous task has successfully completed, and the data flows smoothly from ingestion to visualization.
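The dependency chain above can be sketched in plain Python. The task list below imitates the `task_key` / `depends_on` shape used in multi-task job definitions, which is an assumption about the payload format rather than a verified schema. The task keys are hypothetical names for Tasks A through E. The standard-library `graphlib.TopologicalSorter` (Python 3.9+) stands in for the scheduler: it only computes an order in which every task follows its dependencies and does not run anything on Databricks.

```python
from graphlib import TopologicalSorter

# Hypothetical task keys standing in for Tasks A-E above.
tasks = [
    {"task_key": "ingest", "depends_on": []},
    {"task_key": "transform", "depends_on": [{"task_key": "ingest"}]},
    {"task_key": "write_delta", "depends_on": [{"task_key": "transform"}]},
    {"task_key": "analyze", "depends_on": [{"task_key": "write_delta"}]},
    {"task_key": "visualize", "depends_on": [{"task_key": "analyze"}]},
]


def run_order(tasks):
    """Return task keys ordered so that every dependency comes first."""
    # Map each task to the set of tasks that must complete before it.
    graph = {
        t["task_key"]: {d["task_key"] for d in t["depends_on"]}
        for t in tasks
    }
    return list(TopologicalSorter(graph).static_order())


print(run_order(tasks))
```

Because each task here depends on exactly one predecessor, only one valid order exists. With branching dependencies, several orders can be valid, and independent tasks could run in parallel.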

The other options are incorrect because they do not describe valid scenarios for selecting a task in the Depends On field. The Depends On field does not affect the following aspects of a task:

Whether the task needs to be replaced by another task

Whether the task needs to fail before another task begins

Whether the task has the same dependency libraries as another task

Whether the task needs to use as little compute resources as possible

References: Create a multi-task job, Run tasks conditionally in a Databricks job, Databricks Jobs.

Question # 32

A data engineer has been using a Databricks SQL dashboard to monitor the cleanliness of the input data to a data analytics dashboard for a retail use case. The job has a Databricks SQL query that returns the number of store-level records where sales is equal to zero. The data engineer wants their entire team to be notified via a messaging webhook whenever this value is greater than 0.

Which of the following approaches can the data engineer use to notify their entire team via a messaging webhook whenever the number of stores with $0 in sales is greater than zero?

They can set up an Alert with a custom template.

They can set up an Alert with a new email alert destination.

They can set up an Alert with one-time notifications.

They can set up an Alert with a new webhook alert destination.

They can set up an Alert without notifications.
