Databricks-Certified-Professional-Data-Engineer Questions and Answers

Question # 6

The DevOps team has configured a production workload as a collection of notebooks scheduled to run daily using the Jobs UI. A new data engineering hire is onboarding to the team and has requested access to one of these notebooks to review the production logic.

What are the maximum notebook permissions that can be granted to the user without allowing accidental changes to production code or data?

Can manage

Can edit

Can run

Can read

Question # 7

The data engineering team has configured a job to process customer requests to be forgotten (have their data deleted). All user data that needs to be deleted is stored in Delta Lake tables using default table settings.

The team has decided to process all deletions from the previous week as a batch job at 1am each Sunday. The total duration of this job is less than one hour. Every Monday at 3am, a batch job executes a series of VACUUM commands on all Delta Lake tables throughout the organization.

The compliance officer has recently learned about Delta Lake's time travel functionality. They are concerned that this might allow continued access to deleted data.

Assuming all delete logic is correctly implemented, which statement correctly addresses this concern?

Because the vacuum command permanently deletes all files containing deleted records, deleted records may be accessible with time travel for around 24 hours.

Because the default data retention threshold is 24 hours, data files containing deleted records will be retained until the vacuum job is run the following day.

Because Delta Lake time travel provides full access to the entire history of a table, deleted records can always be recreated by users with full admin privileges.

Because Delta Lake's delete statements have ACID guarantees, deleted records will be permanently purged from all storage systems as soon as a delete job completes.

Because the default data retention threshold is 7 days, data files containing deleted records will be retained until the vacuum job is run 8 days later.
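
The timing in this scenario can be worked through with plain date arithmetic. The following is a minimal sketch, not Delta Lake code: it assumes the 7-day default retention threshold named in the option above, and the calendar dates and the helper name first_purging_vacuum are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: the 7-day default retention threshold named in the options above.
RETENTION = timedelta(days=7)

# Hypothetical calendar dates: 2024-01-07 is a Sunday, 2024-01-08 is a Monday.
delete_finished = datetime(2024, 1, 7, 2, 0)  # 1am job that runs for under an hour
first_vacuum = datetime(2024, 1, 8, 3, 0)     # first Monday 3am VACUUM after the delete


def first_purging_vacuum(invalidated_at, vacuum_at, retention=RETENTION):
    """Return the first weekly VACUUM run that can remove files invalidated at invalidated_at."""
    # A file is only eligible once it has been invalidated for longer than the retention threshold.
    while vacuum_at - invalidated_at <= retention:
        vacuum_at += timedelta(weeks=1)
    return vacuum_at


purged_at = first_purging_vacuum(delete_finished, first_vacuum)
print(purged_at)                            # 2024-01-15 03:00:00
print((purged_at - delete_finished).days)   # 8
```

Under these assumptions the Monday run one day after the delete is too early, so the files stay reachable through time travel until the following Monday's run.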

Question # 8

A DLT pipeline includes the following streaming tables:

raw_iot ingests raw device measurement data from a heart rate tracking device.

bpm_stats incrementally computes user statistics based on BPM measurements from raw_iot.

How can the data engineer configure this pipeline to be able to retain manually deleted or updated records in the raw_iot table while recomputing the downstream table when a pipeline update is run?

Set the skipChangeCommits flag to true on bpm_stats

Set the skipChangeCommits flag to true on raw_iot

Set the pipelines.reset.allowed property to false on bpm_stats

Set the pipelines.reset.allowed property to false on raw_iot

Question # 9

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.

Streaming DataFrame df has the following schema:

"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"

Code block:

Choose the response that correctly fills in the blank within the code block to complete this task.

to_interval("event_time", "5 minutes").alias("time")

window("event_time", "5 minutes").alias("time")

"event_time"

window("event_time", "10 minutes").alias("time")

lag("event_time", "10 minutes").alias("time")
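
To make "non-overlapping five-minute interval" concrete, the sketch below assigns each event to a tumbling window and averages per window in plain Python. It is an illustration of the bucketing idea only, not Spark code; the helper names and sample readings are hypothetical, and aligning windows to the Unix epoch is an assumption made for this sketch.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=5)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)  # assumed alignment point


def window_start(event_time, size=WINDOW):
    """Start of the tumbling (non-overlapping) window that contains event_time."""
    elapsed = event_time - EPOCH
    return EPOCH + (elapsed // size) * size


def average_by_window(events):
    """events: iterable of (event_time, temp, humidity) -> {window start: (avg temp, avg humidity)}."""
    buckets = defaultdict(list)
    for event_time, temp, humidity in events:
        buckets[window_start(event_time)].append((temp, humidity))
    return {
        start: (
            sum(t for t, _ in rows) / len(rows),
            sum(h for _, h in rows) / len(rows),
        )
        for start, rows in buckets.items()
    }


# Hypothetical readings: one event per minute for ten minutes.
base = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
events = [(base + timedelta(minutes=m), 20.0 + m, 50.0) for m in range(10)]
result = average_by_window(events)
```

Each event lands in exactly one bucket, so ten one-minute readings produce two windows of five readings each.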

Question # 10

The data governance team is reviewing code used for deleting records for compliance with GDPR. They note the following logic is used to delete records from the Delta Lake table named users.

Assuming that user_id is a unique identifying key and that delete_requests contains all users that have requested deletion, which statement describes whether successfully executing the above logic guarantees that the records to be deleted are no longer accessible and why?

Yes; Delta Lake ACID guarantees provide assurance that the delete command succeeded fully and permanently purged these records.

No; the Delta cache may return records from previous versions of the table until the cluster is restarted.

Yes; the Delta cache immediately updates to reflect the latest data files recorded to disk.

No; the Delta Lake delete command only provides ACID guarantees when combined with the merge into command.

No; files containing deleted records may still be accessible with time travel until a vacuum command is used to remove invalidated data files.

Question # 11

A junior data engineer on your team has implemented the following code block.

The view new_events contains a batch of records with the same schema as the events Delta table. The event_id field serves as a unique key for this table.

When this query is executed, what will happen with new records that have the same event_id as an existing record?

They are merged.

They are ignored.

They are updated.

They are inserted.

They are deleted.

Question # 12

A data pipeline uses Structured Streaming to ingest data from Kafka to Delta Lake. Data is being stored in a bronze table, and includes the Kafka-generated timestamp, key, and value. Three months after the pipeline is deployed, the data engineering team has noticed some latency issues during certain times of the day.

A senior data engineer updates the Delta table's schema and ingestion logic to include the current timestamp (as recorded by Apache Spark) as well as the Kafka topic and partition. The team plans to use the additional metadata fields to diagnose the transient processing delays.

Which limitation will the team face while diagnosing this problem?

New fields cannot be computed for historic records.

Updating the table schema will invalidate the Delta transaction log metadata.

Updating the table schema requires a default value provided for each file added.

Spark cannot capture the topic and partition fields from the Kafka source.

Question # 13

The view updates represents an incremental batch of all newly ingested data to be inserted or updated in the customers table.

The following logic is used to process these records.

MERGE INTO customers
USING (
  SELECT updates.customer_id AS merge_key, updates.*
  FROM updates
  UNION ALL
  SELECT NULL AS merge_key, updates.*
  FROM updates JOIN customers
  ON updates.customer_id = customers.customer_id
  WHERE customers.current = true AND updates.address <> customers.address
) staged_updates
ON customers.customer_id = merge_key
WHEN MATCHED AND customers.current = true AND customers.address <> staged_updates.address THEN
  UPDATE SET current = false, end_date = staged_updates.effective_date
WHEN NOT MATCHED THEN
  INSERT (customer_id, address, current, effective_date, end_date)
  VALUES (staged_updates.customer_id, staged_updates.address, true, staged_updates.effective_date, null)

Which statement describes this implementation?

The customers table is implemented as a Type 2 table; old values are overwritten and new customers are appended.

The customers table is implemented as a Type 1 table; old values are overwritten by new values and no history is maintained.

The customers table is implemented as a Type 2 table; old values are maintained but marked as no longer current and new values are inserted.

The customers table is implemented as a Type 0 table; all writes are append only with no changes to existing values.
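
The effect of the MERGE logic above on rows can be traced with a small in-memory model. This is a simplified sketch in plain Python, not Delta Lake code: the function name apply_scd2 and the sample rows are hypothetical, and it only models customers that have at most one current row.

```python
def apply_scd2(customers, updates):
    """Return a new list of rows with the updates applied, keeping history."""
    rows = [dict(row) for row in customers]  # copy so the input is not mutated
    for upd in updates:
        current = next(
            (r for r in rows if r["customer_id"] == upd["customer_id"] and r["current"]),
            None,
        )
        if current is not None and current["address"] == upd["address"]:
            continue  # address unchanged: neither MERGE branch fires
        if current is not None:
            # WHEN MATCHED: close out the old row instead of overwriting it.
            current["current"] = False
            current["end_date"] = upd["effective_date"]
        # WHEN NOT MATCHED: insert the new value as the current row
        # (this covers changed addresses and brand-new customers).
        rows.append({
            "customer_id": upd["customer_id"],
            "address": upd["address"],
            "current": True,
            "effective_date": upd["effective_date"],
            "end_date": None,
        })
    return rows


# Hypothetical sample data.
customers = [
    {"customer_id": 1, "address": "1 Old Rd", "current": True,
     "effective_date": "2023-01-01", "end_date": None},
]
updates = [
    {"customer_id": 1, "address": "9 New St", "effective_date": "2024-01-01"},
    {"customer_id": 2, "address": "5 Elm St", "effective_date": "2024-01-01"},
]
history = apply_scd2(customers, updates)
```

After the call, customer 1 has two rows (the old address flagged as no longer current with an end date, and the new address as current), and customer 2 has a single current row.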

Question # 14

A table is registered with the following code:

Both users and orders are Delta Lake tables. Which statement describes the results of querying recent_orders?

All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.

All logic will execute when the table is defined and store the result of joining tables to the DBFS; this stored data will be returned when the table is queried.

Results will be computed and cached when the table is defined; these cached results will incrementally update as new records are inserted into source tables.

All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.

The versions of each source table will be stored in the table transaction log; query results will be saved to DBFS with each query.

Question # 15

A data engineer is configuring a pipeline that will potentially see late-arriving, duplicate records.

In addition to de-duplicating records within the batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?

Set the configuration delta.deduplicate = true.

VACUUM the Delta table after each batch completes.

Perform an insert-only merge with a matching condition on a unique key.

Perform a full outer join on a unique key and overwrite existing data.

Rely on Delta Lake schema enforcement to prevent duplicate records.
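
The difference between de-duplicating within a batch and de-duplicating against previously processed records can be sketched with plain Python lists. This is an illustration of the idea behind an insert-only merge on a unique key, not Delta Lake code; the function names, the event_id key, and the sample rows are hypothetical.

```python
def dedupe_batch(batch, key):
    """Drop duplicates inside the incoming batch, keeping the first row per key."""
    unique = {}
    for row in batch:
        unique.setdefault(row[key], row)
    return list(unique.values())


def insert_only_merge(target, batch, key):
    """Insert only batch rows whose key is absent from target; matched rows are left untouched."""
    existing = {row[key] for row in target}
    return target + [row for row in batch if row[key] not in existing]


# Hypothetical data: event 1 arrives late as a duplicate, event 2 arrives twice in one batch.
target = [{"event_id": 1, "value": "original"}]
batch = [
    {"event_id": 1, "value": "late duplicate"},
    {"event_id": 2, "value": "new"},
    {"event_id": 2, "value": "new"},
]
merged = insert_only_merge(target, dedupe_batch(batch, "event_id"), "event_id")
print(merged)
# [{'event_id': 1, 'value': 'original'}, {'event_id': 2, 'value': 'new'}]
```

The two steps are kept separate on purpose: the merge step alone would insert both copies of event 2, because neither matches an existing row.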

Question # 16

A data engineer is performing a join operation to combine values from a static userLookup table with a streaming DataFrame streamingDF.

Which code block attempts to perform an invalid stream-static join?

userLookup.join(streamingDF, ["userid"], how="inner")

streamingDF.join(userLookup, ["user_id"], how="outer")

streamingDF.join(userLookup, ["user_id"], how="left")

streamingDF.join(userLookup, ["userid"], how="inner")

userLookup.join(streamingDF, ["user_id"], how="right")

Question # 17

A junior data engineer has manually configured a series of jobs using the Databricks Jobs UI. Upon reviewing their work, the engineer realizes that they are listed as the "Owner" for each job. They attempt to transfer "Owner" privileges to the "DevOps" group, but cannot successfully accomplish this task.

Which statement explains what is preventing this privilege transfer?

Databricks jobs must have exactly one owner; "Owner" privileges cannot be assigned to a group.

The creator of a Databricks job will always have "Owner" privileges; this configuration cannot be changed.

Other than the default "admins" group, only individual users can be granted privileges on jobs.

A user can only transfer job ownership to a group if they are also a member of that group.

Only workspace administrators can grant "Owner" privileges to a group.

Question # 18

A data architect has designed a system in which two Structured Streaming jobs will concurrently write to a single bronze Delta table. Each job is subscribing to a different topic from an Apache Kafka source, but they will write data with the same schema. To keep the directory structure simple, a data engineer has decided to nest a checkpoint directory to be shared by both streams.

The proposed directory structure is displayed below:

Which statement describes whether this checkpoint directory structure is valid for the given scenario and why?

No; Delta Lake manages streaming checkpoints in the transaction log.

Yes; both of the streams can share a single checkpoint directory.

No; only one stream can write to a Delta Lake table.

Yes; Delta Lake supports infinite concurrent writers.

No; each of the streams needs to have its own checkpoint directory.

Question # 19

Which statement regarding Spark configuration on the Databricks platform is true?

Spark configuration properties set for an interactive cluster with the Clusters UI will impact all notebooks attached to that cluster.

When the same Spark configuration property is set for an interactive to the same interactive cluster.

Spark configuration set within a notebook will affect all SparkSessions attached to the same interactive cluster.

The Databricks REST API can be used to modify the Spark configuration properties for an interactive cluster without interrupting jobs.

Question # 20

An upstream system has been configured to pass the date for a given batch of data to the Databricks Jobs API as a parameter. The notebook to be scheduled will use this parameter to load data with the following code:

df = spark.read.format("parquet").load(f"/mnt/source/{date}")

Which code block should be used to create the date Python variable used in the above code block?

date = spark.conf.get("date")

input_dict = input()

date = input_dict["date"]

import sys

date = sys.argv[1]

date = dbutils.notebooks.getParam("date")

dbutils.widgets.text("date", "null")

date = dbutils.widgets.get("date")

Answer: E

Explanation:

The code block that should be used to create the date Python variable used in the above code block is:

dbutils.widgets.text("date", "null")
date = dbutils.widgets.get("date")

This code block uses the dbutils.widgets API to create and read a text widget named "date" that accepts a string value as a parameter [1]. The default value of the widget is the string "null", so if no parameter is passed, the date variable will be "null". If a parameter named date is passed through the Databricks Jobs API, the date variable takes the value of that parameter. For example, if the parameter is "2021-11-01", the date variable will be "2021-11-01", and the notebook loads data from the corresponding path.

The other options are not correct:

Option A is incorrect because spark.conf.get("date") does not read a parameter passed through the Databricks Jobs API. The spark.conf API gets or sets Spark configuration properties, not notebook parameters [2].
Option B is incorrect because input() does not read a parameter passed through the Databricks Jobs API. The input() function reads user input from the standard input stream, not from the API request [3].
Option C is incorrect because sys.argv[1] does not read a parameter passed through the Databricks Jobs API. The sys.argv list holds the command-line arguments passed to a Python script, not to a notebook [4].
Option D is incorrect because dbutils.notebooks.getParam("date") is not a valid call. The notebook utility is dbutils.notebook, which is used to run and exit notebooks in a notebook workflow and has no getParam method [5].

References: [1] Widgets, [2] Spark Configuration, [3] input(), [4] sys.argv, [5] Notebooks
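
The default-versus-parameter behaviour described in the explanation can be tried outside Databricks with a stand-in object. The sketch below is illustrative only: FakeWidgets is a hypothetical class that mimics the text and get calls discussed above, and source_path is a hypothetical helper. It is not the real dbutils implementation.

```python
class FakeWidgets:
    """Hypothetical stand-in for dbutils.widgets, for running outside Databricks."""

    def __init__(self, job_parameters=None):
        self._job_parameters = dict(job_parameters or {})
        self._defaults = {}

    def text(self, name, default_value):
        # Register a text widget with its default value.
        self._defaults[name] = default_value

    def get(self, name):
        # A value passed by the job takes precedence over the widget default.
        return self._job_parameters.get(name, self._defaults[name])


def source_path(widgets):
    """Build the load path the same way the notebook code in the question does."""
    widgets.text("date", "null")
    date = widgets.get("date")
    return f"/mnt/source/{date}"


print(source_path(FakeWidgets({"date": "2021-11-01"})))  # /mnt/source/2021-11-01
print(source_path(FakeWidgets()))                        # /mnt/source/null
```

The second call shows why the default matters: with no parameter supplied, the path is built from the literal string "null".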

Question # 21

An external object storage container has been mounted to the location /mnt/finance_eda_bucket.

The following logic was executed to create a database for the finance team:

After the database was successfully created and permissions configured, a member of the finance team runs the following code:

If all users on the finance team are members of the finance group, which statement describes how the tx_sales table will be created?

A logical table will persist the query plan to the Hive Metastore in the Databricks control plane.

An external table will be created in the storage container mounted to /mnt/finance_eda_bucket.

A logical table will persist the physical plan to the Hive Metastore in the Databricks control plane.

A managed table will be created in the storage container mounted to /mnt/finance_eda_bucket.

A managed table will be created in the DBFS root storage container.

Question # 22

The data governance team is reviewing code used for deleting records for compliance with GDPR. The following logic has been implemented to propagate deletion requests from the user_lookup table to the user_aggregates table.

Assuming that user_id is a unique identifying key and that all users that have requested deletion have been removed from the user_lookup table, which statement describes whether successfully executing the above logic guarantees that the records to be deleted from the user_aggregates table are no longer accessible and why?

No: files containing deleted records may still be accessible with time travel until a VACUUM command is used to remove invalidated data files.

Yes: Delta Lake ACID guarantees provide assurance that the DELETE command succeeded fully and permanently purged these records.

No: the change data feed only tracks inserts and updates, not deleted records.

No: the Delta Lake DELETE command only provides ACID guarantees when combined with the MERGE INTO command.

Question # 23

The data science team has requested assistance in accelerating queries on free form text from user reviews. The data is currently stored in Parquet with the below schema:

item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING

The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify if any of 30 key words exist in this field.

A junior data engineer suggests converting this data to Delta Lake will improve query performance.

Which response to the junior data engineer's suggestion is correct?

Delta Lake statistics are not optimized for free text fields with high cardinality.

Text data cannot be stored with Delta Lake.

ZORDER ON review will need to be run to see performance gains.

The Delta log creates a term matrix for free text fields to support selective filtering.

Delta Lake statistics are only collected on the first 4 columns in a table.
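
Why file-level statistics help some filters and not keyword searches can be shown with a toy model of min/max data skipping. This is a sketch under stated assumptions, not Delta Lake's implementation: the FileStats class, the helper functions, and the sample review strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical per-file statistics for the review column."""
    name: str
    min_value: str
    max_value: str


def files_for_equality(files, value):
    """Files that might hold a row equal to value; every other file can be skipped."""
    return [f.name for f in files if f.min_value <= value <= f.max_value]


def files_for_keyword(files, keyword):
    """Files that might hold a row containing keyword."""
    # Min/max over whole strings carries no information about substrings,
    # so no file can be ruled out.
    return [f.name for f in files]


files = [
    FileStats("part-0", "Amazing battery life", "Broke after a week"),
    FileStats("part-1", "Cheap but works", "Great value"),
]
print(files_for_equality(files, "Fine for the price"))  # ['part-1']
print(files_for_keyword(files, "battery"))              # ['part-0', 'part-1']
```

In this model an exact-match filter prunes one of the two files, while the keyword search still has to read both.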

Question # 24

What is a method of installing a Python package scoped at the notebook level to all nodes in the currently active cluster?

Use %pip install in a notebook cell

Run source env/bin/activate in a notebook setup script

Install libraries from PyPI using the cluster UI

Use %sh install in a notebook cell

Question # 25

The data architect has mandated that all tables in the Lakehouse should be configured as external (also known as "unmanaged") Delta Lake tables.

Which approach will ensure that this requirement is met?

When a database is being created, make sure that the LOCATION keyword is used.

When configuring an external data warehouse for all table storage, leverage Databricks for all ELT.

When data is saved to a table, make sure that a full file path is specified alongside the Delta format.

When tables are created, make sure that the EXTERNAL keyword is used in the CREATE TABLE statement.

When the workspace is being configured, make sure that external cloud object storage has been mounted.

Question # 26

In order to facilitate near real-time workloads, a data engineer is creating a helper function to leverage the schema detection and evolution functionality of Databricks Auto Loader. The desired function will automatically detect the schema of the source directly, incrementally process JSON files as they arrive in a source directory, and automatically evolve the schema of the table when new fields are detected.

The function is displayed below with a blank:

Which response correctly fills in the blank to meet the specified requirements?

Option A

Option B

Option C

Option D

Option E

Question # 27

What statement is true regarding the retention of job run history?

It is retained until you export or delete job run logs

It is retained for 30 days, during which time you can deliver job run logs to DBFS or S3

It is retained for 60 days, during which you can export notebook run results to HTML

It is retained for 60 days, after which logs are archived

It is retained for 90 days or until the run-id is re-used through custom run configuration

Question # 28

A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used.

Which strategy will yield the best performance without shuffling data?

Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.

Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.

Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.

Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.

Set spark.sql.shuffle.partitions to 512, ingest the data, execute the narrow transformations, and then write to parquet.
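
The figure of 2,048 partitions that appears in several options comes from dividing the dataset size by the target part-file size. The arithmetic is spelled out below as a small sketch; it uses binary units (1 TB taken as 1024^4 bytes), which is an assumption consistent with the expression 1TB*1024*1024/512 in the options, and the variable names are illustrative.

```python
MB = 1024 ** 2  # bytes in a megabyte (binary units, assumed)
TB = 1024 ** 4  # bytes in a terabyte (binary units, assumed)

dataset_bytes = 1 * TB
target_file_bytes = 512 * MB

# Number of part-files needed if each one holds about 512 MB of input.
part_files = dataset_bytes // target_file_bytes
print(part_files)  # 2048
```

This only checks the arithmetic in the options. Actual Parquet file sizes on disk depend on compression and encoding, so they will generally differ from the size of the JSON input each partition reads.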

Question # 29

Although the Databricks Utilities Secrets module provides tools to store sensitive credentials and avoid accidentally displaying them in plain text, users should still be careful with which credentials are stored here and which users have access to these secrets.

Which statement describes a limitation of Databricks Secrets?

Because the SHA256 hash is used to obfuscate stored secrets, reversing this hash will display the value in plain text.

Account administrators can see all secrets in plain text by logging on to the Databricks Accounts console.

Secrets are stored in an administrators-only table within the Hive Metastore; database administrators have permission to query this table by default.

Iterating through a stored secret and printing each character will display secret contents in plain text.

The Databricks REST API can be used to list secrets in plain text if the personal access token has proper credentials.
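
The idea of iterating through a secret one character at a time can be illustrated with a toy output filter. The sketch below assumes a simplified redaction model in which only literal occurrences of the full secret value are replaced in each piece of output; the redact function and the sample value are hypothetical, and this is not Databricks' actual redaction logic.

```python
def redact(output, secret):
    """Replace literal occurrences of the full secret value in one piece of output."""
    return output.replace(secret, "[REDACTED]")


secret = "s3cr3t"  # hypothetical secret value

# Printing the whole value: the filter sees the full secret and hides it.
whole = redact(secret, secret)

# Printing one character at a time: no single output contains the full secret.
per_char = "".join(redact(ch, secret) for ch in secret)

print(whole)     # [REDACTED]
print(per_char)  # s3cr3t
```

Under this model the filter protects against accidental display, not against a user who is allowed to read the secret and deliberately works around it, which is why access to secrets still needs to be controlled.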

Question # 30

The view updates represents an incremental batch of all newly ingested data to be inserted or updated in the customers table.

The following logic is used to process these records.

Which statement describes this implementation?

The customers table is implemented as a Type 3 table; old values are maintained as a new column alongside the current value.

The customers table is implemented as a Type 2 table; old values are maintained but marked as no longer current and new values are inserted.

The customers table is implemented as a Type 0 table; all writes are append only with no changes to existing values.

The customers table is implemented as a Type 1 table; old values are overwritten by new values and no history is maintained.

The customers table is implemented as a Type 2 table; old values are overwritten and new customers are appended.

Question # 31

A data team's Structured Streaming job is configured to calculate running aggregates for item sales to update a downstream marketing dashboard. The marketing team has introduced a new field to track the number of times this promotion code is used for each item. A junior data engineer suggests updating the existing query as follows: Note that proposed changes are in bold.

Which step must also be completed to put the proposed query into production?

Increase the shuffle partitions to account for additional aggregates

Specify a new checkpointLocation

Run REFRESH TABLE delta.`/item_agg`

Remove .option('mergeSchema', 'true') from the streaming write

Question # 32

Which statement describes integration testing?

Validates interactions between subsystems of your application

Requires an automated testing framework

Requires manual intervention

Validates an application use case

Validates behavior of individual elements of your application

Question # 33

The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response indicating that the job run request has been submitted successfully includes a field run_id.

Which statement describes what the number alongside this field represents?

The job_id is returned in this field.

The job_id and number of times the job has been run are concatenated and returned.

The number of times the job definition has been run in the workspace.

The globally unique ID of the newly triggered run.

Question # 34

A team of data engineers is adding tables to a DLT pipeline that contain repetitive expectations for many of the same data quality checks.

One member of the team suggests reusing these data quality rules across all tables defined for this pipeline.

What approach would allow them to do this?

Maintain data quality rules in a Delta table outside of this pipeline’s target schema, providing the schema name as a pipeline parameter.

Use global Python variables to make expectations visible across DLT notebooks included in the same pipeline.

Add data quality constraints to tables in this pipeline using an external job with access to pipeline configuration files.

Maintain data quality rules in a separate Databricks notebook that each DLT notebook or file imports.

Question # 35

The data engineering team maintains the following code:

Assuming that this code produces logically correct results and the data in the source table has been de-duplicated and validated, which statement describes what will occur when this code is executed?

The silver_customer_sales table will be overwritten by aggregated values calculated from all records in the gold_customer_lifetime_sales_summary table as a batch job.

A batch job will update the gold_customer_lifetime_sales_summary table, replacing only those rows that have different values than the current version of the table, using customer_id as the primary key.

The gold_customer_lifetime_sales_summary table will be overwritten by aggregated values calculated from all records in the silver_customer_sales table as a batch job.

An incremental job will leverage running information in the state store to update aggregate values in the gold_customer_lifetime_sales_summary table.

An incremental job will detect if new rows have been written to the silver_customer_sales table; if new rows are detected, all aggregates will be recalculated and used to overwrite the gold_customer_lifetime_sales_summary table.

Question # 36

A junior data engineer is migrating a workload from a relational database system to the Databricks Lakehouse. The source system uses a star schema, leveraging foreign key constraints and multi-table inserts to validate records on write.

Which consideration will impact the decisions made by the engineer while migrating this workload?

All Delta Lake transactions are ACID compliant against a single table, and Databricks does not enforce foreign key constraints.

Databricks only allows foreign key constraints on hashed identifiers, which avoid collisions in highly-parallel writes.

Foreign keys must reference a primary key field; multi-table inserts must leverage Delta Lake's upsert functionality.

Committing to multiple tables simultaneously requires taking out multiple table locks and can lead to a state of deadlock.
