
Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Questions and Answers

Question # 6

A data engineer noticed improved performance after upgrading from Spark 3.0 to Spark 3.5. The engineer found that Adaptive Query Execution (AQE) was enabled.

Which operation does AQE perform to improve performance?

A.

Dynamically switching join strategies

B.

Collecting persistent table statistics and storing them in the metastore for future use

C.

Improving the performance of single-stage Spark jobs

D.

Optimizing the layout of Delta files on disk

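For context, a minimal sketch of AQE's runtime join-strategy switching; the DataFrames and column names are invented for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.adaptive.enabled", "true")  # AQE is on by default in recent Spark releases

large_df = spark.range(10_000_000).withColumnRenamed("id", "key")
small_df = spark.range(100).withColumnRenamed("id", "key")

# With AQE enabled, a join initially planned as a sort-merge join can be replaced
# with a broadcast hash join at runtime once shuffle statistics show one side is small.
large_df.join(small_df, "key").explain()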
Question # 7

A data engineer is working on the DataFrame:

(Referring to the table image: it has columns Id, Name, count, and timestamp.)

Which code fragment should the engineer use to extract the unique values in the Name column into an alphabetically ordered list?

A.

df.select("Name").orderBy(df["Name"].asc())

B.

df.select("Name").distinct().orderBy(df["Name"])

C.

df.select("Name").distinct()

D.

df.select("Name").distinct().orderBy(df["Name"].desc())

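For reference, a minimal sketch of pulling a de-duplicated, alphabetically sorted column into a Python list, assuming an active SparkSession named spark (the rows are invented):

df = spark.createDataFrame([(1, "Bob"), (2, "Alice"), (3, "Bob")], ["Id", "Name"])
# distinct() removes duplicate names; orderBy() sorts ascending by default
names = [row["Name"] for row in df.select("Name").distinct().orderBy("Name").collect()]
print(names)  # ['Alice', 'Bob']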
Question # 8

A Spark developer is building an app to monitor task performance. They need to track the maximum task processing time per worker node and consolidate it on the driver for analysis.

Which technique should be used?

A.

Use an RDD action like reduce() to compute the maximum time

B.

Use an accumulator to record the maximum time on the driver

C.

Broadcast a variable to share the maximum time among workers

D.

Configure the Spark UI to automatically collect maximum times

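As background, a small sketch of consolidating per-task values on the driver with an RDD action, assuming an active SparkSession named spark (the timing values are invented):

# Hypothetical per-task processing times in milliseconds
task_times = spark.sparkContext.parallelize([120, 340, 95, 410, 230])
# reduce() is an action: partial maxima are computed on the executors and the final value returns to the driver
max_time = task_times.reduce(lambda a, b: max(a, b))
print(max_time)  # 410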
Question # 9

A data scientist is working on a project that requires processing large amounts of structured data, performing SQL queries, and applying machine learning algorithms. The data scientist is considering using Apache Spark for this task.

Which combination of Apache Spark modules should the data scientist use in this scenario?

Options:

A.

Spark DataFrames, Structured Streaming, and GraphX

B.

Spark SQL, Pandas API on Spark, and Structured Streaming

C.

Spark Streaming, GraphX, and Pandas API on Spark

D.

Spark DataFrames, Spark SQL, and MLlib

Question # 10

A data engineer needs to persist a file-based data source to a specific location. However, by default, Spark writes to the warehouse directory (e.g., /user/hive/warehouse). To override this, the engineer must explicitly define the file path.

Which line of code ensures the data is saved to a specific location?

Options:

A.

users.write(path="/some/path").saveAsTable("default_table")

B.

users.write.saveAsTable("default_table").option("path", "/some/path")

C.

users.write.option("path", "/some/path").saveAsTable("default_table")

D.

users.write.saveAsTable("default_table", path="/some/path")

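A brief sketch of writing a table to an explicit path, assuming an active SparkSession named spark (the DataFrame, path, and table name are illustrative):

users = spark.createDataFrame([(1, "anna"), (2, "omar")], ["id", "name"])
# option("path", ...) must be set on the DataFrameWriter before saveAsTable() is invoked
users.write.option("path", "/tmp/example_users").saveAsTable("example_users")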
Question # 11

A Data Analyst needs to retrieve employees with 5 or more years of tenure.

Which code snippet filters and shows the list?

A.

employees_df.filter(employees_df.tenure >= 5).show()

B.

employees_df.where(employees_df.tenure >= 5)

C.

filter(employees_df.tenure >= 5)

D.

employees_df.filter(employees_df.tenure >= 5).collect()

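A short sketch of the filter-and-display pattern, assuming an active SparkSession named spark (the rows are invented):

employees_df = spark.createDataFrame([("Ana", 7), ("Ben", 3)], ["name", "tenure"])
# filter() (alias: where()) is a transformation; show() is the action that prints the rows
employees_df.filter(employees_df.tenure >= 5).show()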
Question # 12

A data engineer writes the following code to join two DataFrames df1 and df2:

df1 = spark.read.csv("sales_data.csv")    # ~10 GB
df2 = spark.read.csv("product_data.csv")  # ~8 MB
result = df1.join(df2, df1.product_id == df2.product_id)

Which join strategy will Spark use?

A.

Shuffle join, because AQE is not enabled, and Spark uses a static query plan

B.

Broadcast join, as df2 is smaller than the default broadcast threshold

C.

Shuffle join, as the size difference between df1 and df2 is too large for a broadcast join to work efficiently

D.

Shuffle join because no broadcast hints were provided

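For context, a hedged sketch of inspecting the broadcast threshold and the chosen join strategy, assuming an active SparkSession named spark (the file names follow the question; header handling is an assumption):

# Default size below which Spark auto-broadcasts the smaller join side (10 MB unless overridden)
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

df1 = spark.read.csv("sales_data.csv", header=True)    # large side in the scenario (~10 GB)
df2 = spark.read.csv("product_data.csv", header=True)  # small side in the scenario (~8 MB)

# explain() reveals whether the physical plan uses BroadcastHashJoin or SortMergeJoin
df1.join(df2, df1.product_id == df2.product_id).explain()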
Question # 13

A developer wants to test Spark Connect with an existing Spark application.

What are the two alternative ways the developer can start a local Spark Connect server without changing their existing application code? (Choose 2 answers)

A.

Execute their pyspark shell with the option --remote "https://localhost"

B.

Execute their pyspark shell with the option --remote "sc://localhost"

C.

Set the environment variable SPARK_REMOTE="sc://localhost" before starting the pyspark shell

D.

Add .remote("sc://localhost") to their SparkSession.builder calls in their Spark code

E.

Ensure the Spark property spark.connect.grpc.binding.port is set to 15002 in the application code

Question # 14

How can a Spark developer ensure optimal resource utilization when running Spark jobs in Local Mode for testing?

Options:

A.

Configure the application to run in cluster mode instead of local mode.

B.

Increase the number of local threads based on the number of CPU cores.

C.

Use the spark.dynamicAllocation.enabled property to scale resources dynamically.

D.

Set the spark.executor.memory property to a large value.

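An illustrative sketch of matching local-mode parallelism to the machine's CPU cores (the application name is arbitrary):

from pyspark.sql import SparkSession

# local[*] starts one worker thread per available core; local[8] would pin it to exactly 8 threads
spark = SparkSession.builder.master("local[*]").appName("local-testing").getOrCreate()
print(spark.sparkContext.defaultParallelism)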
Question # 15

A developer notices that all the post-shuffle partitions in a dataset are smaller than the value set for spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold.

Which type of join will Adaptive Query Execution (AQE) choose in this case?

A.

A Cartesian join

B.

A shuffled hash join

C.

A broadcast nested loop join

D.

A sort-merge join

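For reference, a hedged sketch of the AQE settings involved, assuming an active SparkSession named spark (the threshold value is illustrative, not a recommendation):

# AQE must be enabled for runtime join-strategy decisions
spark.conf.set("spark.sql.adaptive.enabled", "true")
# When every post-shuffle partition is smaller than this threshold, AQE can prefer
# a shuffled hash join over a sort-merge join
spark.conf.set("spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold", "64MB")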
Question # 16

An MLOps engineer is building a Pandas UDF that applies a language model that translates English strings into Spanish. The initial code is loading the model on every call to the UDF, which is hurting the performance of the data pipeline.

The initial code is:

def in_spanish_inner(df: pd.Series) -> pd.Series:
    model = get_translation_model(target_lang='es')
    return df.apply(model)

in_spanish = sf.pandas_udf(in_spanish_inner, StringType())

How can the MLOps engineer change this code to reduce how many times the language model is loaded?

A.

Convert the Pandas UDF to a PySpark UDF

B.

Convert the Pandas UDF from a Series → Series UDF to a Series → Scalar UDF

C.

Run the in_spanish_inner() function in a mapInPandas() function call

D.

Convert the Pandas UDF from a Series → Series UDF to an Iterator[Series] → Iterator[Series] UDF

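As background, a hedged sketch of the Iterator[Series] → Iterator[Series] pandas UDF pattern, which runs setup code once per task's stream of batches rather than once per batch (get_translation_model is the hypothetical helper from the question):

from typing import Iterator
import pandas as pd
import pyspark.sql.functions as sf
from pyspark.sql.types import StringType

@sf.pandas_udf(StringType())
def in_spanish(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    model = get_translation_model(target_lang='es')  # loaded once, reused for every batch below
    for batch in batches:
        yield batch.apply(model)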
Question # 17

A data scientist at a financial services company is working with a Spark DataFrame containing transaction records. The DataFrame has millions of rows and includes columns for transaction_id, account_number, transaction_amount, and timestamp. Due to an issue with the source system, some transactions were accidentally recorded multiple times with identical information across all fields. The data scientist needs to remove rows with duplicates across all fields to ensure accurate financial reporting.

Which approach should the data scientist use to deduplicate these records using PySpark?

A.

df = df.dropDuplicates()

B.

df = df.groupBy("transaction_id").agg(F.first("account_number"), F.first("transaction_amount"), F.first("timestamp"))

C.

df = df.filter(F.col("transaction_id").isNotNull())

D.

df = df.dropDuplicates(["transaction_amount"])

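A tiny sketch contrasting full-row deduplication with subset-based deduplication, assuming an active SparkSession named spark (the rows are invented):

data = [(1, "ACC1", 9.99, "2024-01-01"),
        (1, "ACC1", 9.99, "2024-01-01"),
        (2, "ACC2", 9.99, "2024-01-02")]
df = spark.createDataFrame(data, ["transaction_id", "account_number", "transaction_amount", "timestamp"])
# dropDuplicates() with no arguments compares every column, so only fully identical rows are removed
print(df.dropDuplicates().count())                        # 2
# A column subset would also drop rows that merely share those values
print(df.dropDuplicates(["transaction_amount"]).count())  # 1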
Question # 18

Which Spark configuration controls the number of tasks that can run in parallel on the executor?

Options:

A.

spark.executor.cores

B.

spark.task.maxFailures

C.

spark.driver.cores

D.

spark.executor.memory

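For reference, a hedged example of setting the per-executor task slots when building a session (the value is illustrative):

from pyspark.sql import SparkSession

# Each executor runs up to spark.executor.cores tasks concurrently (one task per core)
spark = (SparkSession.builder
    .config("spark.executor.cores", "4")
    .getOrCreate())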
Question # 19

Given a CSV file with the content:

And the following code:

from pyspark.sql.types import *

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())
])

spark.read.schema(schema).csv(path).collect()

What is the resulting output?

A.

[Row(name='bambi'), Row(name='alladin', age=20)]

B.

[Row(name='alladin', age=20)]

C.

[Row(name='bambi', age=None), Row(name='alladin', age=20)]

D.

The code throws an error due to a schema mismatch.

Question # 20

A data engineer wants to process a streaming DataFrame that receives sensor readings every second with columns sensor_id, temperature, and timestamp. The engineer needs to calculate the average temperature for each sensor over the last 5 minutes while the data is streaming.

Which code implementation achieves the requirement?

(The answer options A-D are code snippets provided as images.)

A.

Option A

B.

Option B

C.

Option C

D.

Option D

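Since the image options are not reproduced here, below is a hedged sketch of the kind of windowed streaming aggregation the question describes; the rate source and derived columns are stand-ins for the real sensor feed, and the watermark is an added assumption:

from pyspark.sql import functions as F

readings = (spark.readStream
    .format("rate")  # placeholder source emitting (timestamp, value) rows every second
    .load()
    .selectExpr("value % 10 AS sensor_id", "rand() * 40 AS temperature", "timestamp"))

avg_temps = (readings
    .withWatermark("timestamp", "5 minutes")
    .groupBy(F.window("timestamp", "5 minutes"), "sensor_id")
    .agg(F.avg("temperature").alias("avg_temperature")))

query = avg_temps.writeStream.outputMode("append").format("console").start()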
Question # 21

A data engineer is working on a real-time analytics pipeline using Apache Spark Structured Streaming. The engineer wants to process incoming data and ensure that triggers control when the query is executed. The system needs to process data in micro-batches with a fixed interval of 5 seconds.

Which code snippet could the data engineer use to fulfill this requirement?

(The answer options A-D are code snippets provided as images; the trigger call used in each is summarized below.)

Options:

A.

Uses trigger(continuous='5 seconds') – continuous processing mode.

B.

Uses trigger() – default micro-batch trigger without interval.

C.

Uses trigger(processingTime='5 seconds') – correct micro-batch trigger with interval.

D.

Uses trigger(processingTime=5000) – invalid, as processingTime expects a string.

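A hedged sketch of a micro-batch trigger with a fixed 5-second interval, assuming an active SparkSession named spark (the rate source and console sink are placeholders):

events = spark.readStream.format("rate").load()

query = (events.writeStream
    .format("console")
    .trigger(processingTime="5 seconds")  # start a new micro-batch every 5 seconds
    .start())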
Question # 22

A developer needs to produce a Python dictionary using data stored in a small Parquet table, which looks like this:

The resulting Python dictionary must contain a mapping of region -> region_id for the smallest 3 region_id values.

Which code fragment meets the requirements?

A.

regions = dict(
    regions_df
    .select('region', 'region_id')
    .sort('region_id')
    .take(3)
)

B.

regions = dict(
    regions_df
    .select('region_id', 'region')
    .sort('region_id')
    .take(3)
)

C.

regions = dict(
    regions_df
    .select('region_id', 'region')
    .limit(3)
    .collect()
)

D.

regions = dict(
    regions_df
    .select('region', 'region_id')
    .sort(desc('region_id'))
    .take(3)
)

Question # 23

A data analyst wants to add a column named date derived from a timestamp column. Which code snippet should the analyst use?

Options:

A.

dates_df.withColumn("date", f.unix_timestamp("timestamp")).show()

B.

dates_df.withColumn("date", f.to_date("timestamp")).show()

C.

dates_df.withColumn("date", f.date_format("timestamp", "yyyy-MM-dd")).show()

D.

dates_df.withColumn("date", f.from_unixtime("timestamp")).show()

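A small sketch, with an invented row, showing that to_date() produces a DateType column while date_format() produces a string, assuming an active SparkSession named spark:

from pyspark.sql import functions as f

dates_df = (spark.createDataFrame([("2024-03-15 10:30:00",)], ["timestamp"])
    .withColumn("timestamp", f.to_timestamp("timestamp")))

dates_df.withColumn("date", f.to_date("timestamp")).printSchema()                     # date: date
dates_df.withColumn("date", f.date_format("timestamp", "yyyy-MM-dd")).printSchema()   # date: string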
Question # 24

A data scientist is working on a large dataset in Apache Spark using PySpark. The data scientist has a DataFrame df with columns user_id, product_id, and purchase_amount and needs to perform some operations on this data efficiently.

Which sequence of operations results in transformations that require a shuffle followed by transformations that do not?

A.

df.filter(df.purchase_amount > 100).groupBy("user_id").sum("purchase_amount")

B.

df.withColumn("discount", df.purchase_amount * 0.1).select("discount")

C.

df.withColumn("purchase_date", current_date()).where("total_purchase > 50")

D.

df.groupBy("user_id").agg(sum("purchase_amount").alias("total_purchase")).repartition(10)

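For background, a hedged sketch of telling wide from narrow steps by inspecting the physical plan, assuming an active SparkSession named spark (the rows are invented):

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, 10, 120.0), (1, 11, 40.0), (2, 12, 75.0)],
    ["user_id", "product_id", "purchase_amount"])

# groupBy()/agg() is a wide transformation: its plan contains an Exchange (shuffle)
df.groupBy("user_id").agg(F.sum("purchase_amount").alias("total_purchase")).explain()
# filter() and withColumn() are narrow transformations: no Exchange appears
df.filter(df.purchase_amount > 100).withColumn("discount", df.purchase_amount * 0.1).explain()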
Question # 25

Which feature of Spark Connect should be considered when designing an application that requires remote interaction with a Spark cluster?

A.

It provides a way to run Spark applications remotely in any programming language

B.

It can be used to interact with any remote cluster using the REST API

C.

It allows for remote execution of Spark jobs

D.

It is primarily used for data ingestion into Spark from external sources

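As a point of reference, a hedged sketch of connecting a client to a remote Spark Connect endpoint (the host is a placeholder; 15002 is the default gRPC port):

from pyspark.sql import SparkSession

# Plans are built on the client and executed by the remote Spark Connect server
spark = SparkSession.builder.remote("sc://spark-connect-host:15002").getOrCreate()
spark.range(5).show()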