Databricks-Certified-Professional-Data-Engineer Practice Exam Questions and Answers

Databricks Certified Data Engineer Professional Exam

Last Update 3 days ago
Total Questions : 96



Question # 1

A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that the Min, Median, and Max Durations for tasks in a particular stage show the minimum and median time to complete a task as roughly the same, but the max duration for a task to be roughly 100 times as long as the minimum.

Which situation is causing increased duration of the overall job?

Options:

Task queueing resulting from improper thread pool assignment.

Spill resulting from attached volume storage being too small.

Network latency due to some cluster nodes being in different regions from the source data.

Skew caused by more data being assigned to a subset of Spark partitions.

Credential validation errors while pulling data from an external system.
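
The Min, Median, and Max pattern described in the question can be reproduced with a small simulation. This is only a sketch in plain Python, not Spark code: it assumes one task per partition and a task duration proportional to the partition's row count. The function name, row counts, and throughput figure are all hypothetical.

```python
from statistics import median


def summarize_task_durations(partition_rows, rows_per_second=1_000_000):
    """Return (min, median, max) durations, assuming one task per partition
    and a duration proportional to the number of rows in that partition."""
    durations = [rows / rows_per_second for rows in partition_rows]
    return min(durations), median(durations), max(durations)


# Hypothetical stage: 199 evenly sized partitions plus one holding 100x the rows.
partition_rows = [1_000_000] * 199 + [100_000_000]
lo, mid, hi = summarize_task_durations(partition_rows)
skew_ratio = hi / mid
print(f"min={lo:.1f}s median={mid:.1f}s max={hi:.1f}s max/median={skew_ratio:.0f}x")
```

Under these assumptions a single oversized partition leaves the minimum and median untouched while the maximum grows in proportion to the extra rows, and the stage cannot finish before that slowest task does.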

Question # 2

The downstream consumers of a Delta Lake table have been complaining about data quality issues impacting performance in their applications. Specifically, they have complained that invalid latitude and longitude values in the activity_details table have been breaking their ability to use other geolocation processes.

A junior engineer has written the following code to add CHECK constraints to the Delta Lake table:

A senior engineer has confirmed the above logic is correct and the valid ranges for latitude and longitude are provided, but the code fails when executed.

Which statement explains the cause of this failure?

Options:

Because another team uses this table to support a frequently running application, two-phase locking is preventing the operation from committing.

The activity_details table already exists; CHECK constraints can only be added during initial table creation.

The activity_details table already contains records that violate the constraints; all existing data must pass CHECK constraints in order to add them to an existing table.

The activity_details table already contains records; CHECK constraints can only be added prior to inserting values into a table.

The current table schema does not contain the field valid_coordinates; schema evolution will need to be enabled before altering the table to add a constraint.
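
When a range constraint is about to be applied to a table that already holds data, it can be useful to count the rows that fall outside the range first. The sketch below does this in plain Python over a few hypothetical rows. It assumes the usual geographic bounds (latitude from -90 to 90, longitude from -180 to 180), since the question's own code is not reproduced here, and the helper name is made up for illustration.

```python
def violates_valid_coordinates(row):
    """True when a row falls outside the assumed latitude/longitude bounds."""
    lat_ok = -90 <= row["latitude"] <= 90
    lon_ok = -180 <= row["longitude"] <= 180
    return not (lat_ok and lon_ok)


# Hypothetical existing records in the table.
rows = [
    {"latitude": 37.77, "longitude": -122.42},
    {"latitude": 91.0, "longitude": 10.0},   # latitude out of range
    {"latitude": 45.0, "longitude": 181.5},  # longitude out of range
]
violations = [r for r in rows if violates_valid_coordinates(r)]
print(f"{len(violations)} of {len(rows)} rows violate the range check")
```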

Question # 3

Which statement describes integration testing?

Options:

Validates interactions between subsystems of your application

Requires an automated testing framework

Requires manual intervention

Validates an application use case

Validates behavior of individual elements of your application

Question # 4

Which statement describes Delta Lake Auto Compaction?

Options:

An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 1 GB.

Before a Jobs cluster terminates, optimize is executed on all tables modified during the most recent job.

Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.

Data is queued in a messaging bus instead of committing data directly to memory; all data is committed from the messaging bus in one batch once the job is complete.

An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 128 MB.

Question # 5

Incorporating unit tests into a PySpark application requires upfront attention to the design of your jobs, or a potentially significant refactoring of existing code.

Which statement describes a main benefit that offsets this additional effort?

Options:

Improves the quality of your data

Validates a complete use case of your application

Troubleshooting is easier since all steps are isolated and tested individually

Yields faster deployment and execution times

Ensures that all steps interact correctly to achieve the desired end result

Question # 6

When scheduling Structured Streaming jobs for production, which configuration automatically recovers from query failures and keeps costs low?

Options:

Cluster: New Job Cluster; Retries: Unlimited; Maximum Concurrent Runs: Unlimited

Cluster: New Job Cluster; Retries: None; Maximum Concurrent Runs: 1

Cluster: Existing All-Purpose Cluster; Retries: Unlimited; Maximum Concurrent Runs: 1

Cluster: New Job Cluster; Retries: Unlimited; Maximum Concurrent Runs: 1

Cluster: Existing All-Purpose Cluster; Retries: None; Maximum Concurrent Runs: 1

Question # 7

A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using display() calls to confirm code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively.

Which of the following adjustments will get a more accurate measure of how code is likely to perform in production?

Options:

Scala is the only language that can be accurately tested using interactive notebooks; because the best performance is achieved by using Scala code compiled to JARs, all PySpark and Spark SQL logic should be refactored.

The only way to meaningfully troubleshoot code execution times in development notebooks is to use production-sized data and production-sized clusters with Run All execution.

Production code development should only be done using an IDE; executing code against a local build of open source Spark and Delta Lake will provide the most accurate benchmarks for how code will perform in production.

Calling display() forces a job to trigger, while many transformations will only add to the logical query plan; because of caching, repeated execution of the same logic does not provide meaningful results.

The Jobs UI should be leveraged to occasionally run the notebook as a job and track execution time during incremental code development because Photon can only be enabled on clusters launched for scheduled jobs.

Question # 8

Each configuration below is identical to the extent that each cluster has 400 GB total of RAM, 160 total cores and only one Executor per VM.

Given a job with at least one wide transformation, which of the following cluster configurations will result in maximum performance?

Options:

Total VMs: 1; 400 GB per Executor; 160 Cores / Executor

Total VMs: 8; 50 GB per Executor; 20 Cores / Executor

Total VMs: 4; 100 GB per Executor; 40 Cores / Executor

Total VMs: 2; 200 GB per Executor; 80 Cores / Executor
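
The arithmetic behind the premise can be checked quickly: with one executor per VM, every listed configuration multiplies out to the same 400 GB and 160 cores. The short Python check below uses only the figures from the options. The variable names are illustrative.

```python
# (total VMs, GB per executor, cores per executor), with one executor per VM.
configs = [(1, 400, 160), (8, 50, 20), (4, 100, 40), (2, 200, 80)]

# Multiply per-executor resources by the VM count to get cluster totals.
totals = [(vms * gb, vms * cores) for vms, gb, cores in configs]

for (vms, gb, cores), (total_gb, total_cores) in zip(configs, totals):
    print(f"{vms} VM(s): {total_gb} GB total, {total_cores} cores total, "
          f"{gb / cores} GB per core")
```

Since the totals and the memory per core are identical, the configurations differ only in how many executors the shuffle for a wide transformation has to span and in how large each executor's memory footprint is.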

Question # 9

Which configuration parameter directly affects the size of a Spark partition upon ingestion of data into Spark?

Options:

spark.sql.files.maxPartitionBytes

spark.sql.autoBroadcastJoinThreshold

spark.sql.files.openCostInBytes

spark.sql.adaptive.coalescePartitions.minPartitionNum

spark.sql.adaptive.advisoryPartitionSizeInBytes
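
To make the relationship between file-source settings and input split size concrete, the sketch below models it in plain Python. Treat it as an assumption-laden simplification of the open source Spark file source logic: the function names are hypothetical, `min_partition_num=8` stands in for the cluster's default parallelism, and the later step that packs splits into partitions is omitted. The 128 MiB and 4 MiB defaults correspond to `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes`.

```python
import math

MIB = 1024 * 1024


def max_split_bytes(file_sizes, max_partition_bytes=128 * MIB,
                    open_cost_in_bytes=4 * MIB, min_partition_num=8):
    """Simplified model of the target split size for a file scan."""
    total_bytes = sum(size + open_cost_in_bytes for size in file_sizes)
    bytes_per_core = total_bytes // min_partition_num
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))


def estimated_splits(file_sizes, **conf):
    """Estimate how many splits the files are cut into (bin-packing omitted)."""
    split = max_split_bytes(file_sizes, **conf)
    return sum(math.ceil(size / split) for size in file_sizes)


files = [10 * 1024 * MIB]  # one hypothetical 10 GiB splittable file
default_splits = estimated_splits(files)
larger_splits = estimated_splits(files, max_partition_bytes=256 * MIB)
print(default_splits, larger_splits)
```

In this model, doubling the maximum bytes per partition halves the number of input splits for a large splittable file.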

Question # 10

A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.

The user_ltv table has the following schema:

email STRING, age INT, ltv INT

The following view definition is executed:

An analyst who is not a member of the marketing group executes the following query:

SELECT * FROM email_ltv

Which statement describes the results returned by this query?

Options:

Three columns will be returned, but one column will be named "redacted" and contain only null values.

Only the email and ltv columns will be returned; the email column will contain all null values.

The email and ltv columns will be returned with the values in user_ltv.

The email, age, and ltv columns will be returned with the values in user_ltv.

Only the email and ltv columns will be returned; the email column will contain the string "REDACTED" in each row.

Question # 11

Which of the following is true of Delta Lake and the Lakehouse?

Options:

Because Parquet compresses data row by row, strings will only be compressed when a character is repeated multiple times.

Delta Lake automatically collects statistics on the first 32 columns of each table which are leveraged in data skipping based on query filters.

Views in the Lakehouse maintain a valid cache of the most recent versions of source tables at all times.

Primary and foreign key constraints can be leveraged to ensure duplicate values are never entered into a dimension table.

Z-order can only be applied to numeric values stored in Delta Lake tables.

Question # 12

Which REST API call can be used to review the notebooks configured to run as tasks in a multi-task job?

Options:

/jobs/runs/list

/jobs/runs/get-output

/jobs/runs/get

/jobs/get

/jobs/list

Question # 13

Which distribution does Databricks support for installing custom Python code packages?

Options:

sbt

CRAN

CRAM

npm

Wheels

jars

Question # 14

Although the Databricks Utilities Secrets module provides tools to store sensitive credentials and avoid accidentally displaying them in plain text, users should still be careful with which credentials are stored here and which users have access to these secrets.

Which statement describes a limitation of Databricks Secrets?

Options:

Because the SHA256 hash is used to obfuscate stored secrets, reversing this hash will display the value in plain text.

Account administrators can see all secrets in plain text by logging on to the Databricks Accounts console.

Secrets are stored in an administrators-only table within the Hive Metastore; database administrators have permission to query this table by default.

Iterating through a stored secret and printing each character will display secret contents in plain text.

The Databricks REST API can be used to list secrets in plain text if the personal access token has proper credentials.

Question # 15

A Delta table of weather records is partitioned by date and has the below schema:

date DATE, device_id INT, temp FLOAT, latitude FLOAT, longitude FLOAT

To find all the records from within the Arctic Circle, you execute a query with the below filter:

latitude > 66.3

Which statement describes how the Delta engine identifies which files to load?

Options:

All records are cached to an operational database and then the filter is applied

The Parquet file footers are scanned for min and max statistics for the latitude column

All records are cached to attached storage and then the filter is applied

The Delta log is scanned for min and max statistics for the latitude column

The Hive metastore is scanned for min and max statistics for the latitude column
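
File-level min and max statistics are what allow a filter such as `latitude > 66.3` to exclude whole files without reading them. The sketch below shows only the pruning logic, in plain Python with hypothetical file names and statistics. It makes no claim about where an engine stores or reads those statistics.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    path: str
    min_latitude: float
    max_latitude: float


def files_to_read(file_stats, lower_bound):
    """Keep only files whose max latitude could satisfy `latitude > lower_bound`."""
    return [f.path for f in file_stats if f.max_latitude > lower_bound]


# Hypothetical per-file statistics for a table partitioned by date.
stats = [
    FileStats("date=2024-01-01/part-0.parquet", -45.0, 10.2),
    FileStats("date=2024-01-01/part-1.parquet", 12.5, 71.9),
    FileStats("date=2024-01-02/part-0.parquet", 66.3, 66.3),
    FileStats("date=2024-01-02/part-1.parquet", 70.0, 89.5),
]
selected = files_to_read(stats, 66.3)
print(selected)
```

Because the filter is on latitude rather than on the partition column, partition pruning by date does not help here. Only the per-file value ranges narrow the scan.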

Question # 16

An upstream system has been configured to pass the date for a given batch of data to the Databricks Jobs API as a parameter. The notebook to be scheduled will use this parameter to load data with the following code:

df = spark.read.format("parquet").load(f"/mnt/source/{date}")

Which code block should be used to create the date Python variable used in the above code block?

Options:

date = spark.conf.get("date")

input_dict = input()
date = input_dict["date"]

import sys
date = sys.argv[1]

date = dbutils.notebooks.getParam("date")

dbutils.widgets.text("date", "null")
date = dbutils.widgets.get("date")

Question # 17

An hourly batch job is configured to ingest data files from a cloud object storage container where each batch represents all records produced by the source system in a given hour. The batch job to process these records into the Lakehouse is sufficiently delayed to ensure no late-arriving data is missed. The user_id field represents a unique key for the data, which has the following schema:

user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN, last_updated BIGINT

New records are all ingested into a table named account_history which maintains a full record of all data in the same schema as the source. The next table in the system is named account_current and is implemented as a Type 1 table representing the most recent value for each unique user_id.

Assuming there are millions of user accounts and tens of thousands of records processed hourly, which implementation can be used to efficiently update the described account_current table as part of each hourly batch job?

Options:

Use Auto Loader to subscribe to new files in the account_history directory; configure a Structured Streaming trigger once job to batch update newly detected files into the account_current table.

Overwrite the account_current table with each batch using the results of a query against the account_history table grouping by user_id and filtering for the max value of last_updated.

Filter records in account_history using the last_updated field and the most recent hour processed, as well as the max last_login by user_id; write a merge statement to update or insert the most recent value for each user_id.

Use Delta Lake version history to get the difference between the latest version of account_history and one version prior, then write these records to account_current.

Filter records in account_history using the last_updated field and the most recent hour processed, making sure to deduplicate on username; write a merge statement to update or insert the most recent value for each username.
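
The mechanics of a Type 1 update, keeping one current row per key and overwriting it when a newer record arrives, can be sketched without Spark. The Python below models an upsert over in-memory dictionaries. The helper names and sample rows are hypothetical, and a real implementation would express the same logic as a merge against the Delta table.

```python
def latest_per_user(batch):
    """Deduplicate a batch, keeping the row with the highest last_updated per user_id."""
    latest = {}
    for row in batch:
        seen = latest.get(row["user_id"])
        if seen is None or row["last_updated"] > seen["last_updated"]:
            latest[row["user_id"]] = row
    return latest


def merge_type1(account_current, batch):
    """Upsert deduplicated rows: overwrite matched user_ids, insert new ones."""
    merged = dict(account_current)
    merged.update(latest_per_user(batch))
    return merged


# Hypothetical current state keyed by user_id, and one hourly batch.
account_current = {
    1: {"user_id": 1, "username": "ana", "last_updated": 100},
    2: {"user_id": 2, "username": "bo", "last_updated": 110},
}
hourly_batch = [
    {"user_id": 1, "username": "ana", "last_updated": 150},
    {"user_id": 1, "username": "ana_r", "last_updated": 170},
    {"user_id": 3, "username": "cy", "last_updated": 160},
]
result = merge_type1(account_current, hourly_batch)
print(sorted(result), result[1]["username"])
```

Only the keys present in the hourly batch are touched, so the work scales with the tens of thousands of incoming records rather than with the millions of existing accounts.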

Question # 18

A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day. At present, during normal execution, each microbatch of data is processed in less than 3 seconds. During peak hours of the day, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of 10 seconds.

Holding all other variables constant and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement?

Options:

Decrease the trigger interval to 5 seconds; triggering batches more frequently allows idle executors to begin processing the next batch while longer running tasks from previous batches finish.

Increase the trigger interval to 30 seconds; setting the trigger interval near the maximum execution time observed for each batch is always best practice to ensure no records are dropped.

The trigger interval cannot be modified without modifying the checkpoint directory; to maintain the current stream state, increase the number of shuffle partitions to maximize parallelism.

Use the trigger once option and configure a Databricks job to execute the query every 10 seconds; this ensures all backlogged records are processed with each batch.

Decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent records from backing up and large batches from causing spill.
