Google Professional Machine Learning Engineer Exam Dumps & Practice Test Questions
Question 1:
You are developing a machine learning pipeline to identify anomalies in sensor data as it streams in real-time. Incoming data is handled via Pub/Sub. You need to store results for analytics and dashboarding purposes.
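Which pipeline configuration is most appropriate, where stage 1 processes the stream, stage 2 runs ML inference, and stage 3 stores results for analytics?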
A. 1 = Dataflow, 2 = AI Platform, 3 = BigQuery
B. 1 = Dataproc, 2 = AutoML, 3 = Cloud Bigtable
C. 1 = BigQuery, 2 = AutoML, 3 = Cloud Functions
D. 1 = BigQuery, 2 = AI Platform, 3 = Cloud Storage
Answer: A
Explanation:
Designing a real-time anomaly detection pipeline on Google Cloud involves three critical stages: data ingestion and processing, applying machine learning inference, and storing data for further analysis. Option A uses a combination of services that best fits these needs.
First, Dataflow is a fully managed, scalable service ideal for processing streaming data in real time. It integrates seamlessly with Pub/Sub, the messaging service that collects the incoming sensor data. Dataflow can clean, filter, and transform the raw sensor streams, and can even invoke ML models for inference inline. This makes it perfect for anomaly detection on live data.
Second, AI Platform (now part of Vertex AI) handles the machine learning workload. After training an anomaly detection model on historical data, you deploy it there to serve predictions. Dataflow sends data points to the model endpoint and receives anomaly scores, enabling real-time detection.
Third, BigQuery acts as the analytics data warehouse. It’s optimized for running complex queries on large datasets, making it excellent for storing both raw sensor data and anomaly detection results. Data analysts and visualization tools (such as Looker Studio or Tableau) can then query BigQuery to generate reports, track trends, and explore anomalies over time.
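Why are the other options less suitable? Dataproc (Option B) runs cluster-based Spark and Hadoop jobs and is a weaker fit than Dataflow for managed streaming from Pub/Sub. AutoML is a convenient way to train models but isn’t designed for continuous, low-latency inference. Cloud Bigtable is a NoSQL database designed for high-throughput key-based access, not analytical querying. BigQuery (stage 1 in Options C and D) is a data warehouse rather than a stream processor, so it cannot clean and transform messages arriving from Pub/Sub. Cloud Functions (stage 3 in Option C) is event-driven compute, not a store for analytics, and Cloud Storage (stage 3 in Option D) is object storage that is not built for the interactive queries dashboards need.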
In conclusion, the combination of Dataflow for real-time processing, AI Platform for ML inference, and BigQuery for storage and analytics (Option A) offers a scalable, efficient, and analytics-ready solution for anomaly detection in streaming sensor data.
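The three stages can be sketched in plain Python to show how data moves between them. This is only an illustration under stated assumptions: the function names, the message fields (sensor_id, value), and the z-score stand-in for the deployed model are all hypothetical, and a real pipeline would use Dataflow, a model endpoint, and a BigQuery table instead of local functions.

```python
import json

def parse_message(data: bytes) -> dict:
    """Stage 1 (the role Dataflow plays): decode and validate one Pub/Sub payload."""
    reading = json.loads(data.decode("utf-8"))
    return {"sensor_id": str(reading["sensor_id"]), "value": float(reading["value"])}

def score(reading: dict, mean: float, std: float) -> float:
    """Stage 2 (the role the model endpoint plays): a z-score as a stand-in model."""
    return abs(reading["value"] - mean) / std

def to_row(reading: dict, anomaly_score: float, threshold: float = 3.0) -> dict:
    """Stage 3 (the role BigQuery plays): shape one row for the results table."""
    return {**reading, "anomaly_score": anomaly_score,
            "is_anomaly": anomaly_score > threshold}

# Hypothetical payloads, as they might arrive from a Pub/Sub subscription.
messages = [
    b'{"sensor_id": "s1", "value": 20.5}',
    b'{"sensor_id": "s1", "value": 48.0}',
]

rows = [to_row(r, score(r, mean=20.0, std=2.0)) for r in map(parse_message, messages)]
for row in rows:
    print(row)
```

Keeping each stage as a separate function mirrors the separation of services in Option A: the scoring step can be swapped for a call to a deployed model without touching parsing or storage.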
Question 2:
Your company operates an internal shuttle service with stops throughout the city every 30 minutes from 7 AM to 10 AM. A Kubernetes-based app requires users to confirm their pickup station one day before.
What is the best strategy to optimize shuttle routing?
A. Develop a tree-based regression model predicting passenger counts at each stop, then dispatch shuttles accordingly.
B. Create a tree-based classification model predicting whether the shuttle should stop at each station, then dispatch accordingly.
C. Formulate the optimal route as the shortest path covering all stations with confirmed riders under capacity limits, then dispatch and map stops accordingly.
D. Use reinforcement learning combined with tree-based models to predict passenger presence and optimize routes, then dispatch shuttles accordingly.
Answer: C
Explanation:
This situation involves optimizing shuttle routes based on confirmed passenger attendance, which is already provided by the user confirmations from the app. Since the number and location of passengers are deterministic (known beforehand), the problem becomes one of route optimization rather than prediction.
Option C correctly identifies the problem as a variation of the Vehicle Routing Problem (VRP) or Traveling Salesman Problem (TSP) with capacity constraints. The solution is to compute the shortest route that visits every stop with confirmed riders while keeping the number of riders within shuttle capacity. This approach reduces unnecessary stops, fuel consumption, and travel time, improving operational efficiency.
Option A suggests predicting passenger numbers via regression. However, since actual attendance is already confirmed, this prediction is redundant and introduces unnecessary complexity. Option B’s classification model predicting whether a shuttle should stop is similarly unneeded, as the app already provides exact stop information.
Option D proposes a reinforcement learning (RL) approach, which involves complex sequential decision-making and learning through trial and error. While RL is powerful in uncertain or dynamic environments, it is an overengineered solution here because passenger presence is known upfront and static for the scheduling period.
Using a deterministic route optimization solver (such as Google OR-Tools) provides a straightforward, interpretable, and computationally efficient solution. It finds a shortest or near-shortest route covering all required stops under capacity constraints, making the shuttle service both cost-effective and timely.
Therefore, Option C is the most logical and efficient solution since it leverages confirmed data directly and uses well-established optimization techniques to solve the routing problem.
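To make the idea concrete, here is a minimal sketch of the routing step for a single shuttle. It uses brute-force enumeration from the Python standard library rather than OR-Tools, which is only practical for a handful of stops. The station names, travel times, and capacity are hypothetical values chosen for illustration.

```python
from itertools import permutations

# Hypothetical travel times in minutes between the depot and four stations.
TRAVEL_MIN = {
    ("depot", "A"): 5, ("depot", "B"): 9, ("depot", "C"): 7, ("depot", "D"): 12,
    ("A", "B"): 4, ("A", "C"): 8, ("A", "D"): 10,
    ("B", "C"): 3, ("B", "D"): 6,
    ("C", "D"): 5,
}

def travel(a, b):
    """Symmetric lookup of the travel time between two locations."""
    return TRAVEL_MIN[(a, b)] if (a, b) in TRAVEL_MIN else TRAVEL_MIN[(b, a)]

def best_route(confirmed, capacity):
    """Return (stops_in_order, total_minutes) for the shortest depot round trip
    that visits every station with at least one confirmed rider."""
    if sum(confirmed.values()) > capacity:
        raise ValueError("confirmed riders exceed capacity; another shuttle is needed")
    stops = [s for s, riders in confirmed.items() if riders > 0]
    if not stops:
        return [], 0
    best = None
    for order in permutations(stops):
        path = ("depot",) + order + ("depot",)
        minutes = sum(travel(a, b) for a, b in zip(path, path[1:]))
        if best is None or minutes < best[1]:
            best = (list(order), minutes)
    return best

# Confirmations collected the day before; station B has no riders and is skipped.
confirmed = {"A": 3, "B": 0, "C": 5, "D": 2}
route, minutes = best_route(confirmed, capacity=12)
print(route, minutes)
```

No prediction is involved: the confirmed counts are the input, and the only work left is the search over stop orderings. With more stops or several shuttles, a dedicated solver would replace the enumeration.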
Question 3:
You are tasked with analyzing sensor data from a production line to investigate component failures. The dataset shows that failure events (positive examples) make up less than 1% of the data. After trying to train various classification models, none of them converge properly.
How should you address this class imbalance issue to improve model training?
A. Adjust the dataset to generate 10% positive examples based on the class distribution.
B. Apply a convolutional neural network with max pooling and softmax activation.
C. Use downsampling combined with upweighting to create a training set with 10% positive examples.
D. Remove negative examples until the numbers of positive and negative samples are equal.
Answer: C
Explanation:
This problem illustrates a classic case of extreme class imbalance, where the failure events (minority class) are vastly outnumbered by normal operation data (majority class). In such situations, standard classification models tend to ignore the rare class because they achieve high accuracy by simply predicting the dominant class. This results in poor detection of failure cases, which defeats the purpose of the analysis.
The key challenge is to help the model focus sufficiently on the minority class without losing important information from the majority class. Option C is the best approach because it combines two effective strategies:
Downsampling: By reducing the number of majority class (non-failure) examples, the training set becomes more balanced. This prevents the model from being overwhelmed by negative examples while still preserving a representative subset.
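Upweighting: This technique assigns the downsampled (majority) class an example weight equal to the factor by which it was downsampled. For example, if you keep 1 in 20 negative examples, each kept negative gets a weight of 20. This keeps the model’s predictions calibrated to the true class distribution even though failures make up a larger share of the training set.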
Together, these techniques create a more balanced dataset (approximately 10% failures) without discarding too much valuable data or causing overfitting. This balance allows the model to converge and learn meaningful patterns for failure detection.
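The following sketch shows one way to build such a training set in plain Python. It follows the convention in which the weight is attached to the downsampled (negative) class, so each kept negative stands in for the negatives that were dropped. The function name, the (features, label) tuple layout, and the synthetic data are assumptions made for illustration.

```python
import random

def downsample_and_upweight(examples, target_pos_fraction=0.10, seed=0):
    """examples: list of (features, label), label 1 = failure, 0 = normal.
    Returns a list of (features, label, weight) tuples."""
    rng = random.Random(seed)
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]
    if not positives:
        raise ValueError("no positive examples to balance against")
    # Negatives to keep so that positives make up the target fraction.
    keep = round(len(positives) * (1 - target_pos_fraction) / target_pos_fraction)
    keep = min(keep, len(negatives))
    sampled = rng.sample(negatives, keep)
    # Each kept negative represents this many original negatives.
    neg_weight = len(negatives) / keep
    return ([(x, y, 1.0) for x, y in positives]
            + [(x, y, neg_weight) for x, y in sampled])

# Synthetic data: 10,000 examples, 50 failures (0.5% positive).
examples = [((i,), 1 if i < 50 else 0) for i in range(10000)]
train = downsample_and_upweight(examples)

n_pos = sum(1 for _, y, _ in train if y == 1)
print(len(train), n_pos, n_pos / len(train))
```

The third element of each tuple would be passed as a per-example weight to whatever training routine is used, so the loss still reflects the original class proportions.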
Other options are less effective:
A is vague and risks generating synthetic data without clear methodology, potentially causing overfitting or skewed representations.
B suggests using CNNs, which are better suited for image or spatial data, and does not directly solve class imbalance.
D involves excessive removal of negative samples, losing useful information and harming the model’s generalization ability.
In summary, the combined use of downsampling with upweighting (Option C) is a practical and robust method to address severe class imbalance and improve model training for rare event detection.
Question 4:
You want to redesign your machine learning pipeline for structured data on Google Cloud. Currently, you use PySpark for large-scale data transformation, but your pipelines take over 12 hours to complete. You want to switch to a serverless solution that supports SQL syntax to speed up processing. Your raw data is already stored in Cloud Storage.
What is the best approach to build this pipeline?
A. Use Cloud Data Fusion’s GUI to create transformation pipelines, then write the output to BigQuery.
B. Convert PySpark scripts to SparkSQL and run them on Dataproc, outputting to BigQuery.
C. Load data into Cloud SQL, rewrite transformations as SQL queries, and use BigQuery federated queries for ML.
D. Load data into BigQuery using BigQuery Load jobs, convert PySpark logic to BigQuery SQL queries, and store results in new tables.
Answer: D
Explanation:
This question addresses the need to optimize a machine learning pipeline by moving from long-running PySpark transformations to a faster, serverless, SQL-based solution on Google Cloud.
Currently, PySpark running on a Spark cluster takes too long (over 12 hours), signaling inefficiencies in cluster management and resource usage. Your goal is to adopt a serverless tool that supports SQL to speed up development and processing time, leveraging your existing data stored in Cloud Storage.
Here’s why Option D is the best fit:
Serverless and Scalable: BigQuery is a fully managed, serverless data warehouse that automatically scales and handles parallel processing of large datasets without the need to manage infrastructure.
Performance: BigQuery’s distributed architecture can process massive amounts of data quickly, dramatically reducing pipeline runtimes compared to Spark running on Dataproc.
SQL Syntax: BigQuery uses standard SQL, allowing you to translate your PySpark data transformation logic into SQL queries, making your pipeline easier to maintain and more accessible.
Direct Integration: Loading raw data from Cloud Storage directly into BigQuery is straightforward and efficient, avoiding intermediate steps or additional services.
Machine Learning Integration: BigQuery ML lets you build and train ML models using SQL syntax directly inside BigQuery, simplifying your workflow and enabling seamless integration with Vertex AI or other ML services.
Other options are less suitable:
A (Data Fusion) relies on Dataproc clusters behind the scenes, which does not eliminate the Spark ecosystem’s overhead and complexity.
B still depends on Dataproc, which is not serverless and requires cluster management, limiting speed improvements.
C involves Cloud SQL, which is not optimized for large-scale analytical workloads and would introduce unnecessary overhead and latency through federated queries.
In conclusion, Option D best meets the requirements by leveraging BigQuery’s serverless, SQL-based processing to create a faster, more efficient ML pipeline on Google Cloud.
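As a rough illustration of what converting transformation logic to SQL looks like, the sketch below computes the same per-group average twice: once with procedural Python and once as a SQL query whose result is stored in a new table. It uses the standard library’s sqlite3 module purely as a local stand-in for BigQuery, and the table and column names are hypothetical. BigQuery’s SQL dialect differs from SQLite’s in many details, although a simple GROUP BY aggregate like this one reads the same in both.

```python
import sqlite3

# Hypothetical raw rows: (production line, sensor value).
rows = [("line_a", 10.0), ("line_a", 14.0), ("line_b", 7.0)]

# Procedural version, similar in spirit to a groupBy/aggregate step.
totals = {}
for line, value in rows:
    count, total = totals.get(line, (0, 0.0))
    totals[line] = (count + 1, total + value)
loop_result = {line: total / count for line, (count, total) in totals.items()}

# SQL version: load the raw data, then store the transformed result in a new table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_readings (line TEXT, value REAL)")
conn.executemany("INSERT INTO raw_readings VALUES (?, ?)", rows)
conn.execute(
    "CREATE TABLE line_averages AS "
    "SELECT line, AVG(value) AS avg_value FROM raw_readings GROUP BY line"
)
sql_result = dict(conn.execute("SELECT line, avg_value FROM line_averages"))

print(loop_result)
print(sql_result)
```

The SQL form is declarative, which is what lets a serverless engine decide how to parallelize the work instead of relying on a cluster that you size and tune yourself.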
Question 5:
You lead a group of data scientists who submit training jobs to a cloud backend system. However, managing this system has become overly complex. The team uses multiple machine learning frameworks like Keras, PyTorch, Theano, Scikit-learn, and some custom libraries.
To simplify administration, you want to switch to a managed service. What approach should you take?
A. Use AI Platform’s custom containers feature to accept training jobs built with any framework.
B. Deploy Kubeflow on Google Kubernetes Engine and accept training jobs via TF Job.
C. Build a collection of VM images on Compute Engine and share them in a centralized repository.
D. Implement the Slurm workload manager to schedule and run jobs on your cloud infrastructure.
Answer: A
Explanation:
In this situation, the main priorities are to reduce system management overhead by using a managed service while supporting a wide variety of machine learning frameworks, including custom code libraries. Let’s analyze the options against these priorities:
Option A is the best solution because AI Platform (now part of Vertex AI) allows the use of custom containers. This feature lets you fully define the training environment, including any ML framework (Keras, PyTorch, Theano, etc.) and required dependencies, by packaging everything inside a Docker container. Your team can submit training jobs using these containers, and the managed service will handle orchestration, scaling, and lifecycle management automatically. This reduces administrative burden significantly since you don’t need to maintain your own infrastructure.
Option B involves deploying Kubeflow on GKE. Although Kubeflow offers advanced ML orchestration capabilities, it is not a managed service, so you would still be responsible for setting up and maintaining Kubernetes clusters. Kubeflow’s TFJob is designed primarily for TensorFlow workloads and may not natively support all frameworks your team uses. This approach contradicts the goal to simplify management.
Option C suggests creating VM images with pre-installed environments. While flexible, it doesn’t provide the managed experience and requires ongoing manual management of VM lifecycles and orchestration, which is what you want to avoid.
Option D recommends Slurm, an open-source job scheduler used mostly in HPC environments. Setting up and maintaining Slurm in the cloud can be complex and is not managed, making it less suitable here.
In conclusion, Option A offers a scalable, flexible, and fully managed solution that accommodates diverse ML frameworks without increasing your administrative workload.
Question 6:
You are working for an online retail company that has developed an end-to-end machine learning pipeline on Google Cloud to classify images containing your products. You anticipate launching new products soon, so you have implemented automatic model retraining to incorporate new data. To maintain high model accuracy, you want to leverage AI Platform’s continuous evaluation service.
How should you manage your test dataset?
A. Keep the original test dataset unchanged, even after adding new products for retraining.
B. Add images of the new products to your test dataset when they are introduced during retraining.
C. Replace the existing test dataset entirely with images of the new products after retraining.
D. Update the test dataset with new product images only when evaluation metrics fall below a specific threshold.
Answer: B
Explanation:
In machine learning pipelines, especially those involving evolving product lines like your product image classifier, it’s crucial that the test dataset remains representative of all products the model is expected to classify. Continuous evaluation on an up-to-date test set ensures that your model’s performance metrics accurately reflect its ability to handle real-world data.
Option B is the best practice because extending the test dataset with images of newly introduced products ensures that the evaluation process covers both existing and new product categories. This approach maintains backward compatibility, allowing you to monitor whether the model continues to perform well on older products while validating its effectiveness on the new ones. Consequently, your evaluation results remain comprehensive, preventing blind spots in model accuracy.
Option A is not advisable since keeping the test dataset static ignores new product classes. As a result, the model might underperform on new products without this being detected, misleading stakeholders about model quality.
Option C — replacing the test dataset entirely with new product images — loses historical performance data and prevents monitoring of legacy product accuracy. In practice, backward compatibility is critical to ensure consistent user experience.
Option D is reactive, updating the test set only after a performance drop. This delays detection of issues and risks silent failures, especially problematic in dynamic retail environments where new products are frequently introduced.
In summary, proactively expanding your test dataset as new products are added (Option B) keeps continuous evaluation effective and reliable, ensuring your ML models stay accurate across your entire product range.
Question 7:
You want to create classification workflows on multiple structured datasets stored in BigQuery. Since the classification will be performed repeatedly, you need to handle exploratory data analysis, feature selection, model building, training, hyperparameter tuning, and deployment without writing any code.
Which solution best fits your needs?
A. Use AutoML Tables to handle the classification task.
B. Execute a BigQuery ML logistic regression job for classification.
C. Utilize AI Platform Notebooks with pandas for classification modeling.
D. Launch an AI Platform training job configured for hyperparameter tuning.
Answer: A
Explanation:
The requirement here is to build a classification pipeline involving exploratory data analysis (EDA), feature selection, model training, hyperparameter tuning, and serving—all without coding. This means the solution must provide a fully automated, user-friendly experience, ideally with a graphical interface, tightly integrated with BigQuery datasets.
AutoML Tables (Option A) is specifically designed for no-code machine learning workflows on structured data. It supports automatic EDA by generating summaries and visualizations of your data, performs feature engineering and selection automatically, builds various models suitable for classification, and tunes hyperparameters behind the scenes to optimize performance. Moreover, AutoML Tables can deploy trained models as REST endpoints for easy serving, eliminating the need for any code-based deployment. Since it integrates seamlessly with BigQuery, there’s no need for data export or conversion, facilitating rapid workflow setup.
In contrast, BigQuery ML (Option B) is a powerful tool that enables model building using SQL queries directly within BigQuery. However, it requires users to write SQL code and doesn’t offer a graphical interface or automated EDA and feature selection, thus failing the no-code criterion.
AI Platform Notebooks (Option C) involve writing Python code and using libraries like pandas and scikit-learn or TensorFlow for model creation and training. This approach is flexible but violates the no-code requirement since it requires programming knowledge.
AI Platform training jobs with hyperparameter tuning (Option D) allow for sophisticated model training and tuning but demand custom code for job configuration, submission, and orchestration, so it also fails the no-code requirement.
Therefore, AutoML Tables stands out as the best fit to accomplish an end-to-end classification pipeline on BigQuery data without coding, making option A the clear answer.
Question 8:
You are tasked with building a predictive model that estimates delay times across multiple public transportation routes, with predictions served to users in real time via an app. Since seasonal changes and population growth affect the data, the model needs retraining every month. Following Google Cloud’s best practices, which architecture should you implement for this end-to-end solution?
A. Use Kubeflow Pipelines to schedule and manage your workflow from training through deployment.
B. Train and deploy the model with BigQuery ML, retraining it via scheduled queries.
C. Trigger training and deployment jobs on AI Platform with Cloud Functions, initiated by Cloud Scheduler.
D. Use Cloud Composer to orchestrate a Dataflow job that runs the full pipeline from training to deployment.
Answer: A
Explanation:
The problem calls for a fully automated, scalable, and maintainable machine learning pipeline that includes data processing, model training, evaluation, deployment, and retraining every month—delivering real-time predictions in an app environment. Additionally, the solution should align with Google’s recommended best practices for ML workflows.
Kubeflow Pipelines (Option A) is a framework recommended by Google Cloud for managing multi-step machine learning workflows. It supports defining modular pipeline components for preprocessing, training, tuning, evaluation, and deployment, allowing you to automate and schedule retraining on a fixed cadence, such as monthly. Pipelines built with the Kubeflow Pipelines SDK can also run on Vertex AI Pipelines, and pipeline steps can call AI Platform or Vertex AI for scalable training and real-time serving endpoints. It also facilitates pipeline versioning, monitoring, and reproducibility, which are critical in production ML systems. Hence, Kubeflow Pipelines is an excellent choice for orchestrating complex, repeatable ML workflows in a cloud-native manner.
BigQuery ML (Option B) allows building and training models with SQL on BigQuery datasets and supports scheduled queries for retraining. However, BigQuery ML lacks native support for complex pipeline orchestration, real-time deployment to low-latency endpoints, and advanced model lifecycle management, making it less ideal for this use case.
Cloud Functions triggered by Cloud Scheduler (Option C) can automate training and deployment jobs, but this approach tends to be less scalable and harder to maintain as the pipeline grows. Hardcoded scripts reduce flexibility and observability, which complicates debugging and governance in production ML workflows.
Cloud Composer (Option D) is a managed Apache Airflow service mainly intended for orchestrating data pipelines. While powerful, it is better suited to ETL and batch workflows rather than end-to-end ML lifecycle management. Using Dataflow within Cloud Composer for training and deployment is unconventional, as Dataflow primarily handles stream and batch data processing, not ML orchestration.
In conclusion, Kubeflow Pipelines (Option A) offers the most robust, scalable, and maintainable architecture following Google’s best practices for a monthly retraining, real-time inference ML system.
Question 9:
You are building machine learning models on Google AI Platform for segmenting images in CT scans. Your model architectures evolve frequently based on the latest research, requiring repeated training on the same dataset to compare performance.
To reduce compute expenses and manual work while keeping track of code versions, which approach should you adopt?
A. Use Cloud Functions to detect code changes in Cloud Storage and trigger training jobs.
B. Use the gcloud CLI to submit AI Platform training jobs whenever code changes occur.
C. Link Cloud Build with Cloud Source Repositories to automatically trigger retraining upon code commits.
D. Set up a Cloud Composer workflow that runs daily to check for code changes in Cloud Storage using a sensor.
Answer: C
Explanation:
When developing machine learning models that require frequent updates and retraining, an automated and cost-efficient workflow that incorporates version control is essential. Option C — using Cloud Build integrated with Cloud Source Repositories — is the most effective solution in this scenario.
Cloud Build is a fully managed CI/CD service within Google Cloud that can automatically initiate workflows based on triggers such as new commits to a Cloud Source Repository or GitHub. By connecting Cloud Build to your repository, retraining jobs on AI Platform can start automatically whenever code is pushed. This automation reduces manual overhead, eliminating the need to submit training jobs manually and ensuring consistent and timely retraining.
Additionally, Cloud Source Repositories provide native version control, allowing you to track changes to your model code systematically, a critical factor for reproducibility and benchmarking. This integration ensures that training is triggered only when code changes occur, which helps minimize compute costs.
The other options have drawbacks:
A: Cloud Functions can react to changes in Cloud Storage, but they are better suited for lightweight, event-driven tasks, not heavy ML training jobs. Also, they lack native version control capabilities.
B: Using the gcloud CLI is manual and error-prone, requiring constant user intervention for each code update, leading to inefficiency.
D: Cloud Composer is powerful for orchestrating complex workflows but adds unnecessary complexity here. Running daily checks wastes resources and may delay retraining after code changes.
In summary, option C offers the best balance of automation, version control, and cost efficiency for your machine learning model retraining needs on Google Cloud.
Question 10:
Your team needs to train a model to classify images into one of three categories: driver's license, passport, or credit card. The dataset consists of 10,000 images of driver's licenses, 1,000 passports, and 1,000 credit cards, with labels ['drivers_license', 'passport', 'credit_card'].
Which loss function is most appropriate for training this model?
A. Categorical hinge
B. Binary cross-entropy
C. Categorical cross-entropy
D. Sparse categorical cross-entropy
Answer: C
Explanation:
Selecting the correct loss function is crucial for the model to learn effectively. Since this problem involves multiclass classification with three distinct categories, the model must assign each input image exclusively to one class.
Categorical cross-entropy (option C) is the standard loss function used for such tasks. It measures the difference between the predicted probability distribution (typically produced by a softmax output layer) and the true distribution (one-hot encoded labels). This loss encourages the model to maximize the probability of the correct class while minimizing the probabilities of the incorrect classes.
Let’s consider why other options are less suitable:
A (Categorical hinge): This loss is primarily used with Support Vector Machines (SVMs) for multiclass problems and is less common in neural networks where probabilistic outputs are required. It is not ideal for deep learning classification tasks with softmax outputs.
B (Binary cross-entropy): This loss is designed for binary classification problems (two classes). Using it here would require multiple one-vs-all classifiers, which complicates the training and is less efficient for multiclass problems.
D (Sparse categorical cross-entropy): This is very similar to categorical cross-entropy but expects integer class labels (e.g., 0,1,2) rather than one-hot or string labels. Since your dataset uses string labels, you would need to preprocess them to integer form before using this loss. If that preprocessing is not done, this loss is not applicable.
In conclusion, because the three classes are mutually exclusive and the string labels can be one-hot encoded, categorical cross-entropy is the best fit. It trains your model to output a probability distribution over the three classes and penalizes low predicted probability on the correct class.
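A small worked example may help show how the two cross-entropy variants relate. The sketch below implements both losses for a single example in plain Python. The predicted probabilities are made up, and the helper names are hypothetical rather than taken from any framework. It illustrates that the string labels need encoding either way (one-hot for categorical, an integer index for sparse) and that the two losses return the same value once that is done.

```python
import math

CLASSES = ["drivers_license", "passport", "credit_card"]

def one_hot(label):
    """Encode a string label as a one-hot vector over CLASSES."""
    return [1.0 if c == label else 0.0 for c in CLASSES]

def categorical_cross_entropy(y_true_one_hot, y_pred):
    """Loss for one example given a one-hot target and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(y_true_one_hot, y_pred) if t > 0)

def sparse_categorical_cross_entropy(class_index, y_pred):
    """Same loss, but the target is an integer class index."""
    return -math.log(y_pred[class_index])

# Hypothetical softmax output for one image whose true label is "passport".
probs = [0.7, 0.2, 0.1]

cce = categorical_cross_entropy(one_hot("passport"), probs)
scce = sparse_categorical_cross_entropy(CLASSES.index("passport"), probs)
print(cce, scce)
```

Both calls reduce to the negative log of the probability assigned to the true class, so the choice between them comes down to how the labels are encoded in the input pipeline.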