Exambible Professional-Data-Engineer questions are updated, and all Professional-Data-Engineer answers are verified by experts. Once you have completely prepared with our Professional-Data-Engineer exam prep kits, you will be ready for the real Professional-Data-Engineer exam without a problem. We have a leading Google Professional-Data-Engineer dumps study guide. PASSED Professional-Data-Engineer first attempt! Here is what I did.

Check Professional-Data-Engineer free dumps before getting the full version:


You are building a new data pipeline to share data between two different types of applications: job generators and job runners. Your solution must scale to accommodate increases in usage and must accommodate the addition of new applications without negatively affecting the performance of existing ones. What should you do?

  • A. Create an API using App Engine to receive and send messages to the applications
  • B. Use a Cloud Pub/Sub topic to publish jobs, and use subscriptions to execute them
  • C. Create a table on Cloud SQL, and insert and delete rows with the job information
  • D. Create a table on Cloud Spanner, and insert and delete rows with the job information

Answer: B

Cloud Pub/Sub decouples job generators from job runners: a topic scales automatically with load, and new applications can be added as additional publishers or subscribers without affecting the performance of existing ones.


Which is not a valid reason for poor Cloud Bigtable performance?

  • A. The workload isn't appropriate for Cloud Bigtable.
  • B. The table's schema is not designed correctly.
  • C. The Cloud Bigtable cluster has too many nodes.
  • D. There are issues with the network connection.

Answer: C

Having too many nodes is not a cause of poor performance; having too few is. If your Cloud Bigtable cluster is overloaded, adding more nodes can improve performance. Use the monitoring tools to check whether the cluster is overloaded.
Reference: https://cloud.google.com/bigtable/docs/performance


Which is the preferred method to use to avoid hotspotting in time series data in Bigtable?

  • A. Field promotion
  • B. Randomization
  • C. Salting
  • D. Hashing

Answer: A

By default, prefer field promotion. Field promotion avoids hotspotting in almost all cases, and it tends to make it easier to design a row key that facilitates queries.
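Field promotion means moving a high-cardinality field out of the column data and into the row key, ahead of the timestamp. A minimal pure-Python sketch (the device IDs and `#` separator are hypothetical illustrations, not Bigtable API calls):

```python
# Sketch: building Bigtable-style row keys for time-series data.
# "Field promotion" puts a hypothetical device ID ahead of the timestamp,
# so writes spread across the sorted key space instead of all landing on
# the newest timestamp (a hotspot).

def timestamp_only_key(ts: int) -> str:
    # Anti-pattern: every new write sorts to the end of the key space.
    return f"{ts}"

def promoted_key(device_id: str, ts: int) -> str:
    # Field promotion: promoted field first, timestamp second.
    return f"{device_id}#{ts}"

keys = sorted(promoted_key(d, t)
              for d in ("sensor-a", "sensor-b")
              for t in (1700000000, 1700000060))
print(keys)
```

Because the promoted field leads the key, rows for one device stay contiguous and time-ordered, which also makes per-device range scans natural.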


You need to deploy additional dependencies to all nodes of a Cloud Dataproc cluster at startup using an existing initialization action. Company security policies require that Cloud Dataproc nodes do not have access to the Internet, so public initialization actions cannot fetch resources. What should you do?

  • A. Deploy the Cloud SQL Proxy on the Cloud Dataproc master
  • B. Use an SSH tunnel to give the Cloud Dataproc cluster access to the Internet
  • C. Copy all dependencies to a Cloud Storage bucket within your VPC security perimeter
  • D. Use Resource Manager to add the service account used by the Cloud Dataproc cluster to the Network User role

Answer: C

Copying the dependencies to a Cloud Storage bucket inside the VPC security perimeter lets the initialization action fetch them without giving the Dataproc nodes Internet access.


Why do you need to split a machine learning dataset into training data and test data?

  • A. So you can try two different sets of features
  • B. To make sure your model is generalized for more than just the training data
  • C. To allow you to create unit tests in your code
  • D. So you can use one dataset for a wide model and one for a deep model

Answer: B

The flaw with evaluating a predictive model on training data is that it does not inform you on how well the model has generalized to new unseen data. A model that is selected for its accuracy on the training dataset rather than its accuracy on an unseen test dataset is very likely to have lower accuracy on an unseen test dataset. The reason is that the model is not as generalized. It has specialized to the structure in the training dataset. This is called overfitting.
Reference: https://machinelearningmastery.com/a-simple-intuition-for-overfitting/
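The idea behind holding out a test set can be sketched in plain Python; real pipelines would typically use a library helper such as scikit-learn's `train_test_split`, so this is only an illustration of the split itself:

```python
# Sketch: holding out a test set so model accuracy is measured on
# data the model never saw during training.
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    rng = random.Random(seed)     # fixed seed for a reproducible split
    shuffled = rows[:]            # copy so the input list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))
train, test = train_test_split(data)
print(len(train), len(test))  # 80 20
```

Evaluating only on `train` rewards memorization; the held-out `test` slice estimates how the model generalizes.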


The marketing team at your organization provides regular updates of a segment of your customer dataset. The marketing team has given you a CSV with 1 million records that must be updated in BigQuery. When you use the UPDATE statement in BigQuery, you receive a quotaExceeded error. What should you do?

  • A. Reduce the number of records updated each day to stay within the BigQuery UPDATE DML statement limit.
  • B. Increase the BigQuery UPDATE DML statement limit in the Quota management section of the Google Cloud Platform Console.
  • C. Split the source CSV file into smaller CSV files in Cloud Storage to reduce the number of BigQuery UPDATE DML statements per BigQuery job.
  • D. Import the new records from the CSV file into a new BigQuery table. Create a BigQuery job that merges the new records with the existing records and writes the results to a new BigQuery table.

Answer: D

Loading the CSV into a new table and merging with a single job avoids the per-table UPDATE DML quota that row-by-row updates run into.
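The load-and-merge approach can be sketched as a single MERGE DML statement run after the CSV is loaded into a staging table; the project, dataset, table, and column names below are hypothetical:

```python
# Sketch: one BigQuery MERGE reconciles a freshly loaded staging table
# with the main table in a single DML statement, instead of issuing
# many quota-limited UPDATE statements. All names are hypothetical.
merge_sql = """
MERGE `project.dataset.customers` AS main
USING `project.dataset.customers_staging` AS staged
ON main.customer_id = staged.customer_id
WHEN MATCHED THEN
  UPDATE SET main.segment = staged.segment
WHEN NOT MATCHED THEN
  INSERT (customer_id, segment) VALUES (staged.customer_id, staged.segment)
"""
print(merge_sql.strip().splitlines()[0])
```

One MERGE counts as a single DML statement no matter how many rows it touches, which is the point of staging the CSV first.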


You need to choose a database to store time series CPU and memory usage for millions of computers. You need to store this data in one-second interval samples. Analysts will be performing real-time, ad hoc analytics against the database. You want to avoid being charged for every query executed and ensure that the schema design will allow for future growth of the dataset. Which database and data model should you choose?

  • A. Create a table in BigQuery, and append the new samples for CPU and memory to the table
  • B. Create a wide table in BigQuery, create a column for the sample value at each second, and update the row with the interval for each second
  • C. Create a narrow table in Cloud Bigtable with a row key that combines the Compute Engine computer identifier with the sample time at each second
  • D. Create a wide table in Cloud Bigtable with a row key that combines the computer identifier with the sample time at each minute, and combine the values for each second as column data.

Answer: C

A tall, narrow table with a row key combining the instance identifier and the per-second timestamp is the recommended Bigtable pattern for time-series data, and Bigtable's node-based pricing avoids per-query charges.
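The tall-and-narrow layout described in option C can be sketched in pure Python (the machine IDs, column family, and column names are hypothetical, not Bigtable client calls):

```python
# Sketch: tall-and-narrow Bigtable layout for per-second metrics.
# One row per (machine, second); the row key keeps each machine's
# samples contiguous and lexicographically time-ordered.

def metric_row(machine_id: str, epoch_s: int, cpu: float, mem: float):
    # Zero-pad the timestamp so string ordering matches numeric ordering.
    row_key = f"{machine_id}#{epoch_s:010d}"
    return row_key, {"metrics:cpu": cpu, "metrics:mem": mem}

key, cells = metric_row("vm-0042", 1700000000, 0.37, 0.81)
print(key)  # vm-0042#1700000000
```

A range scan over the key prefix `vm-0042#` then returns one machine's samples in time order without touching other machines' rows.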


You are responsible for writing your company’s ETL pipelines to run on an Apache Hadoop cluster. The pipeline will require some checkpointing and splitting pipelines. Which method should you use to write the pipelines?

  • A. PigLatin using Pig
  • B. HiveQL using Hive
  • C. Java using MapReduce
  • D. Python using MapReduce

Answer: A

Pig Latin natively supports splitting pipelines (the SPLIT operator) and checkpointing intermediate results, which is why Pig is well suited to this style of Hadoop ETL.


You need to create a data pipeline that copies time-series transaction data so that it can be queried from within BigQuery by your data science team for analysis. Every hour, thousands of transactions are updated with a new status. The size of the initial dataset is 1.5 PB, and it will grow by 3 TB per day. The data is heavily structured, and your data science team will build machine learning models based on this data. You want to maximize performance and usability for your data science team. Which two strategies should you adopt? Choose 2 answers.

  • A. Denormalize the data as much as possible.
  • B. Preserve the structure of the data as much as possible.
  • C. Use BigQuery UPDATE to further reduce the size of the dataset.
  • D. Develop a data pipeline where status updates are appended to BigQuery instead of updated.
  • E. Copy a daily snapshot of transaction data to Cloud Storage and store it as an Avro file. Use BigQuery's support for external data sources to query.

Answer: AD

Denormalizing maximizes BigQuery query performance, and appending status updates instead of issuing UPDATE DML keeps the pipeline scalable at this volume; querying Avro files in Cloud Storage as external tables would reduce performance.
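With an append-only pipeline (option D), the current status of each transaction is recovered at query time; a common pattern is a ROW_NUMBER() window query. The table and column names below are hypothetical:

```python
# Sketch: latest-state query over an append-only status table.
# Each status change is a new row; the window function keeps only the
# most recent row per transaction. Names are hypothetical.
latest_status_sql = """
SELECT * EXCEPT(rn)
FROM (
  SELECT t.*,
         ROW_NUMBER() OVER (PARTITION BY transaction_id
                            ORDER BY status_time DESC) AS rn
  FROM `project.dataset.transactions` AS t
)
WHERE rn = 1
"""
print(latest_status_sql.strip().splitlines()[0])
```

This trades a little query-time work for an ingest path that never needs UPDATE DML, which matters at petabyte scale.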


What are two of the characteristics of using online prediction rather than batch prediction?

  • A. It is optimized to handle a high volume of data instances in a job and to run more complex models.
  • B. Predictions are returned in the response message.
  • C. Predictions are written to output files in a Cloud Storage location that you specify.
  • D. It is optimized to minimize the latency of serving predictions.

Answer: BD

Online prediction: optimized to minimize the latency of serving predictions; predictions are returned in the response message.
Batch prediction: optimized to handle a high volume of instances in a job and to run more complex models; predictions are written to output files in a Cloud Storage location that you specify.


You have a query that filters a BigQuery table using a WHERE clause on timestamp and ID columns. By using bq query --dry_run you learn that the query triggers a full scan of the table, even though the filter on timestamp and ID selects a tiny fraction of the overall data. You want to reduce the amount of data scanned by BigQuery with minimal changes to existing SQL queries. What should you do?

  • A. Create a separate table for each ID.
  • B. Use the LIMIT keyword to reduce the number of rows returned.
  • C. Recreate the table with a partitioning column and clustering column.
  • D. Use the bq query --maximum_bytes_billed flag to restrict the number of bytes billed.

Answer: C

LIMIT does not reduce the amount of data scanned; partitioning on the timestamp column and clustering on the ID column lets BigQuery prune the scan while the existing SQL queries stay essentially unchanged.
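A minimal sketch of the recreate step, assuming a timestamp column named event_ts and an ID column named event_id (both hypothetical), is a CREATE TABLE ... AS SELECT with partitioning and clustering:

```python
# Sketch: DDL that recreates the table partitioned by date and clustered
# by ID, so WHERE filters on those columns prune the bytes scanned.
# Project, dataset, table, and column names are hypothetical.
ddl = """
CREATE TABLE `project.dataset.events_partitioned`
PARTITION BY DATE(event_ts)
CLUSTER BY event_id
AS SELECT * FROM `project.dataset.events`
"""
print(ddl.strip().splitlines()[0])
```

After the swap, the same WHERE clause on timestamp and ID touches only the matching partitions and clustered blocks instead of the whole table.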


You work for a manufacturing company that sources up to 750 different components, each from a different supplier. You’ve collected a labeled dataset that has on average 1000 examples for each unique component. Your team wants to implement an app to help warehouse workers recognize incoming components based on a photo of the component. You want to implement the first working version of this app (as Proof-Of-Concept) within a few working days. What should you do?

  • A. Use Cloud Vision AutoML with the existing dataset.
  • B. Use Cloud Vision AutoML, but reduce your dataset twice.
  • C. Use Cloud Vision API by providing custom labels as recognition hints.
  • D. Train your own image recognition model leveraging transfer learning techniques.

Answer: A


What is the recommended way to switch between SSD and HDD storage for your Google Cloud Bigtable instance?

  • A. create a third instance and sync the data from the two storage types via batch jobs
  • B. export the data from the existing instance and import the data into a new instance
  • C. run parallel instances where one is HDD and the other is SSD
  • D. the selection is final and you must resume using the same storage type

Answer: B

When you create a Cloud Bigtable instance and cluster, your choice of SSD or HDD storage for the cluster is permanent. You cannot use the Google Cloud Platform Console to change the type of storage that is used for the cluster.
If you need to convert an existing HDD cluster to SSD, or vice versa, you can export the data from the existing instance and import the data into a new instance. Alternatively, you can write a Cloud Dataflow or Hadoop MapReduce job that copies the data from one instance to another.
Reference: https://cloud.google.com/bigtable/docs/choosing-ssd-hdd


You are designing storage for 20 TB of text files as part of deploying a data pipeline on Google Cloud. Your input data is in CSV format. You want to minimize the cost of querying aggregate values for multiple users who will query the data in Cloud Storage with multiple engines. Which storage service and schema design should you use?

  • A. Use Cloud Bigtable for storage. Install the HBase shell on a Compute Engine instance to query the Cloud Bigtable data.
  • B. Use Cloud Bigtable for storage. Link as permanent tables in BigQuery for query.
  • C. Use Cloud Storage for storage. Link as permanent tables in BigQuery for query.
  • D. Use Cloud Storage for storage. Link as temporary tables in BigQuery for query.

Answer: C

Keeping the CSV files in Cloud Storage lets multiple engines read them, and permanent external tables in BigQuery avoid redefining the table for every query.


What are all of the BigQuery operations that Google charges for?

  • A. Storage, queries, and streaming inserts
  • B. Storage, queries, and loading data from a file
  • C. Storage, queries, and exporting data
  • D. Queries and streaming inserts

Answer: A

Google charges for storage, queries, and streaming inserts. Loading data from a file and exporting data are free operations.
Reference: https://cloud.google.com/bigquery/pricing
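As a rough illustration, on-demand query cost can be estimated from bytes scanned. The $5-per-TiB rate and 1 TiB monthly free tier below are assumptions based on historical US on-demand pricing; check the current pricing page before relying on them:

```python
# Sketch: estimating BigQuery on-demand query cost from bytes scanned.
# Rate and free tier are assumed historical values, not current prices.

def query_cost_usd(bytes_scanned: int, price_per_tib: float = 5.0,
                   free_tib: float = 1.0) -> float:
    tib = bytes_scanned / 2**40          # bytes -> TiB
    return max(0.0, tib - free_tib) * price_per_tib

print(round(query_cost_usd(3 * 2**40), 2))  # 3 TiB scanned -> 10.0
```

This is also why partitioning and clustering matter: billed bytes track bytes scanned, not rows returned.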


You are designing a data processing pipeline. The pipeline must be able to scale automatically as load increases. Messages must be processed at least once, and must be ordered within windows of 1 hour. How should you design the solution?

  • A. Use Apache Kafka for message ingestion and use Cloud Dataproc for streaming analysis.
  • B. Use Apache Kafka for message ingestion and use Cloud Dataflow for streaming analysis.
  • C. Use Cloud Pub/Sub for message ingestion and Cloud Dataproc for streaming analysis.
  • D. Use Cloud Pub/Sub for message ingestion and Cloud Dataflow for streaming analysis.

Answer: D

Cloud Pub/Sub scales automatically and delivers messages at least once, and Cloud Dataflow provides autoscaling streaming analysis with windowing, so messages can be ordered within one-hour windows.


You have spent a few days loading data from comma-separated values (CSV) files into the Google BigQuery table CLICK_STREAM. The column DT stores the epoch time of click events. For convenience, you chose a simple schema where every field is treated as the STRING type. Now, you want to compute web session durations of users who visit your site, and you want to change the column's data type to TIMESTAMP. You want to minimize the migration effort without making future queries computationally expensive. What should you do?

  • A. Delete the table CLICK_STREAM, and then re-create it such that the column DT is of the TIMESTAMP type. Reload the data.
  • B. Add a column TS of the TIMESTAMP type to the table CLICK_STREAM, and populate it from the numeric values in the column DT for each row. Reference the column TS instead of the column DT from now on.
  • C. Create a view CLICK_STREAM_V, where strings from the column DT are cast into TIMESTAMP values. Reference the view CLICK_STREAM_V instead of the table CLICK_STREAM from now on.
  • D. Add two columns to the table CLICK_STREAM: TS of the TIMESTAMP type and IS_NEW of the BOOLEAN type. Reload all data in append mode. For each appended row, set the value of IS_NEW to true. For future queries, reference the column TS instead of the column DT, with the WHERE clause ensuring that the value of IS_NEW must be true.
  • E. Construct a query to return every row of the table CLICK_STREAM, while using the built-in function to cast strings from the column DT into TIMESTAMP values. Run the query into a destination table NEW_CLICK_STREAM, in which the column TS is of the TIMESTAMP type. Reference the table NEW_CLICK_STREAM instead of the table CLICK_STREAM from now on. In the future, new data is loaded into the table NEW_CLICK_STREAM.

Answer: E

A one-time query into a destination table pays the cast cost once, so future queries read a native TIMESTAMP column; a casting view (option C) would make every future query recompute the conversion.


You are designing a cloud-native historical data processing system to meet the following conditions:
  • The data being analyzed is in CSV, Avro, and PDF formats and will be accessed by multiple analysis tools including Cloud Dataproc, BigQuery, and Compute Engine.
  • A streaming data pipeline stores new data daily.
  • Performance is not a factor in the solution.
  • The solution design should maximize availability.
How should you design data storage for this solution?

  • A. Create a Cloud Dataproc cluster with high availability. Store the data in HDFS, and perform analysis as needed.
  • B. Store the data in BigQuery. Access the data using the BigQuery Connector on Cloud Dataproc and Compute Engine.
  • C. Store the data in a regional Cloud Storage bucket. Access the bucket directly using Cloud Dataproc, BigQuery, and Compute Engine.
  • D. Store the data in a multi-regional Cloud Storage bucket. Access the data directly using Cloud Dataproc, BigQuery, and Compute Engine.

Answer: D

PDF files cannot be loaded into BigQuery as native tables, all three tools can read Cloud Storage directly, and a multi-regional bucket provides higher availability than a regional one.


When a Cloud Bigtable node fails, ____ is lost.

  • A. all data
  • B. no data
  • C. the last transaction
  • D. the time dimension

Answer: B

A Cloud Bigtable table is sharded into blocks of contiguous rows, called tablets, to help balance the workload of queries. Tablets are stored on Colossus, Google's file system, in SSTable format. Each tablet is associated with a specific Cloud Bigtable node.
Data is never stored in Cloud Bigtable nodes themselves; each node has pointers to a set of tablets that are stored on Colossus. As a result:
Rebalancing tablets from one node to another is very fast, because the actual data is not copied. Cloud Bigtable simply updates the pointers for each node.
Recovery from the failure of a Cloud Bigtable node is very fast, because only metadata needs to be migrated to the replacement node.
When a Cloud Bigtable node fails, no data is lost.
Reference: https://cloud.google.com/bigtable/docs/overview
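The pointer-based design above can be illustrated with a toy sketch: reassigning a failed node's tablets is only a metadata update, with no data copied:

```python
# Sketch: Bigtable nodes hold pointers to tablets stored in Colossus,
# not the tablet data itself. Failover reassigns pointers; the data
# stays where it is. Node and tablet names are illustrative.
assignments = {
    "node-1": ["tablet-a", "tablet-b"],
    "node-2": ["tablet-c"],
}

def reassign(assignments, failed, replacement):
    # Move every tablet pointer from the failed node to the replacement.
    assignments.setdefault(replacement, []).extend(assignments.pop(failed))
    return assignments

reassign(assignments, "node-1", "node-3")
print(assignments)  # tablets a and b now point at node-3; no data moved
```

The same mechanism makes load rebalancing fast: only the small pointer map changes hands.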


You want to use a database of information about tissue samples to classify future tissue samples as either normal or mutated. You are evaluating an unsupervised anomaly detection method for classifying the tissue samples. Which two characteristics support this method? (Choose two.)

  • A. There are very few occurrences of mutations relative to normal samples.
  • B. There are roughly equal occurrences of both normal and mutated samples in the database.
  • C. You expect future mutations to have different features from the mutated samples in the database.
  • D. You expect future mutations to have similar features to the mutated samples in the database.
  • E. You already have labels for which samples are mutated and which are normal in the database.

Answer: AC

Unsupervised anomaly detection fits when anomalies are rare relative to normal samples and when future anomalies may not resemble the anomalies already observed; with roughly balanced classes or reliable labels, supervised classification would be the better choice.


You need to create a new transaction table in Cloud Spanner that stores product sales data. You are deciding what to use as a primary key. From a performance perspective, which strategy should you choose?

  • A. The current epoch time
  • B. A concatenation of the product name and the current epoch time
  • C. A random universally unique identifier number (version 4 UUID)
  • D. The original order identification number from the sales system, which is a monotonically increasing integer

Answer: C

Monotonically increasing keys such as epoch timestamps or sequential order numbers concentrate writes on a single Spanner split; a version 4 UUID distributes writes evenly across the key space.
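A quick sketch of the difference: Python's standard uuid module generates version 4 UUIDs whose values are effectively random, so consecutive inserts do not cluster at one end of the key space the way sequential order numbers do:

```python
# Sketch: version 4 UUID primary keys vs. monotonically increasing keys.
# Two consecutive UUIDs share no ordering relationship, so writes spread
# across a Spanner table's splits instead of hammering the last one.
import uuid

def sales_row_key() -> str:
    return str(uuid.uuid4())   # e.g. "3f2a...-...-..." (random each call)

k1, k2 = sales_row_key(), sales_row_key()
print(len(k1))  # 36-character canonical UUID string
```

By contrast, keys like `order_id = 10001, 10002, ...` sort adjacently, so every insert lands on the same split.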


To run a TensorFlow training job on your own computer using Cloud Machine Learning Engine, what would your command start with?

  • A. gcloud ml-engine local train
  • B. gcloud ml-engine jobs submit training
  • C. gcloud ml-engine jobs submit training local
  • D. You can't run a TensorFlow program on your own computer using Cloud ML Engine .

Answer: A

gcloud ml-engine local train - run a Cloud ML Engine training job locally.
This command runs the specified module in an environment similar to that of a live Cloud ML Engine Training Job. This is especially useful for testing distributed models, as it allows you to validate that you are properly interacting with the Cloud ML Engine cluster configuration.
Reference: https://cloud.google.com/sdk/gcloud/reference/ml-engine/local/train


You are implementing several batch jobs that must be executed on a schedule. These jobs have many interdependent steps that must be executed in a specific order. Portions of the jobs involve executing shell scripts, running Hadoop jobs, and running queries in BigQuery. The jobs are expected to run for many minutes up to several hours. If the steps fail, they must be retried a fixed number of times. Which service should you use to manage the execution of these jobs?

  • A. Cloud Scheduler
  • B. Cloud Dataflow
  • C. Cloud Functions
  • D. Cloud Composer

Answer: D

Cloud Composer (managed Apache Airflow) is built for orchestrating interdependent steps across shell scripts, Hadoop jobs, and BigQuery queries, with scheduling, dependency ordering, and a configurable number of retries per task; Cloud Scheduler can only trigger jobs on a schedule and cannot manage dependencies or retries within a multi-step workflow.
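Cloud Composer (option D) models such workflows as DAGs of tasks with per-task retry counts. The ordering-and-retry requirement can be sketched in plain Python; this illustrates the semantics only and is not the Airflow API:

```python
# Sketch: interdependent steps executed in order, each retried a fixed
# number of times before the whole pipeline aborts. Step names are
# hypothetical; a real deployment would express this as an Airflow DAG.

def run_pipeline(steps, retries=2):
    """steps: list of (name, callable) run in order; each step gets up
    to `retries` extra attempts before the pipeline raises."""
    completed = []
    for name, fn in steps:
        for attempt in range(retries + 1):
            try:
                fn()
                completed.append(name)
                break
            except Exception:
                if attempt == retries:
                    raise RuntimeError(f"step {name!r} failed after retries")
    return completed

flaky = {"count": 0}
def sometimes_fails():
    # Fails on the first attempt, succeeds on the retry.
    flaky["count"] += 1
    if flaky["count"] < 2:
        raise ValueError("transient error")

print(run_pipeline([("extract", lambda: None),
                    ("transform", sometimes_fails),
                    ("load", lambda: None)]))
# ['extract', 'transform', 'load']
```

In Airflow these would be operators (BashOperator, BigQuery operators, etc.) wired with dependencies and a `retries` parameter per task.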


P.S. Easily pass Professional-Data-Engineer Exam with 239 Q&As DumpSolutions.com Dumps & pdf Version, Welcome to Download the Newest DumpSolutions.com Professional-Data-Engineer Dumps: https://www.dumpsolutions.com/Professional-Data-Engineer-dumps/ (239 New Questions)