A Professional Data Engineer enables data-driven decision making by collecting, transforming, and visualizing data. The Data Engineer designs, builds, maintains, and troubleshoots data processing systems with a particular emphasis on the security, reliability, fault tolerance, scalability, fidelity, and efficiency of such systems.
The Data Engineer also analyzes data to gain insight into business outcomes, builds statistical models to support decision-making, and creates machine learning models to automate and simplify key business processes.
The Google Cloud Certified – Professional Data Engineer exam assesses your ability to:
- Build and maintain data structures and databases
- Design data processing systems
- Analyze data and enable machine learning
- Model business processes for analysis and optimization
- Design for reliability
- Visualize data and advocate policy
- Design for security and compliance
About this certification exam
This exam objectively measures an individual’s ability to demonstrate the critical job skills for the role. To earn this certification, you must pass the Professional Data Engineer exam. The format is multiple choice and multiple select. The exam has no prerequisites. This exam must be taken in person at one of our testing center locations.
- Length: 2 hours
- Registration fee: USD $200
- Languages: English, Japanese, Spanish, Portuguese
Section 1: Designing data processing systems
1.1 Designing flexible data representations. Considerations include:
- future advances in data technology
- changes to business requirements
- awareness of current state and how to migrate the design to a future state
- data modeling
- tradeoffs
- distributed systems
- schema design
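To ground the schema design and data modeling items above: BigQuery rewards denormalized schemas with nested and repeated fields, trading storage for fewer joins. A minimal sketch using the google-cloud-bigquery Python client follows; the project, dataset, and field names are placeholders for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Nested, repeated fields let one row carry a customer and all of their
# addresses, avoiding a join at query time (a common denormalization
# tradeoff: more storage, fewer joins).
schema = [
    bigquery.SchemaField("customer_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("signup_date", "DATE"),
    bigquery.SchemaField(
        "addresses",
        "RECORD",
        mode="REPEATED",
        fields=[
            bigquery.SchemaField("type", "STRING"),
            bigquery.SchemaField("city", "STRING"),
            bigquery.SchemaField("postal_code", "STRING"),
        ],
    ),
]

table = bigquery.Table("my-project.crm.customers", schema=schema)  # placeholder ID
client.create_table(table)
```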
1.2 Designing data pipelines. Considerations include:
- future advances in data technology
- changes to business requirements
- awareness of current state and how to migrate the design to a future state
- data modeling
- tradeoffs
- system availability
- distributed systems
- schema design
- common sources of error (e.g., removing selection bias)
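As a concrete instance of the pipeline considerations above, here is a minimal batch pipeline sketch in Apache Beam (the SDK behind Cloud Dataflow): read raw CSV lines, parse them, drop non-positive amounts, and write the result. The bucket paths and field layout are assumptions for illustration.

```python
import apache_beam as beam

def parse_line(line):
    # Toy CSV layout: user_id,amount
    fields = line.split(",")
    return {"user_id": fields[0], "amount": float(fields[1])}

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/raw/events.csv")
        | "Parse" >> beam.Map(parse_line)
        | "KeepPositive" >> beam.Filter(lambda row: row["amount"] > 0)
        | "Format" >> beam.Map(lambda row: f"{row['user_id']},{row['amount']}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/clean/events")
    )
```

The same pipeline graph runs locally or on Dataflow; only the pipeline options change, which is one way to keep a design portable as requirements evolve.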
1.3 Designing data processing infrastructure. Considerations include:
- future advances in data technology
- changes to business requirements
- awareness of current state and how to migrate the design to a future state
- data modeling
- tradeoffs
- system availability
- distributed systems
- schema design
- capacity planning
- different types of architectures: message brokers, message queues, middleware, service-oriented
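For the message-broker style in particular, Cloud Pub/Sub is the managed option on Google Cloud. A minimal publisher sketch, with placeholder project and topic IDs:

```python
from google.cloud import pubsub_v1

project_id = "my-project"   # placeholder
topic_id = "ingest-events"  # placeholder

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)

# The broker decouples producers from consumers: downstream pipelines
# subscribe to the topic and scale independently of this publisher.
future = publisher.publish(topic_path, data=b'{"user_id": "42", "amount": 9.99}')
print(f"Published message {future.result()}")
```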
Section 2: Building and maintaining data structures and databases
2.1 Building and maintaining flexible data representations
2.2 Building and maintaining pipelines. Considerations include:
- data cleansing
- batch and streaming
- transformation
- acquiring and importing data
- testing and quality control
- connecting to new data sources
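A streaming variant of the cleansing and transformation items above, sketched with Apache Beam: consume JSON messages from a Pub/Sub subscription, drop or normalize bad records, and append to an existing BigQuery table. The subscription and table names are placeholders.

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

class CleanRecord(beam.DoFn):
    """Drop records that fail basic validation; normalize the rest."""

    def process(self, message):
        try:
            record = json.loads(message.decode("utf-8"))
        except (ValueError, UnicodeDecodeError):
            return  # drop malformed payloads (or route them to a dead-letter sink)
        if "user_id" not in record:
            return
        record["user_id"] = record["user_id"].strip().lower()
        yield record

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/ingest-sub")
        | "Clean" >> beam.ParDo(CleanRecord())
        | "ToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:analytics.events",  # assumed to exist already
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
```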
2.3 Building and maintaining processing infrastructure. Considerations include:
- provisioning resources
- monitoring pipelines
- adjusting pipelines
- testing and quality control
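Provisioning is scriptable as well. A sketch that creates a small Dataproc cluster with the google-cloud-dataproc client; the project, region, and machine shapes are assumptions, and the API endpoint must match the region.

```python
from google.cloud import dataproc_v1

project_id = "my-project"  # placeholder
region = "us-central1"

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "etl-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
result = operation.result()  # blocks until provisioning finishes
print(f"Cluster created: {result.cluster_name}")
```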
Section 3: Analyzing data and enabling machine learning
3.1 Analyzing data. Considerations include:
- data collection and labeling
- data visualization
- dimensionality reduction
- data cleaning/normalization
- defining success metrics
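Normalization and dimensionality reduction compose naturally; PCA in particular is scale-sensitive, so standardize first. A small scikit-learn sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for real data: 100 rows, 10 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps enough components to explain 95% of variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_)
```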
3.2 Machine learning. Considerations include:
- feature selection/engineering
- algorithm selection
- debugging a model
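Feature selection and algorithm selection are best evaluated together, inside cross-validation, so the selected features do not leak information from the held-out folds. A scikit-learn sketch on a bundled dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Bundling selection with the estimator means cross-validation scores
# the whole recipe, not a feature set pre-selected on the full data.
model = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=10),
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(model, X, y, cv=5)
print(f"mean accuracy: {scores.mean():.3f}")
```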
3.3 Machine learning model deployment. Considerations include:
- performance/cost optimization
- online/dynamic learning
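Online/dynamic learning means updating a deployed model incrementally as data arrives rather than retraining from scratch, which is often the cheaper option. A sketch using scikit-learn's partial_fit on synthetic mini-batches:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss")  # "log" in older scikit-learn versions
classes = np.array([0, 1])  # the full label set must be declared up front

rng = np.random.default_rng(0)
for _ in range(100):  # stand-in for a stream of incoming batches
    X_batch = rng.normal(size=(32, 5))
    y_batch = (X_batch.sum(axis=1) > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)
```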
Section 4: Modeling business processes for analysis and optimization
4.1 Mapping business requirements to data representations. Considerations include:
- working with business users
- gathering business requirements
4.2 Optimizing data representations, data infrastructure performance and cost. Considerations include:
- resizing and scaling resources
- data cleansing
- distributed systems
- high performance algorithms
- common sources of error (e.g., removing selection bias)
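On the cost side, partitioning and clustering a BigQuery table are the standard levers: queries that filter on the partition column scan (and bill for) only the matching partitions. A sketch with the google-cloud-bigquery client, placeholder names throughout:

```python
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "my-project.analytics.events",  # placeholder table ID
    schema=[
        bigquery.SchemaField("event_date", "DATE"),
        bigquery.SchemaField("user_id", "STRING"),
        bigquery.SchemaField("amount", "FLOAT"),
    ],
)
# Daily partitions prune whole days at query time; clustering co-locates
# rows by user_id so filters on it read fewer blocks.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="event_date"
)
table.clustering_fields = ["user_id"]
client.create_table(table)
```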
Section 5: Ensuring reliability
5.1 Performing quality control. Considerations include:
- verification
- building and running test suites
- pipeline monitoring
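Pipelines can be unit-tested like any other code. Apache Beam ships test utilities that run a transform against an in-memory source and assert on its output; a minimal sketch:

```python
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to

def test_keep_positive_amounts():
    # Exercise one transform in isolation with in-memory input.
    with TestPipeline() as p:
        output = (
            p
            | beam.Create([{"amount": 5.0}, {"amount": -1.0}, {"amount": 2.5}])
            | beam.Filter(lambda row: row["amount"] > 0)
        )
        assert_that(output, equal_to([{"amount": 5.0}, {"amount": 2.5}]))
```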
5.2 Assessing, troubleshooting, and improving data representations and data processing infrastructure.
5.3 Recovering data. Considerations include:
- planning (e.g., fault tolerance)
- executing (e.g., rerunning failed jobs, performing retrospective re-analysis)
- stress testing data recovery plans and processes
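Rerunning failed jobs is usually wrapped in retry logic with backoff so transient failures heal on their own. A generic sketch; run_with_retries is a hypothetical helper, and the caught exception type should be narrowed to whatever your job actually raises:

```python
import logging
import time

def run_with_retries(job, max_attempts=5, base_delay_s=2.0):
    """Rerun a failed job with exponential backoff before giving up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception as exc:  # narrow to your job's failure type
            if attempt == max_attempts:
                raise
            delay = base_delay_s * 2 ** (attempt - 1)
            logging.warning("Attempt %d failed (%s); retrying in %.0fs",
                            attempt, exc, delay)
            time.sleep(delay)
```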
Section 6: Visualizing data and advocating policy
6.1 Building (or selecting) data visualization and reporting tools. Considerations include:
- automation
- decision support
- data summarization (e.g., translation up the chain, fidelity, trackability, integrity)
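Summarization is often the first step before any visualization tool sees the data: aggregates travel up the chain, not raw rows. A small pandas sketch with toy data and a placeholder output file:

```python
import pandas as pd

# Toy transactions standing in for a warehouse extract.
df = pd.DataFrame({
    "region": ["east", "east", "west", "west", "west"],
    "amount": [120.0, 80.0, 200.0, 50.0, 95.0],
})

# Per-region totals and counts: the summary the reporting layer consumes.
summary = df.groupby("region")["amount"].agg(total="sum", orders="count")
summary.to_csv("daily_regional_summary.csv")
print(summary)
```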
6.2 Advocating policies and publishing data and reports.
Section 7: Designing for security and compliance
7.1 Designing secure data infrastructure and processes. Considerations include:
- Identity and Access Management (IAM)
- data security
- penetration testing
- Separation of Duties (SoD)
- security controls
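IAM and separation of duties come down to role bindings: for example, analysts get read-only access to a bucket so they can query but never modify source data. A sketch with the google-cloud-storage client; the bucket name and group are placeholders:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-data-lake")  # placeholder bucket name

# Least privilege plus separation of duties: viewers can read objects
# but cannot modify or delete them.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({
    "role": "roles/storage.objectViewer",
    "members": {"group:analysts@example.com"},
})
bucket.set_iam_policy(policy)
```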
7.2 Designing for legal compliance. Considerations include:
- legislation (e.g., Health Insurance Portability and Accountability Act (HIPAA), Children’s Online Privacy Protection Act (COPPA), etc.)
- audits