Latest revision as of 07:44, 24 October 2017

Notes for Google Cloud Data Engineer (GCDE) certification. See GCDE.

Links:

Certification info: https://cloud.google.com/certification/data-engineer
Sample case study: https://cloud.google.com/certification/guides/data-engineer/casestudy-flowlogistic
Tutorials/Guides/Resources for all of Google Cloud: https://cloud.google.com/solutions/

Case Study

The GCDEC page gives an example case study that shows how different parts of the Google Cloud platform come together in the kind of scenario a real company might face. The case study focuses on a logistics company that delivers packages and tracks the deliveries with servers, software, and other infrastructure already in place. The company's goal is to improve its computational infrastructure by moving parts of it to the cloud, and to implement the ability to predict late shipments.

Google Cloud/Case Study

Google Cloud Services

Notes on all of the various parts of the Google Cloud platform and the services available on it.

Introduction

Google Cloud for Big Data

MapReduce - can use Dataflow
Spark - can use Dataproc
BigQuery
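As a reminder of what the MapReduce model in the list above actually does, here is a minimal pure-Python sketch of a word count split into map, shuffle, and reduce steps. It does not use Dataflow, Dataproc, or Hadoop; the function names and the sample lines are made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a small in-memory input."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample input, loosely themed on the logistics case study.
counts = word_count(["late shipment", "Late delivery", "shipment delivered"])
```

On a real cluster the map and reduce steps run in parallel on many machines and the shuffle moves data between them; here everything runs in one process.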

Usage scenarios

Foundations

Compute and Storage

Data ingestion

Data storage

Federated analysis

Compute Engine

Cloud Storage

Data Analytics

Cloud SQL - relational database

Dataproc for machine learning

Bigtop ecosystem:

Pig
Spark
Hive
Hadoop

Data Storage

Choosing a storage option: https://cloud.google.com/storage-options/

Data warehousing:

Bigtable - low-latency, updatable store for warehouse-scale data that is not highly structured and does not need ACID transactions
BigQuery - petabyte-scale, structured, column-oriented, SQL-queryable data warehouse solution

Data storage:

Cloud Storage - unstructured data (documents, sound files, PDFs, etc.)
Cloud Datastore - non-relational (NoSQL), highly scalable storage solution; SQL-like query language; more restrictive queries (because they are optimized to be faster); supports ACID transactions
Cloud SQL - full SQL support and online transaction processing (OLTP) system
Cloud Spanner - (horizontally sharded SQL) fully managed, mission-critical relational OLTP database that can scale horizontally to hundreds or thousands of servers to handle high transaction workloads; supports ACID transactions

Writeup of Spanner: https://quizlet.com/blog/quizlet-cloud-spanner
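The storage notes above can be condensed into a rough decision helper. This is only a sketch of the rules of thumb in these notes, not official guidance: the Workload fields and the suggest_storage function are hypothetical names, and real choices depend on more factors (cost, data size, access patterns) than are modeled here.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a workload, using traits from the notes."""
    structured: bool = True            # data fits a schema
    relational: bool = False           # needs full SQL / OLTP
    analytics: bool = False            # warehouse-style analysis
    low_latency_updates: bool = False  # needs fast reads and in-place updates
    horizontal_scale: bool = False     # must scale across many servers


def suggest_storage(w: Workload) -> str:
    """Map a workload to one product name, following the notes above."""
    if w.analytics:
        # Warehousing: Bigtable for low-latency, updatable data; else BigQuery.
        return "Bigtable" if w.low_latency_updates else "BigQuery"
    if not w.structured:
        # Documents, sound files, PDFs, and similar objects.
        return "Cloud Storage"
    if w.relational:
        # Both are relational OLTP; Spanner is the horizontally scaling one.
        return "Cloud Spanner" if w.horizontal_scale else "Cloud SQL"
    # Structured but non-relational (NoSQL) data.
    return "Cloud Datastore"


print(suggest_storage(Workload(relational=True, horizontal_scale=True)))
```

The order of the checks is a design choice of this sketch: analytics is tested first so that loosely structured analytical data still lands on Bigtable rather than Cloud Storage.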

Scaling Data Analysis

(Transformational use cases)

Datalab

Datastore

Bigtable (fast random access, tradeoffs between consistency and availability)

BigQuery (query petabytes in seconds)

TensorFlow (distributed in the cloud over very large data sets)

Demand forecasting with machine learning

Data Processing Architectures

Pub/Sub (messaging architecture)
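To make "messaging architecture" concrete, here is a toy in-memory publish/subscribe broker. It is not the Cloud Pub/Sub client library and models none of the managed service's behavior; the Broker class, the topic name, and the message are all hypothetical. The point is only that publishers and subscribers know the topic, not each other.

```python
from collections import defaultdict


class Broker:
    """Toy broker: routes each published message to every topic subscriber."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register a callable to be invoked for each message on a topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver a message to all subscribers; return how many received it."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


broker = Broker()
tracking_log = []   # first consumer: keeps every tracking message
late_alerts = []    # second consumer: keeps only messages flagged as late

broker.subscribe("tracking", tracking_log.append)
broker.subscribe("tracking", lambda m: late_alerts.append(m) if m["late"] else None)

delivered = broker.publish("tracking", {"shipment": "A1", "late": True})
```

Adding a third consumer later only requires another subscribe call; the publisher code does not change.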

Dataflow (a way to run code that processes streaming and batch data in similar ways)
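The idea of processing streaming and batch data in similar ways can be sketched in plain Python: one transform written against a generic iterable runs unchanged on a finite list (batch) and on a generator standing in for a stream. This is not the Dataflow or Apache Beam SDK; the record format "shipment_id,minutes_late", the 30-minute threshold, and all names are assumptions made for illustration.

```python
def parse_event(record):
    """Split a hypothetical 'shipment_id,minutes_late' record into its parts."""
    shipment_id, minutes_late = record.split(",")
    return shipment_id, int(minutes_late)


def late_shipments(records, threshold=30):
    """Yield IDs of shipments later than the threshold, one record at a time.

    Because it only iterates, it works on bounded and unbounded sources alike.
    """
    for record in records:
        shipment_id, minutes_late = parse_event(record)
        if minutes_late > threshold:
            yield shipment_id


# Batch: a bounded collection that is fully available up front.
batch = ["A1,5", "B2,45", "C3,90"]


def stream():
    """Stand-in for a streaming source; a real one would not end."""
    yield "D4,10"
    yield "E5,31"


batch_result = list(late_shipments(batch))
stream_result = list(late_shipments(stream()))
```

The transform never asks for the length of its input or reads it twice, which is what lets the same code serve both cases in this sketch.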

Flags

@@ Line 6: / Line 6: @@
 * Tutorials/Guides/Resources for all of Google Cloud: https://cloud.google.com/solutions/
-==Goals and Motivation==
+==Case Study==
-Goals:
+The [[GCDEC]] page gives an example of a case study that can be used to see how different parts of the Google Cloud platform come together in the kind of scenario a real company might face. The case study focuses on a logistics company that delivers packages and tracks the deliveries with servers, software, and other infrastructure already in-place. The company's goal is to improve their computational infrastructure by moving parts of it to the cloud, and implement the ability to predict late shipments.
-* Implement real-time inventory tracking system that tracks locations
-* Perform data analytics on order and shipment logs (structured/unstructured data) to make decisions about deploying resources, targeting customers, and expanding into markets
-* Predict delays in shipments
-Requirements:
+[[Google Cloud/Case Study]]
-* Reliable, reproducible environment that scales
-* Aggregated data in centralized data lake
-* Historical data used to perform predictive analytics on future shipments
-* Accurate tracking of worldwide shipments (proprietary technology)
-* Improvement of business agility and speed of innovation via rapid provisioning of new resources
-* Analysis and optimization for performance in the cloud
-* Migration to cloud, if all other requirements met
-Deeper reasoning:
+==Google Cloud Services==
-* Inability to upgrade infrastructure hampering growth and efficiency
-* Ineffective at moving data around
-* Need to better understand where/who customers are, what they are shipping
-* IT is too busy managing infrastructure to organize data/build analytics/implement tracking technology
-* Penalties for late shipments and deliveries translates into direct correlation between profitability and bottom line
-==Technology Stack==
+Notes on all of the various parts of the Google Cloud platform and the services available on it.
-Databases:
+===Introduction===
-* SQL DB storing user data, static data
-* [[Cassandra]] DB storing metadata, tracking messages
-* [[Kafka]] servers tracking message aggregation and batch insert
-Applications:
+Google Cloud for Big Data
-* Customer frontend, middleware for orders and customs
+* MapReduce - can use Dataflow
-* [[Tomcat]] for Java services
+* Spark - can use Dataproc
-* [[Nginx]] for static content
+* BigQuery
-* Batch servers (?)
-Storage:
+Usage scenarios
-* iSCSI (internet small-computer-system interface) to manage VM hosts
-* Fiber channel network for SQL server storage
-* NAS (network attached storage) for image storage, logs, and backups
-Analytics:
+===Foundations===
-* [[Hadoop]]/[[Spark]] servers
-* Core data lake
-* Data analysis workloads
-Miscellaneous servers:
+Compute and Storage
-* [[Jenkins]]
-* Monitoring of servers
-* Bastion hosts
-* Security scanners
-* Billing software
-==Using Google Cloud==
+Data ingestion
-Databases:
+Data storage
-* MySQL: Google Cloud offers the Cloud SQL service, and you can allocate a specific compute instance to run a MySQL (or Postgresql) server.
-** See [[MySQL]]
-** See [[Google Cloud/MySQL]]
-* Cassandra: Google Cloud Launcher has several pre-configured solutions for different packages, including one for Cassandra.
-** See [[Cassandra]]
-** See [[Google Cloud/Cassandra]]
-* Kafka: as with Cassandra, preconfigured Kafka instances are available through the Google Cloud Launcher.
-** See [[Kafka]]
-** See [[Google Cloud/Kafka]]
+Federated analysis
-Note: there is a huge list of all possible Google Cloud products to help figure out what products are used for what technologies.
+Compute engine
-List of Google Cloud products: https://cloud.google.com/products/
+Cloud storage
-List of Google Cloud Launcher preconfigured machines: https://console.cloud.google.com/launcher
+===Data Analytics===
+Cloud SQL - relational database
+Dataproc for machine learning
+BigTop ecosystem:
+* Pig
+* Spark
+* Hive
+* Hadoop
+===Data Storage===
+Choosing a storage option: https://cloud.google.com/storage-options/
+Data warehousing:
+* Bigtable - low-latency and updatable data warehouse solution, data is not highly structurable, no need to support ACID transactions
+* BigQuery - petabyte-scale, structured, column-major, SQL-queryable data warehouse solution
+Data storage:
+* Cloud Storage - unstructured data (documents, sound files, PDFs, etc etc)
+* Cloud Datastore - non-relational (NoSQL), highly scalable storage solution; SQL-like query language; more restrictive queries (b/c optimized to be faster); supports ACID transactions
+* Cloud SQL - full SQL support and online transaction processing (OLTP) system
+* Cloud Spanner - (horizontally sharded SQL) fully managed mission-critical relational OLTP database that can scale horizontally to hundreds or thousands of servers to handle high workload transactions; supports ACID transactions
+Writeup of Spanner: https://quizlet.com/blog/quizlet-cloud-spanner
+===Scaling Data Analysis===
+(Transformational use cases)
+Datalab
+Datastore
+BigTable (fast random access, tradeoffs between consistency and availability)
+BigQuery (query petabytes in seconds)
+TensorFlow (distributed in the cloud over very large data sets)
+Demand forecasting with machine learning
+===Data Processing Architectures===
+PubSub (messaging architecture)
+Dataflow (way to execute code that processes streaming and batch data in similar ways)
+=Flags=
 [[Category:Google Cloud]]
+[[Category:Data Engineering]]

Google Cloud: Difference between revisions

From charlesreid1