Revision as of 21:34, 13 October 2017

Data Engineering Scenarios

In line with the data-engineering-scenarios GitHub organization that I created (https://github.com/data-engineering-scenarios), this page contains notes on different scenarios, both finished and planned.

These scenarios focus on different technologies available via Google Cloud or Amazon Web Services. Roughly, they can be grouped as follows:

Dataproc

Dataproc Technologies

This is the "classic" big data technology - distributed computing on clusters.

Google Cloud product:

Dataproc - allocate clusters, run jobs

Amazon product:

Amazon EMR (Elastic MapReduce, running on EC2 instances) - allocate clusters, run jobs

Hadoop ecosystem:

Hadoop - the big data technology that started it all; processes data in parallel across cluster nodes using the MapReduce framework
Pig - works with Hadoop; a higher-level scripting language (Pig Latin) that compiles to MapReduce jobs, so jobs take far less code to write
Hive - data warehouse that sits on Hadoop; gives an SQL-like interface (HiveQL) for querying data. (Queries are compiled into MapReduce jobs)
HBase - Java-based non-relational database, modeled on Google's Bigtable; runs on top of Hadoop (HDFS), can serve as a source/sink for MapReduce jobs, and is a column-oriented key-value store; no native SQL queries - access is through its client API or MapReduce
Phoenix - an SQL layer over HBase; lets you query the non-relational, non-SQL store with standard SQL
Parquet - column-oriented file format used across the Hadoop ecosystem
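
To make the MapReduce model behind these tools concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases for a word count. It is plain Python, not Hadoop code; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and a real cluster would run the map and reduce steps in parallel on different nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map -> shuffle -> reduce over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big clusters", "data on clusters"])
```

Pig and Hive can be thought of as generating this kind of mapper/reducer pair from a script or a query, so it does not have to be written by hand.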

Spark technologies:

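Spark - cluster computing engine similar to Hadoop MapReduce, but it keeps intermediate data in memory, which makes iterative computation much more efficient
PySpark - Python API for Spark (Spark itself is written in Scala and runs on the JVM)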
SparkSQL - allows SQL queries in Spark programs, e.g., running an SQL query on Hive, and passing the results to Spark computations

Dataproc Scenario

The scenario here is dataproc-spark-kmeans-images-bigquery

Link: https://github.com/data-engineering-scenarios/dataproc-spark-kmeans-images-bigquery

This allocates a Dataproc cluster and runs a Spark job on it that downloads images, extracts k-means color clusters from each image, and pushes the results to BigQuery.
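
The color-clustering step can be sketched in a few lines of plain Python. This is not the code from the repository (which runs on Spark); it is a minimal illustration of k-means (Lloyd's algorithm) applied to RGB pixels, with hypothetical function names and a deterministic seeding rule chosen only to keep the demo reproducible.

```python
def closest(pixel, centers):
    """Index of the center nearest to pixel (squared Euclidean distance in RGB)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(pixel, centers[i])),
    )


def kmeans_colors(pixels, k, iterations=10):
    """Lloyd's algorithm on a list of (r, g, b) tuples; returns k center colors."""
    # Assumption: seed the centers with evenly spaced pixels (deterministic demo).
    step = max(len(pixels) // k, 1)
    centers = [pixels[(i * step) % len(pixels)] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: put every pixel in the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for pixel in pixels:
            clusters[closest(pixel, centers)].append(pixel)
        # Update step: move each center to the mean of its members
        # (an empty cluster keeps its previous center).
        centers = [
            tuple(sum(channel) / len(members) for channel in zip(*members))
            if members else center
            for members, center in zip(clusters, centers)
        ]
    return centers


# Four pixels: two reddish, two bluish.
pixels = [(250, 10, 10), (240, 20, 5), (10, 10, 245), (5, 20, 250)]
dominant = sorted(kmeans_colors(pixels, k=2))
```

In the scenario, each image would contribute its k center colors as rows to BigQuery; here the result is just a list of two averaged colors, one per group of pixels.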

Dataflow

Dataflow Technologies

Google Cloud product:

Dataflow - building data processing pipelines for transforming streams, with sources/sinks
Pub/Sub - streaming events and messaging (message ordering is not guaranteed)
Difference - Pub/Sub is a messaging service; it is just one of many possible sources/sinks for a Dataflow pipeline
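
The distinction between the messaging service and the pipeline can be shown with a toy sketch. The Topic class and run_pipeline function below are hypothetical in-memory stand-ins, not the Pub/Sub or Dataflow APIs; the point is only that the pipeline accepts any iterable source, and a message topic is one such source alongside, say, lines read from a file.

```python
class Topic:
    """Toy in-memory stand-in for a messaging topic (not the Pub/Sub API)."""

    def __init__(self):
        self._queue = []

    def publish(self, message):
        """Append a message to the pending queue."""
        self._queue.append(message)

    def pull(self):
        """Drain and return all pending messages."""
        messages, self._queue = self._queue, []
        return messages


def run_pipeline(source, transform, sink):
    """Read every record from any iterable source, transform it, append to sink."""
    for record in source:
        sink.append(transform(record))


topic = Topic()
topic.publish("click")
topic.publish("view")
file_lines = ["purchase"]  # a second, non-messaging source

sink = []
run_pipeline(topic.pull(), str.upper, sink)  # messaging source
run_pipeline(file_lines, str.upper, sink)    # file-like source, same pipeline
```

The same transform runs unchanged over both sources, which is the role Dataflow plays relative to Pub/Sub.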

Amazon product:

Kinesis - collecting and processing streaming data in real time (roughly the AWS counterpart to Pub/Sub plus Dataflow)

Apache projects:

Kafka - publishing and subscribing to message streams, stream-processing, and storage of messages in fault-tolerant clusters

Dataflow Scenarios

Scenario:

Docker pod - generating messages and publishing them to a pipeline
Docker container running a collector (unstructured/nosql)
Docker container running a dashboard to visualize the collector database

BigQuery

Query Technologies

Google Cloud products:

BigQuery - serverless data warehouse for running SQL queries over petabyte-scale datasets
Bigtable - large, non-relational (wide-column) databases
Cloud SQL - managed relational SQL databases (MySQL, PostgreSQL) in the cloud

Query Scenarios

Scenario 1: BigQuery examples (working out assembling SQL queries) for open data sets on BigQuery

Link: https://github.com/charlesreid1/sabermetrics-bigquery
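
As a rough idea of what "assembling a SQL query" looks like for this kind of data, the sketch below builds an aggregate query against a tiny made-up batting table. Everything here is an assumption for illustration: the table name, columns, and rows are hypothetical, and sqlite3 stands in for BigQuery so the example runs locally. The query uses only standard SQL constructs (SUM, GROUP BY, HAVING, ORDER BY), but it is not taken from the repository and has not been run against BigQuery.

```python
import sqlite3

# Local stand-in database with a hypothetical batting table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE batting (player TEXT, year INTEGER, at_bats INTEGER, hits INTEGER)"
)
conn.executemany(
    "INSERT INTO batting VALUES (?, ?, ?, ?)",
    [("a", 2015, 400, 120), ("a", 2016, 500, 150), ("b", 2016, 300, 60)],
)

# Career batting average per player, ignoring players with few at-bats.
query = """
    SELECT player,
           SUM(hits) * 1.0 / SUM(at_bats) AS batting_avg
    FROM batting
    GROUP BY player
    HAVING SUM(at_bats) >= 100
    ORDER BY batting_avg DESC
"""
rows = conn.execute(query).fetchall()
conn.close()
```

Each row is a (player, batting_avg) pair, highest average first.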

Scenario 2: Docker-containerized SQL database, jupyter notebook, for neural network training

Link: https://github.com/data-engineering-scenarios/kaggle-sql-jupyter-keras

Scenario 3: BigQuery as source/sink for images in dataproc-spark-kmeans-images-bigquery

Link: https://github.com/data-engineering-scenarios/dataproc-spark-kmeans-images-bigquery

Machine Learning

Machine Learning/NN Technologies

Google Cloud:

Cloud ML APIs - using packaged/bundled API calls for machine learning
Cloud ML Engine - training TensorFlow models in the cloud with elastic cluster sizes
Compute Engine - scaling workflows to large data sets "by hand"
(Integration of larger data stores, e.g., BigQuery/Cloud Storage, with ML training)

Software:

Keras
TensorFlow
Sonnet
Theano
MXNet
CNTK
Caffe
etc.

Machine Learning/NN Scenarios

Scenario 1: SQL data in a Docker container, training a Keras neural network model

Link: https://github.com/data-engineering-scenarios/kaggle-sql-jupyter-keras

Other scenarios:

Template: JS frontend, Flask glue, Keras backend
Out of memory learning - facial recognition - large image data sets - fuel and kerosene - trained "in your face" model - web frontend, Flask glue, Keras backend (two photos of two people, see if they are identical)
Inception model, nice-and-easy, pass an image to the inception model, display a chart with predictions and certainty

Neural network templates/examples:

Basic neural network architectures
CNN
RNN
GRU
LSTM

Incorporation of each type with various scenarios:

CNN for image-processing and OOM training for large data sets - incorporate with cloud storage scenario
RNN/LSTM for time series prediction - incorporate with messaging/Kafka

Classic Machine Learning

Classic Machine Learning Technologies

Scikit:

scikit-learn
sklearn-pandas

GCDEC

Working through the Google Cloud Data Engineer certification course... See GCDEC for pages related to that.


From charlesreid1