From charlesreid1

Revision as of 22:25, 13 October 2017

Data Engineering Scenarios

In line with the data-engineering-scenarios GitHub organization that I created (https://github.com/data-engineering-scenarios), this page contains notes on different scenarios, both finished and planned.

These scenarios focus on different technologies available via Google Cloud or Amazon Web Services. Roughly, they can be grouped as follows:

Dataproc

Dataproc Technologies

This is the "classic" big data technology - distributed computing on clusters.

Google Cloud product:

  • Dataproc - allocate clusters, run jobs

Amazon product:

  • Amazon EC2 - allocate instances to build clusters, run jobs (Amazon EMR is the managed Hadoop/Spark service closest to Dataproc)

Hadoop ecosystem:

  • Hadoop - the big data technology that started it all; processes data in parallel across nodes using the MapReduce framework
  • Pig - works with Hadoop; higher-level scripting language that shortens Hadoop jobs
  • Hive - data warehouse that sits on Hadoop; gives an SQL-like interface to query data. (SQL queries are implemented in MapReduce)
  • HBase - Java software for non-relational databases, analogous to Google's BigTable; runs on Hadoop, can serve as source/sink for MapReduce jobs, is a column-oriented key-value store; no SQL queries - access is through the HBase API (get/put/scan) or MapReduce
  • Phoenix - SQL layer on top of HBase; turns HBase (non-relational, non-SQL database) into an SQL-like data store
  • Parquet - column-oriented file format for table storage in the Hadoop ecosystem
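To make the MapReduce idea in the list above concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases for a word count. It uses only the Python standard library and does not call any Hadoop API; the function names are made up for illustration. On a real cluster the framework runs the mappers and reducers in parallel on different nodes and handles the shuffle itself.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key (done by the framework on a real cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map -> shuffle -> reduce sequentially over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big clusters", "data pipelines"]))
```

Pig and Hive generate jobs with this same shape from a script or an SQL-like query, which is why they shorten hand-written Hadoop jobs.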

Spark technologies:

  • Spark - similar to Hadoop, but keeps intermediate data in memory where possible, which makes iterative computation more efficient
  • PySpark - Python bindings for Spark (which is written in Scala and runs on the JVM)
  • Spark SQL - allows SQL queries in Spark programs, e.g., running an SQL query on Hive, and passing the results to Spark computations

Dataproc Scenario

The scenario here is dataproc-spark-kmeans-images-bigquery

Link: https://github.com/data-engineering-scenarios/dataproc-spark-kmeans-images-bigquery

This allocates a Dataproc cluster and runs a Spark job on the cluster that downloads images, extracts k-means color clusters from each image, and pushes the results to BigQuery.
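The color-clustering step can be sketched in plain Python as follows. This is not the code from the linked repository; it is a minimal version of Lloyd's k-means algorithm over a list of (r, g, b) tuples, with a simple deterministic initialization chosen here for illustration. The function names are hypothetical, and the Spark job would run something like this per image before writing the centroids to BigQuery.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two color tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_color(pixels):
    """Channel-wise mean of a non-empty list of color tuples."""
    n = len(pixels)
    return tuple(sum(channel) / n for channel in zip(*pixels))


def kmeans_colors(pixels, k, iterations=10):
    """Return k centroid colors for a list of (r, g, b) tuples."""
    # Assumption: initialize with the first k distinct colors (deterministic).
    centroids = []
    for pixel in pixels:
        if pixel not in centroids:
            centroids.append(pixel)
        if len(centroids) == k:
            break
    if len(centroids) < k:
        raise ValueError("image has fewer than k distinct colors")

    for _ in range(iterations):
        # Assignment step: attach each pixel to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for pixel in pixels:
            nearest = min(range(k), key=lambda i: squared_distance(pixel, centroids[i]))
            clusters[nearest].append(pixel)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            mean_color(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids


if __name__ == "__main__":
    sample = [(250, 0, 0), (0, 0, 250), (255, 5, 5), (5, 5, 255)]
    print(kmeans_colors(sample, k=2))
```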

Dataflow

Dataflow Technologies

Google Cloud product:

  • Dataflow - building data processing pipelines for transforming streams, with sources/sinks
  • Pub/Sub - (unordered) streaming events and messaging
  • Difference - Dataflow is the processing pipeline; Pub/Sub is a messaging service that provides JUST ONE OF MANY sources/sinks for Dataflow

Amazon product:

  • Kinesis - streaming data ingestion and processing (roughly the Amazon counterpart to Pub/Sub plus Dataflow)

Apache projects:

  • Kafka - publishing and subscribing to message streams, stream-processing, and storage of messages in fault-tolerant clusters
  • Avro - a data serialization system; turns rich data structures into streams of binary data that can be easily passed around; uses dynamic typing (no code generation required - reading and writing are driven by the schema); smaller serialization size (field names and types are not repeated in each record - the schema is stored once, alongside the data)
  • Thrift - provides an interface definition language and code generation so programs written in different languages can pass data between them (data and service interfaces)
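The point about serialization size in the Avro bullet can be illustrated with a small schema-driven encoder. This is NOT Avro's actual wire format (Avro uses its own variable-length encoding), and it does not use the Avro library; it is a sketch with the standard `struct` module, and the schema and field names are hypothetical. It shows that when writer and reader share a schema, each record can be written as bare values with no field names or type tags.

```python
import struct

# Hypothetical schema: ordered (field name, type) pairs shared by writer and reader.
SCHEMA = [("user_id", "long"), ("score", "double"), ("name", "string")]


def encode(record, schema):
    """Write field values in schema order; no field names or type tags are written."""
    out = bytearray()
    for field, kind in schema:
        value = record[field]
        if kind == "long":
            out += struct.pack("<q", value)
        elif kind == "double":
            out += struct.pack("<d", value)
        elif kind == "string":
            data = value.encode("utf-8")
            out += struct.pack("<I", len(data)) + data  # length prefix, then bytes
        else:
            raise ValueError(f"unsupported type: {kind}")
    return bytes(out)


def decode(payload, schema):
    """Read values back using the schema to know each field's name, type, and order."""
    record, offset = {}, 0
    for field, kind in schema:
        if kind == "long":
            (record[field],) = struct.unpack_from("<q", payload, offset)
            offset += 8
        elif kind == "double":
            (record[field],) = struct.unpack_from("<d", payload, offset)
            offset += 8
        elif kind == "string":
            (length,) = struct.unpack_from("<I", payload, offset)
            offset += 4
            record[field] = payload[offset:offset + length].decode("utf-8")
            offset += length
        else:
            raise ValueError(f"unsupported type: {kind}")
    return record


if __name__ == "__main__":
    payload = encode({"user_id": 42, "score": 0.5, "name": "ada"}, SCHEMA)
    print(len(payload), decode(payload, SCHEMA))
```

The same record as JSON would repeat the three field names in every message; here they live only in the schema.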

Dataflow Scenarios

Scenario:

  • Docker pod - generating messages and publishing them to a pipeline
  • Docker container running a collector (unstructured/nosql)
  • Docker container running a dashboard to visualize the collector database
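The three pieces of this scenario can be prototyped in a single process before splitting them into containers. The sketch below is an assumption-laden stand-in: a standard-library `queue.Queue` plays the role of the message pipeline (Pub/Sub, Kafka, etc.), a Python list plays the role of the NoSQL collector database, and all function names are hypothetical.

```python
import json
import queue
import threading


def publisher(topic, n_messages):
    """Generate JSON messages and publish them to the topic."""
    for i in range(n_messages):
        topic.put(json.dumps({"id": i, "value": i * i}))
    topic.put(None)  # sentinel: no more messages


def collector(topic, store):
    """Consume messages and save them as unstructured documents."""
    while True:
        raw = topic.get()
        if raw is None:
            break
        store.append(json.loads(raw))


def dashboard_summary(store):
    """Compute what a dashboard might display from the collector's store."""
    values = [doc["value"] for doc in store]
    return {"count": len(values), "max": max(values) if values else None}


def run_pipeline(n_messages=5):
    """Run publisher and collector concurrently and return the collected documents."""
    topic = queue.Queue()
    store = []
    workers = [
        threading.Thread(target=publisher, args=(topic, n_messages)),
        threading.Thread(target=collector, args=(topic, store)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return store


if __name__ == "__main__":
    print(dashboard_summary(run_pipeline()))
```

Swapping the queue for a real messaging client only changes `publisher` and `collector`; the dashboard reads from the store either way.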

Query

Query Technologies

Google Cloud products:

  • BigQuery - SQL queries over petabyte-scale datasets
  • Bigtable - large, non-relational databases
  • Cloud SQL - managed, scalable SQL databases in the cloud

Query Scenarios

Scenario 1: BigQuery examples (working out assembling SQL queries) for open data sets on BigQuery

Link: https://github.com/charlesreid1/sabermetrics-bigquery

Scenario 2: Docker-containerized SQL database and Jupyter notebook, for neural network training

Link: https://github.com/data-engineering-scenarios/kaggle-sql-jupyter-keras

Scenario 3: BigQuery as source/sink for images in dataproc-spark-kmeans-images-bigquery

Link: https://github.com/data-engineering-scenarios/dataproc-spark-kmeans-images-bigquery

Classic Machine Learning

Classic Machine Learning Technologies

Scikit:

  • scikit-learn
  • sklearn-pandas

Pandas

  • join, merge, groupby, shift, time series analysis

Seaborn

  • Linear regression
  • Basic plot types

Image analysis:

  • OpenCV (object and face detection)

Classic Machine Learning Scenarios

Time series for messaging services - logs and traffic, outlier detection, publishing messages when anomalies are detected
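One possible shape for this scenario is sketched below, using only the standard library. It flags points whose z-score against a trailing window exceeds a threshold, and hands each anomaly to a `publish` callback. The window size, threshold, and function names are assumptions for illustration; a real version would likely use pandas rolling windows and a messaging client in place of the callback.

```python
from statistics import mean, pstdev


def detect_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value is more than `threshold` standard
    deviations away from the mean of the preceding `window` values."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as an anomaly.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


def publish_alerts(values, publish, **kwargs):
    """Call publish(message) once per detected anomaly."""
    for i in detect_anomalies(values, **kwargs):
        publish({"index": i, "value": values[i]})


if __name__ == "__main__":
    traffic = [10, 11, 9, 10, 12, 11, 10, 95, 11, 10]
    publish_alerts(traffic, print)
```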


Neural Network Machine Learning

Neural Network Machine Learning Technologies

Google Cloud:

  • Cloud ML APIs - using packaged/bundled API calls for machine learning.
  • Cloud ML Engine - training TensorFlow models in the cloud with elastic cluster sizes
  • Compute Engine - scaling workflows to large data sets "by hand"
  • (Integration of larger data stores, e.g., BigQuery/Cloud Storage, with ML training)

Software:

  • Keras
  • TensorFlow
  • Sonnet
  • Theano
  • MXNet
  • etc.

Goals?

  • Predictive analytics
  • Creating business value from unstructured/very large/unanalyzed data sets

Neural Network Machine Learning Scenarios

Scenario 1: SQL data in a Docker container, training a Keras neural network model

Link: https://github.com/data-engineering-scenarios/kaggle-sql-jupyter-keras

Scenario notes:

  • Don't reinvent the wheel, use pre-trained models and APIs
  • Cover different challenges: OOM and large training sets, Fuel/Kerosene and other helper libraries, HDF5 compression/storage, sparse events or large feature sets
  • Scenario template: JS frontend, Flask glue, Keras/other Python backend

Scenario ideas:

  • Pre-trained image recognition model, wrap front-end with graphs to show data, objects detected, etc.
  • Model trained on face differences: upload two faces, get a prediction.

GCDEC

Working through the Google Cloud Data Engineer certification course... See GCDEC for pages related to that.

