Data Engineering
From charlesreid1
Data Engineering Scenarios
In line with the data-engineering-scenarios GitHub organization that I created (https://github.com/data-engineering-scenarios), this page contains notes on different scenarios - both finished and planned.
These scenarios focus on different technologies available via Google Cloud or Amazon Web Services. Roughly, they can be grouped as follows:
Dataproc
Dataproc Technologies
This is the "classic" big data technology - distributed computing on clusters.
Google Cloud product:
- Dataproc - allocate clusters, run jobs
Amazon product:
- Amazon EMR - allocate managed Hadoop/Spark clusters (running on EC2 instances), run jobs
Hadoop ecosystem:
- Hadoop - the big data technology that started it all; processing data in parallel on nodes using the MapReduce framework
- Pig - works with Hadoop; higher-level scripting language (Pig Latin) that compiles to MapReduce jobs, so Hadoop jobs take far less code
- Hive - data warehouse that sits on Hadoop; gives an SQL-like interface (HiveQL) to query data. (Queries are compiled into MapReduce jobs)
- HBase - non-relational database written in Java, modeled on Google's BigTable; runs on Hadoop (HDFS), can serve as source/sink for MapReduce jobs, is a column-oriented key-value store; no native SQL queries - access is through its own API or MapReduce
- Phoenix - adds an SQL layer on top of HBase, so the non-relational, non-SQL store can be queried with SQL
- Parquet - columnar file format for tables stored on Hadoop
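The MapReduce model mentioned in the Hadoop entry can be illustrated without a cluster. The following is a minimal single-process sketch, assuming a word-count job; the function names (map_phase, shuffle_phase, reduce_phase) are made up for illustration and are not Hadoop APIs.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    # On a real cluster each block of lines would be mapped on a different
    # node; here the "nodes" are just iterations of a generator.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    print(word_count(["big data big clusters", "data on clusters"]))
```

Pig and Hive scripts ultimately describe jobs of this shape; the frameworks take care of distributing the map and reduce steps across nodes.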
Spark technologies:
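- Spark - cluster computing engine similar to Hadoop MapReduce, but keeps intermediate data in memory where possible, which makes iterative computations faster
- PySpark - Python API for Spark (Spark itself is written in Scala and runs on the JVM)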
- SparkSQL - allows SQL queries in Spark programs, e.g., running an SQL query on Hive, and passing the results to Spark computations
Dataproc Scenario
The scenario here is dataproc-spark-kmeans-images-bigquery
Link: https://github.com/data-engineering-scenarios/dataproc-spark-kmeans-images-bigquery
This allocates a Dataproc cluster and runs a Spark job on it that downloads images, extracts k-means color clusters from each image, and pushes the results to BigQuery.
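The color-clustering step can be sketched in plain Python. This is not the code from the repository above: it is a minimal illustration, assuming each image has already been reduced to a list of (R, G, B) tuples, and it uses a deterministic farthest-point initialization as a simple stand-in for random or k-means++ seeding. All function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two RGB tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_color(pixels):
    """Component-wise mean of a non-empty list of RGB tuples."""
    n = len(pixels)
    return tuple(sum(p[i] for p in pixels) / n for i in range(3))


def farthest_point_init(pixels, k):
    """Pick k starting centers: the first pixel, then repeatedly the pixel
    farthest from all centers chosen so far (assumes k <= len(pixels))."""
    centers = [pixels[0]]
    while len(centers) < k:
        centers.append(
            max(pixels, key=lambda p: min(squared_distance(p, c) for c in centers))
        )
    return centers


def kmeans_colors(pixels, k, iterations=20):
    """Return k cluster centers (dominant colors) for a list of RGB pixels."""
    centers = farthest_point_init(pixels, k)
    for _ in range(iterations):
        # Assignment step: attach every pixel to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in pixels:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            mean_color(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers
```

In the Spark version of this idea, each image would be one record of a distributed dataset and a function like kmeans_colors would be applied per record, with the resulting centers written out as rows.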
Dataflow
Dataflow Technologies
Google Cloud product:
- Dataflow - building data processing pipelines for transforming streams, with sources/sinks
- PubSub - (unordered) streaming events and messaging
- Difference - PubSub is a messaging service that provides JUST ONE OF MANY sources/sinks for Dataflow
Amazon product:
- Kinesis - streaming data ingestion and processing; Kinesis Data Streams plays roughly the role that PubSub plays on Google Cloud
Apache projects:
- Kafka - publishing and subscribing to message streams, stream-processing, and storage of messages in fault-tolerant clusters
- Avro - a data serialization system; turns rich data structures into compact binary data that can be easily passed around; uses dynamic typing (no code generation required - reading and writing are driven by the schema); smaller serialization size (field names and types don't travel with each record - the schema is stored once alongside the data)
- Thrift - provides an interface definition language and code generation so that programs written in different languages can pass data between them and call each other's services (data and service interfaces)
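The point in the Avro entry about serialization size can be shown with a toy format. The sketch below is not Avro's wire format or API: it assumes a made-up container layout (a length-prefixed JSON schema header followed by fixed-width records packed with the struct module) purely to show the schema being written once while each record carries only values. The SCHEMA contents and function names are hypothetical.

```python
import json
import struct

# Hypothetical schema: each field is [name, struct format code].
SCHEMA = {"name": "Reading", "fields": [["sensor_id", "i"], ["temperature", "d"]]}


def record_format(schema):
    """Build the struct format for one record ('<' = little-endian, no padding)."""
    return "<" + "".join(code for _, code in schema["fields"])


def encode(schema, records):
    """Write the schema once as a header, then the records as bare values."""
    header = json.dumps(schema).encode("utf-8")
    fmt = record_format(schema)
    names = [name for name, _ in schema["fields"]]
    body = b"".join(struct.pack(fmt, *(rec[n] for n in names)) for rec in records)
    return struct.pack("<I", len(header)) + header + body


def decode(blob):
    """Read the schema from the header and use it to interpret the records."""
    (header_len,) = struct.unpack_from("<I", blob, 0)
    schema = json.loads(blob[4:4 + header_len].decode("utf-8"))
    names = [name for name, _ in schema["fields"]]
    body = blob[4 + header_len:]
    return [
        dict(zip(names, values))
        for values in struct.iter_unpack(record_format(schema), body)
    ]
```

Because the reader recovers field names from the header, the per-record cost here is 12 bytes regardless of how long the field names are.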
Dataflow Scenarios
Scenario:
- Docker pod - generating messages and publishing them to a pipeline
- Docker container running a collector (unstructured/nosql)
- Docker container running a dashboard to visualize the collector database
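The three roles in this scenario (generator, collector, dashboard) can be prototyped in one process before any containers are involved. This is a rough sketch, assuming an in-memory queue.Queue as a stand-in for the real message pipeline (PubSub, Kafka, etc.) and a plain dict as a stand-in for the collector's NoSQL database; the message fields are invented for illustration.

```python
import queue
import threading

STOP = object()  # sentinel telling the collector that the stream has ended


def generator(topic, n_messages):
    """Publisher: push a fixed number of made-up messages onto the topic."""
    for i in range(n_messages):
        topic.put({"id": i, "value": i * i})
    topic.put(STOP)


def collector(topic, store):
    """Subscriber: pull messages off the topic and store them by id."""
    while True:
        message = topic.get()
        if message is STOP:
            break
        store[message["id"]] = message


def dashboard(store):
    """Summarize whatever the collector has stored so far."""
    values = [m["value"] for m in store.values()]
    return {"count": len(values), "max": max(values, default=None)}


def run_pipeline(n_messages=5):
    topic = queue.Queue()  # stand-in for the messaging service
    store = {}             # stand-in for the collector database
    producer = threading.Thread(target=generator, args=(topic, n_messages))
    consumer = threading.Thread(target=collector, args=(topic, store))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return dashboard(store)


if __name__ == "__main__":
    print(run_pipeline())
```

Keeping the three roles as separate functions mirrors the container split: each one could later move into its own container with the queue replaced by a network service.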
Query
Query Technologies
Google Cloud products:
- BigQuery - petabyte-scale datasets
- BigTable - large, non-relational databases
- CloudSQL - elastic, scalable SQL databases in the cloud
Query Scenarios
Scenario 1: BigQuery examples (working out assembling SQL queries) for open data sets on BigQuery
Link: https://github.com/charlesreid1/sabermetrics-bigquery
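As a flavor of the kind of query assembly this scenario involves, here is a small sketch that uses Python's built-in sqlite3 module as a local stand-in for BigQuery. The batting table, its columns, and the function name are hypothetical and are not taken from the repository or from any BigQuery public dataset; BigQuery's SQL dialect also differs from SQLite's in places.

```python
import sqlite3


def top_batting_averages(rows, min_at_bats=100, limit=3):
    """Aggregate per-game rows into per-player batting averages.

    rows: iterable of (player, at_bats, hits) tuples (hypothetical layout).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE batting (player TEXT, at_bats INTEGER, hits INTEGER)")
    conn.executemany("INSERT INTO batting VALUES (?, ?, ?)", rows)
    query = """
        SELECT player,
               SUM(hits) AS hits,
               SUM(at_bats) AS at_bats,
               1.0 * SUM(hits) / SUM(at_bats) AS average
        FROM batting
        GROUP BY player
        HAVING SUM(at_bats) >= ?   -- drop players with too few at-bats
        ORDER BY average DESC
        LIMIT ?
    """
    result = conn.execute(query, (min_at_bats, limit)).fetchall()
    conn.close()
    return result
```

The GROUP BY / HAVING / ORDER BY pattern carries over to most SQL engines; only the table reference and connection code would change.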
Scenario 2: Docker-containerized SQL database, jupyter notebook, for neural network training
Link: https://github.com/data-engineering-scenarios/kaggle-sql-jupyter-keras
Scenario 3: BigQuery as source/sink for images in dataproc-spark-kmeans-images-bigquery
Link: https://github.com/data-engineering-scenarios/dataproc-spark-kmeans-images-bigquery
Machine Learning/NN
Machine Learning/NN Technologies
Google Cloud:
- Cloud ML APIs - using packaged/bundled API calls for machine learning.
- Cloud ML Engine - training TensorFlow models in the cloud with elastic cluster sizes
- Compute Engine - scaling workflows to large data sets "by hand"
- (Integration of larger data stores, e.g., BigQuery/Cloud Storage, with ML training)
Software:
- Keras
- TensorFlow
- Sonnet
- Theano
- MXNet
- CNTK
- Caffe
- etc etc etc
Goals?
- Predictive analytics
- (What does that mean?)
Machine Learning/NN Scenarios
Scenario 1: SQL data in a Docker container, training a Keras neural network model
Link: https://github.com/data-engineering-scenarios/kaggle-sql-jupyter-keras
Scenario notes:
- Scenarios should cover different neural network architectures: CNN, RNN, GRU, LSTM, etc.
- Scenarios should also cover different challenges: OOM training, image processing, fuel/kerosene supporting software for creating (software) pipelines, HDF5 data compression/storage, classification of sparse events, binary vs. multi-class classification, extremely large feature sets
- Scenarios should implement template: JS frontend, Flask glue, Keras/other Python backend
- Scenarios should utilize pre-trained networks when possible
Scenario ideas:
- CNN for image-processing and OOM training for large data sets - incorporate fuel/kerosene pipelines, compression, cloud storage
- RNN/LSTM for time-series prediction and messaging services - incorporate messaging and stream/batch processing to update neural network making a time-series prediction
Classic Machine Learning
Classic Machine Learning Technologies
Scikit:
- scikit-learn
- sklearn-pandas
Pandas
- join, merge, groupby, shift, time series analysis
Seaborn
- Linear regression
- Plot types
Image analysis:
- OpenCV (object and face detection)
GCDEC
Working through the Google Cloud Data Engineer certification course... See GCDEC for pages related to that.