Data Engineering: Difference between revisions
From charlesreid1
| Line 1: | Line 1: | ||
=Overview= | |||
Data engineering - software engineering with an emphasis on dealing with large amounts of data | |||
==What is Data Engineering== | |||
Enable others to answer questions using datasets, within latency constraints | |||
Components: | |||
* Distributed systems | |||
* Parallel processing | |||
* Databases | |||
* Queuing | |||
Purpose? | |||
* Human fault tolerance | |||
* Metrics | |||
* Monitoring | |||
* Multi-tenancy | |||
Example of where you start: | |||
* Searches by keyword/user only | |||
* Basic statistics only | |||
* Using someone else's search engine | |||
Example stack: | |||
* Custom crawlers ingesting data (Gearman) | |||
* Passing data off to custom workers | |||
* Dumping data to MySQL/Sphinx/etc | |||
Problems: | |||
* Inflexibility | |||
* Corruption is highly probable | |||
* High burden on operations | |||
* No scalability | |||
* No fault tolerance | |||
Alternative stack: | |||
* Many collectors dumping to Amazon S3 | |||
* Analysis with Hadoop | |||
* ElephantDB | |||
* Low latency (but lead time of several hours) | |||
* More advanced statistics (influencer of, influenced by) | |||
Data pipeline example: | |||
* Tweets go to S3 | |||
* URLS are normalized | |||
* Each hour, new compute bucket | |||
* Sum by hour and by url | |||
* Emit ElephantDB indices | |||
Another data pipeline example: | |||
* Tweets go to Kafka | |||
* URLs are normalized | |||
* Each hour, new compute bucket | |||
* Update hour/url bucket | |||
* Send data to Cassandra | |||
Clojure example: | |||
* tweet reactor/tweet reaction/tweet reshare/now-secs/interaction/interaction-scores | |||
* serialization of data using thrift | |||
Infrastructure components: | |||
* HDFS - distributed in-memory big data filesystem | |||
* MapReduce - operations on HDFS data | |||
* Kafka - messaging queue (and later, distributed processing on messages) | |||
* Storm - distributed processing | |||
* Spark - distributed, parallelized computation on HDFS data | |||
* Cassandra - scalable database | |||
* HBase - database operating on top of HDFS | |||
* Zookeeper - highly reliable distributed coordination (maintain config info, naming, synchronization, and multiple services) | |||
* ElephantDB - like a NoSQL Hadoop store - key/value data in Hadoop | |||
Multi-tenancy: | |||
* Independent applications on a single cluster | |||
* Topologies should not affect each other | |||
* Topologies should have adequate resources (Apache Mesos) | |||
* When submitting a job, specify resources needed | |||
Data engineering vs data science: | |||
* Data engineers have well-defined problems | |||
* Data scientists need specialized statistical skills | |||
* Data engineers deal with a larger scope - not just analytics | |||
Open source: | |||
* Important for recruitment | |||
* Strong developers want to work where they can be involved in open source | |||
* Popular open-source projects give access to better engineers | |||
* identify good recruits, learn best practices, get help - not "free work" | |||
Ideal data engineers: | |||
* Strong software engineering skills | |||
* Abstraction | |||
* Testing | |||
* Version control | |||
* Refactoring | |||
* Strong software engineering skills | |||
* Strong algorithm skills | |||
* Digging into open source code | |||
* Stress testing | |||
Finding strong data engineers: | |||
* Standard "code on a whiteboard" interviews are useless | |||
* Take-home projects to gauge general abilities | |||
* Best: see projects requiring data engineering | |||
==Data Engineering Example: Twitter== | |||
Real time data: | |||
* online queries for a web request | |||
* offline computations with very low latency | |||
* latency and throughput equally important | |||
* Hadoop is too high-latency! | |||
Four data problems at twitter: | |||
* Tweets | |||
* Timelines | |||
* Social graphs | |||
* Search indices | |||
Definitions of data: | |||
* Tweet is primary key id, user id, text, timestamp (and replies) | |||
* Row storage | |||
* Initially: single table vertically scaled | |||
* Initially: master-worker replication (writes to master, replication to workers) | |||
* Initially: Memmcached for reads (rails reads the real database, populates memcached instances) | |||
Issues: | |||
* Scaling isk space: disk arrays > 800 GB problematic | |||
* At 3 trillion tweets, disk space 90% utilized | |||
Solutions: | |||
* Partition: partition by primary key (one cluster holds one set of keys, another holds different) | |||
* Partition: tweet IDs partitioned by user | |||
* Partition: tweets by time | |||
* Try each partition in order, until enough data accumulated | |||
Locality of databases: | |||
* Memcached: primary key lookup is 1 ms | |||
* MySQL: primary key lookup is < 10 ms | |||
* Exploit locality speed by organizing tweets by timestamp (usually only 1 partition checked) | |||
Problems: | |||
* write throughput | |||
* Deadlocks in MySQL (if tweet volume gets crazy) | |||
* Temporal shard creation is manual process | |||
More solutions: | |||
* Cassandra (NoSQL, non-relational) | |||
* primary key partition (tweet ID) | |||
* but also, secondary key on user ID | |||
=Data Engineering Scenarios= | =Data Engineering Scenarios= | ||
Revision as of 04:38, 20 October 2017
Overview
Data engineering - software engineering with an emphasis on dealing with large amounts of data
What is Data Engineering
Enable others to answer questions using datasets, within latency constraints
Components:
- Distributed systems
- Parallel processing
- Databases
- Queuing
Purpose?
- Human fault tolerance
- Metrics
- Monitoring
- Multi-tenancy
Example of where you start:
- Searches by keyword/user only
- Basic statistics only
- Using someone else's search engine
Example stack:
- Custom crawlers ingesting data (Gearman)
- Passing data off to custom workers
- Dumping data to MySQL/Sphinx/etc
Problems:
- Inflexibility
- Corruption is highly probable
- High burden on operations
- No scalability
- No fault tolerance
Alternative stack:
- Many collectors dumping to Amazon S3
- Analysis with Hadoop
- ElephantDB
- Low latency (but lead time of several hours)
- More advanced statistics (influencer of, influenced by)
Data pipeline example:
- Tweets go to S3
- URLS are normalized
- Each hour, new compute bucket
- Sum by hour and by url
- Emit ElephantDB indices
Another data pipeline example:
- Tweets go to Kafka
- URLs are normalized
- Each hour, new compute bucket
- Update hour/url bucket
- Send data to Cassandra
Clojure example:
- tweet reactor/tweet reaction/tweet reshare/now-secs/interaction/interaction-scores
- serialization of data using thrift
Infrastructure components:
- HDFS - distributed in-memory big data filesystem
- MapReduce - operations on HDFS data
- Kafka - messaging queue (and later, distributed processing on messages)
- Storm - distributed processing
- Spark - distributed, parallelized computation on HDFS data
- Cassandra - scalable database
- HBase - database operating on top of HDFS
- Zookeeper - highly reliable distributed coordination (maintain config info, naming, synchronization, and multiple services)
- ElephantDB - like a NoSQL Hadoop store - key/value data in Hadoop
Multi-tenancy:
- Independent applications on a single cluster
- Topologies should not affect each other
- Topologies should have adequate resources (Apache Mesos)
- When submitting a job, specify resources needed
Data engineering vs data science:
- Data engineers have well-defined problems
- Data scientists need specialized statistical skills
- Data engineers deal with a larger scope - not just analytics
Open source:
- Important for recruitment
- Strong developers want to work where they can be involved in open source
- Popular open-source projects give access to better engineers
- identify good recruits, learn best practices, get help - not "free work"
Ideal data engineers:
- Strong software engineering skills
- Abstraction
- Testing
- Version control
- Refactoring
- Strong software engineering skills
- Strong algorithm skills
- Digging into open source code
- Stress testing
Finding strong data engineers:
- Standard "code on a whiteboard" interviews are useless
- Take-home projects to gauge general abilities
- Best: see projects requiring data engineering
Data Engineering Example: Twitter
Real time data:
- online queries for a web request
- offline computations with very low latency
- latency and throughput equally important
- Hadoop is too high-latency!
Four data problems at twitter:
- Tweets
- Timelines
- Social graphs
- Search indices
Definitions of data:
- Tweet is primary key id, user id, text, timestamp (and replies)
- Row storage
- Initially: single table vertically scaled
- Initially: master-worker replication (writes to master, replication to workers)
- Initially: Memmcached for reads (rails reads the real database, populates memcached instances)
Issues:
- Scaling isk space: disk arrays > 800 GB problematic
- At 3 trillion tweets, disk space 90% utilized
Solutions:
- Partition: partition by primary key (one cluster holds one set of keys, another holds different)
- Partition: tweet IDs partitioned by user
- Partition: tweets by time
- Try each partition in order, until enough data accumulated
Locality of databases:
- Memcached: primary key lookup is 1 ms
- MySQL: primary key lookup is < 10 ms
- Exploit locality speed by organizing tweets by timestamp (usually only 1 partition checked)
Problems:
- write throughput
- Deadlocks in MySQL (if tweet volume gets crazy)
- Temporal shard creation is manual process
More solutions:
- Cassandra (NoSQL, non-relational)
- primary key partition (tweet ID)
- but also, secondary key on user ID
Data Engineering Scenarios
In line with the data-engineering-scenarios Github organization that I created (https://github.com/data-engineering-scenarios), this page will contain notes on different scenarios - both finished and planned.
These scenarios focus on different technologies available via Google Cloud or Amazon Web Services. Roughly, they can be grouped as follows:
Dataproc
Dataproc Technologies
This is the "classic" big data technology - distributed computing on clusters.
Google Cloud product:
- Dataproc - allocate clusters, run jobs
Amazon product:
- Amazon EC2 - allocate clusters, run jobs
Hadoop ecosystem:
- Hadoop - the big data technology that started it all; processing data in parallel on nodes using MapReduce framework
- Pig - works with Hadoop; higher-level scripting language that shortens Hadoop jobs
- Hive - data warehouse that sits on Hadoop (or Pig); gives SQL-like interface to query data. (SQL queries are implemented in MapReduce)
- HBase - Java software for non-relational databases, analogous to Google's BigTable; runs on Hadoop, can serve as source/sink for MapReduce queries, is a column-based key store; no SQL queries - MapReduce only
- Phoenix - turns HBase (non-relational, non-SQL database) into an SQL-like data store
- Parquet - column-based table storage that sits on Hadoop
Spark technologies:
- Spark - similar to Hadoop, but more focused on efficient computation
- PySpark - Python bindings for Spark (Java)
- SparkSQL - allows SQL queries in Spark programs, e.g., running an SQL query on Hive, and passing the results to Spark computations
Dataproc Scenario
The scenario here is dataproc-spark-kmeans-images-bigquery
Link: https://github.com/data-engineering-scenarios/dataproc-spark-kmeans-images-bigquery
This gets a Dataproc cluster, and runs a Spark job on the cluster that downloads images, extracts k mean color clusters from the image, and pushes the results to BigQuery.
Dataflow
Dataflow Technologies
Google Cloud product:
- Dataflow - building data processing pipelines for transforming streams, with sources/sinks
- PubSub - (unordered) streaming events and messaging
- Difference - PubSub is a messaging service that provides JUST ONE OF MANY sources/sinks for Dataflow
Amazon product:
- Kinesis - streaming events? messaging?
Apache projects:
- Kafka - publishing and subscribing to message streams, stream-processing, and storage of messages in fault-tolerant clusters
- Avro - a data serialization service; turns rich data structures into streams of binary data that can be easily passed around; uses dynamic typing (no code generated - based on schema); smaller serialization size (info about scheme doesn't travel with the data - but data is stored alongside its schema.)
- Thrift - provides cross-talk language for programs in different languages to pass data between them (data and service interfaces)
Dataflow Scenarios
Scenario:
- Docker pod - generating messages and publishing them to a pipeline
- Docker container running a collector (unstructured/nosql)
- Docker container running a dashboard to visualize the collector database
Query
Query Technologies
Google Cloud products:
- BigQuery - petabyte-scale datasets
- BigTable - large, non-relational databases
- CloudSQL - elastic, scalable SQL databases in the cloud
Query Scenarios
Scenario 1: BigQuery examples (working out assembling SQL queries) for open data sets on BigQuery
Link: https://github.com/charlesreid1/sabermetrics-bigquery
Scenario 2: Docker-containerized SQL database, jupyter notebook, for neural network training
Link: https://github.com/data-engineering-scenarios/kaggle-sql-jupyter-keras
Scenario 3: BigQuery as source/sink for images in dataproc-spark-kmeans-images-bigquery
Link: https://github.com/data-engineering-scenarios/dataproc-spark-kmeans-images-bigquery
Machine Learning
Machine Learning Technologies
Scikit:
- scikit-learn
- sklearn-pandas
Supporting py-data libraries:
- Pandas - join, merge, groupby, shift, time series analysis, SQL to dataframe
- SQLAlchemy - SQL data into Python
- Seaborn - linear regression, basic models, essential plot types
- OpenCV - object and face detection
Classic Machine Learning Scenarios
Scenario ideas:
- Time series for messaging services - logs and traffic, outlier detection, publishing messages when anomalies detected
- Web frontend for OpenCV - bounding boxes where objects found
Neural Network Machine Learning
Neural Network Machine Learning Technologies
Google Cloud:
- Cloud ML APIs - using packaged/bundled API calls for achine learning.
- Cloud ML Engine - training TensorFlow models in the cloud with elastic cluster sizes
- Compute Engine - scaling workflows to large data sets "by hand"
- (Integration of larger data stores, e.g., BigQuery/Cloud Storage, with ML training)
Software:
- Keras
- TensorFlow
- Sonnet
- Theano
- MXNet
- etc etc etc
Goals?
- Predictive analytics
- Creating business value from unstructured/very large/unanalyzed data sets
Neural Network Machine Learning Scenarios
Scenario 1: SQL data in a Docker container, training a Keras neural network model
Link: https://github.com/data-engineering-scenarios/kaggle-sql-jupyter-keras
Scenario notes:
- Don't reinvent the wheel, use pre-trained models and APIs
- Cover different challenges (OOM and large training sets), fuel/kerosene and helper libraries, HDF5 compression/storage, sparse events or large feature sets
- Scenario template: JS frontend, Flask glue, Keras/other Python backend
Scenario ideas:
- Pre-trained image recognition model, prediction of type of object, wrap front-end with graphs to show data, objects detected, etc.
- Trained face differences, upload two faces, give prediction.
GCDEC
Working through the Google Cloud Data Engineer certification course... See GCDEC for pages related to that.