Data Engineering
From charlesreid1
Overview
Data engineering - software engineering with an emphasis on dealing with large amounts of data
What is Data Engineering
Enable others to answer questions using datasets, within latency constraints
Components:
- Distributed systems
- Parallel processing
- Databases
- Queuing
Purpose?
- Human fault tolerance
- Metrics
- Monitoring
- Multi-tenancy
Example of where you start:
- Searches by keyword/user only
- Basic statistics only
- Using someone else's search engine
Example stack:
- Custom crawlers ingesting data (Gearman)
- Passing data off to custom workers
- Dumping data to MySQL/Sphinx/etc
Problems:
- Inflexibility
- Corruption is highly probable
- High burden on operations
- No scalability
- No fault tolerance
Alternative stack:
- Many collectors dumping to Amazon S3
- Analysis with Hadoop
- ElephantDB
- Low latency (but lead time of several hours)
- More advanced statistics (influencer of, influenced by)
Data pipeline example:
- Tweets go to S3
- URLS are normalized
- Each hour, new compute bucket
- Sum by hour and by url
- Emit ElephantDB indices
Another data pipeline example:
- Tweets go to Kafka
- URLs are normalized
- Each hour, new compute bucket
- Update hour/url bucket
- Send data to Cassandra
Clojure example:
- tweet reactor/tweet reaction/tweet reshare/now-secs/interaction/interaction-scores
- serialization of data using thrift
Infrastructure components:
- HDFS - distributed in-memory big data filesystem
- MapReduce - operations on HDFS data
- Kafka - messaging queue (and later, distributed processing on messages)
- Storm - distributed processing
- Spark - distributed, parallelized computation on HDFS data
- Cassandra - scalable database
- HBase - database operating on top of HDFS
- Zookeeper - highly reliable distributed coordination (maintain config info, naming, synchronization, and multiple services)
- ElephantDB - like a NoSQL Hadoop store - key/value data in Hadoop
Multi-tenancy:
- Independent applications on a single cluster
- Topologies should not affect each other
- Topologies should have adequate resources (Apache Mesos)
- When submitting a job, specify resources needed
Data engineering vs data science:
- Data engineers have well-defined problems
- Data scientists need specialized statistical skills
- Data engineers deal with a larger scope - not just analytics
Open source:
- Important for recruitment
- Strong developers want to work where they can be involved in open source
- Popular open-source projects give access to better engineers
- identify good recruits, learn best practices, get help - not "free work"
Ideal data engineers:
- Strong software engineering skills
- Abstraction
- Testing
- Version control
- Refactoring
- Strong software engineering skills
- Strong algorithm skills
- Digging into open source code
- Stress testing
Finding strong data engineers:
- Standard "code on a whiteboard" interviews are useless
- Take-home projects to gauge general abilities
- Best: see projects requiring data engineering
Data Engineering Example: Twitter
Real time data:
- online queries for a web request
- offline computations with very low latency
- latency and throughput equally important
- Hadoop is too high-latency!
Four data problems at twitter:
- Tweets
- Timelines
- Social graphs
- Search indices
Tweets
Definitions of data:
- Tweet is primary key id, user id, text, timestamp (and replies)
- Row storage
- Initially: single table vertically scaled
- Initially: master-worker replication (writes to master, replication to workers)
- Initially: Memmcached for reads (rails reads the real database, populates memcached instances)
Issues:
- Scaling isk space: disk arrays > 800 GB problematic
- At 3 trillion tweets, disk space 90% utilized
Solutions:
- Partition: partition by primary key (one cluster holds one set of keys, another holds different)
- Partition: tweet IDs partitioned by user
- Partition: tweets by time
- Try each partition in order, until enough data accumulated
Locality of databases:
- Memcached: primary key lookup is 1 ms
- MySQL: primary key lookup is < 10 ms
- Exploit locality speed by organizing tweets by timestamp (usually only 1 partition checked)
Problems:
- write throughput
- Deadlocks in MySQL (if tweet volume gets crazy)
- Temporal shard creation is manual process
More solutions:
- Cassandra (NoSQL, non-relational)
- primary key partition (tweet ID)
- but also, secondary key on user ID
Timelines
Definitions:
- Timeline is series of tweet ids
- Query pattern: organized by user ID
- Operations: append, merge, truncate
- high velocity bounded vector
- Space-based
Primitive approach: SQL query:
- Use the following SQL query (below)
- Bingo, you have your subquery of followers who are tweeting at anyone, passed into the search for tweets
- BUT - what happens if you have lots of friends, or if the number of source tweet IDs cant fit in RAM?
SELECT * FROM tweets WHERE user_id IN (SELECT source_id FROM followers WHERE destination_id = ?) ORDER BY created_at DESC LIMIT 20
offline vs online computation:
- Sequences can be stored in memcached (individual timelines)
- You pass a status to Fanout
- Fanout is offline, but has a low-latency SLlA
- Truncate at random intervals, ensuring bounded length
- What to do on a cache miss?
- Merge the user timelines
Stats:
- 2008: 30 TPS, 120 TPS peak
- 2010: 700 TPS, 2,000 TPS peak
- 1.2M deliveries per second
Memory hierarchy:
- Possibilities:
- Fanout to disk (Lots of IOPS required, even iwth buffering; cost of rebuilding data from other data stores is reasonable; fanout to memory)
Principles:
- Offline vs online computation
- Some problems can be pre-computed (if amt of work bounded, and queyr pattern limtied)
- Must keep memory hierarchy in mind
- Efficiency of system includes cost of generating data from another source times probability of needing to
Social Graphs
Who follows you? Who do you follow? Who have you blocked, etc.
Operations:
- Enumerate by time
- Set operations: intersection, difference, union
- Inclusion, cardinality
Spam problems:
- Need mass delete ability
Temporal enumeration:
- Who you followed, listed by when you followed them
- Inclusion: do they follow you too?
- Cardinality: How many followers do they have? How many people are following them?
If person A tweets at Person B:
- Want to deliver tweet to people who follow both person A and person B
- Original implementation: single table, source_id and destination_id, and each contains ID of source/destination
- Single table, master-worker replication
What problems did this cause?
- Write throughput problem
- Inputs could not be kept in RAM
Solution:
- Partition by User ID
- Edges stored in BOTH forward AND backward direction (same tweet stored twice)
- Indexed by time
- Indexed by element (for doing set algebra)
- Partitioned by user: source_id of one and destination_id of other are identical
Challenges:
- Consistency in the presence of failures
- Write operations: idempotent (retry until success)
- Last-write wins for edges
- Commutative strategies for mass writes
- Low latency - ms time scale
Principles:
- Impossible to precompute set algebra queries
- Simple distributed coordination techniques
- Partition, replicate, and index
- Many efficient scalability problems are solved this way: Partition, Replicate, Index
Search Indices
Results of searching for, e.g., "mountain dew cheetos"
Search index:
- Find all tweets with these words in it
- Posting list
- Boolean
- Queries
- Complex queries
- Ad hoc queries
- Relevancy is recency (this ignores the non-real-time component to search...)
Searching for, e.g., mountain dew cheetos is the intersection of three posting lists
- Original implementation: single table, vertically scaled
- One column: term ID
- Another column: document ID
- Master-worker replication for read throughput
Problems: index cannot be kept in memory
Current implementation:
- Partitioned by TIME first
- Then partitioned by term ID...
- Use delayed key-write
What problems does the solution create:
- Write throughput issues
- Queries for rare terms have to search MANY partitions
- Space efficiency and recall
- MySQL requires loooots of memory
Data Engineering Scenarios
In line with the data-engineering-scenarios Github organization that I created (https://github.com/data-engineering-scenarios), this page will contain notes on different scenarios - both finished and planned.
These scenarios focus on different technologies available via Google Cloud or Amazon Web Services. Roughly, they can be grouped as follows:
Dataproc
Dataproc Technologies
This is the "classic" big data technology - distributed computing on clusters.
Google Cloud product:
- Dataproc - allocate clusters, run jobs
Amazon product:
- Amazon EC2 - allocate clusters, run jobs
Hadoop ecosystem:
- Hadoop - the big data technology that started it all; processing data in parallel on nodes using MapReduce framework
- Pig - works with Hadoop; higher-level scripting language that shortens Hadoop jobs
- Hive - data warehouse that sits on Hadoop (or Pig); gives SQL-like interface to query data. (SQL queries are implemented in MapReduce)
- HBase - Java software for non-relational databases, analogous to Google's BigTable; runs on Hadoop, can serve as source/sink for MapReduce queries, is a column-based key store; no SQL queries - MapReduce only
- Phoenix - turns HBase (non-relational, non-SQL database) into an SQL-like data store
- Parquet - column-based table storage that sits on Hadoop
Spark technologies:
- Spark - similar to Hadoop, but more focused on efficient computation
- PySpark - Python bindings for Spark (Java)
- SparkSQL - allows SQL queries in Spark programs, e.g., running an SQL query on Hive, and passing the results to Spark computations
Dataproc Scenario
The scenario here is dataproc-spark-kmeans-images-bigquery
Link: https://github.com/data-engineering-scenarios/dataproc-spark-kmeans-images-bigquery
This gets a Dataproc cluster, and runs a Spark job on the cluster that downloads images, extracts k mean color clusters from the image, and pushes the results to BigQuery.
Dataflow
Dataflow Technologies
Google Cloud product:
- Dataflow - building data processing pipelines for transforming streams, with sources/sinks
- PubSub - (unordered) streaming events and messaging
- Difference - PubSub is a messaging service that provides JUST ONE OF MANY sources/sinks for Dataflow
Amazon product:
- Kinesis - streaming events? messaging?
Apache projects:
- Kafka - publishing and subscribing to message streams, stream-processing, and storage of messages in fault-tolerant clusters
- Avro - a data serialization service; turns rich data structures into streams of binary data that can be easily passed around; uses dynamic typing (no code generated - based on schema); smaller serialization size (info about scheme doesn't travel with the data - but data is stored alongside its schema.)
- Thrift - provides cross-talk language for programs in different languages to pass data between them (data and service interfaces)
Dataflow Scenarios
Scenario:
- Docker pod - generating messages and publishing them to a pipeline
- Docker container running a collector (unstructured/nosql)
- Docker container running a dashboard to visualize the collector database
Query
Query Technologies
Google Cloud products:
- BigQuery - petabyte-scale datasets
- BigTable - large, non-relational databases
- CloudSQL - elastic, scalable SQL databases in the cloud
Query Scenarios
Scenario 1: BigQuery examples (working out assembling SQL queries) for open data sets on BigQuery
Link: https://github.com/charlesreid1/sabermetrics-bigquery
Scenario 2: Docker-containerized SQL database, jupyter notebook, for neural network training
Link: https://github.com/data-engineering-scenarios/kaggle-sql-jupyter-keras
Scenario 3: BigQuery as source/sink for images in dataproc-spark-kmeans-images-bigquery
Link: https://github.com/data-engineering-scenarios/dataproc-spark-kmeans-images-bigquery
Machine Learning
Machine Learning Technologies
Scikit:
- scikit-learn
- sklearn-pandas
Supporting py-data libraries:
- Pandas - join, merge, groupby, shift, time series analysis, SQL to dataframe
- SQLAlchemy - SQL data into Python
- Seaborn - linear regression, basic models, essential plot types
- OpenCV - object and face detection
Classic Machine Learning Scenarios
Scenario ideas:
- Time series for messaging services - logs and traffic, outlier detection, publishing messages when anomalies detected
- Web frontend for OpenCV - bounding boxes where objects found
Neural Network Machine Learning
Neural Network Machine Learning Technologies
Google Cloud:
- Cloud ML APIs - using packaged/bundled API calls for achine learning.
- Cloud ML Engine - training TensorFlow models in the cloud with elastic cluster sizes
- Compute Engine - scaling workflows to large data sets "by hand"
- (Integration of larger data stores, e.g., BigQuery/Cloud Storage, with ML training)
Software:
- Keras
- TensorFlow
- Sonnet
- Theano
- MXNet
- etc etc etc
Goals?
- Predictive analytics
- Creating business value from unstructured/very large/unanalyzed data sets
Neural Network Machine Learning Scenarios
Scenario 1: SQL data in a Docker container, training a Keras neural network model
Link: https://github.com/data-engineering-scenarios/kaggle-sql-jupyter-keras
Scenario notes:
- Don't reinvent the wheel, use pre-trained models and APIs
- Cover different challenges (OOM and large training sets), fuel/kerosene and helper libraries, HDF5 compression/storage, sparse events or large feature sets
- Scenario template: JS frontend, Flask glue, Keras/other Python backend
Scenario ideas:
- Pre-trained image recognition model, prediction of type of object, wrap front-end with graphs to show data, objects detected, etc.
- Trained face differences, upload two faces, give prediction.
GCDEC
Working through the Google Cloud Data Engineer certification course... See GCDEC for pages related to that.