Overview

Data engineering - software engineering with an emphasis on dealing with large amounts of data

What is Data Engineering

Enable others to answer questions using datasets, within latency constraints

Components:

Distributed systems
Parallel processing
Databases
Queuing

Purpose?

Human fault tolerance
Metrics
Monitoring
Multi-tenancy

Example of where you start:

Searches by keyword/user only
Basic statistics only
Using someone else's search engine

Example stack:

Custom crawlers ingesting data (Gearman)
Passing data off to custom workers
Dumping data to MySQL/Sphinx/etc

Problems:

Inflexibility
Corruption is highly probable
High burden on operations
No scalability
No fault tolerance

Alternative stack:

Many collectors dumping to Amazon S3
Analysis with Hadoop
ElephantDB
Low latency (but lead time of several hours)
More advanced statistics (influencer of, influenced by)

Data pipeline example:

Tweets go to S3
URLS are normalized
Each hour, new compute bucket
Sum by hour and by url
Emit ElephantDB indices

Another data pipeline example:

Tweets go to Kafka
URLs are normalized
Each hour, new compute bucket
Update hour/url bucket
Send data to Cassandra

Clojure example:

tweet reactor/tweet reaction/tweet reshare/now-secs/interaction/interaction-scores
serialization of data using thrift

Infrastructure components:

HDFS - distributed in-memory big data filesystem
MapReduce - operations on HDFS data
Kafka - messaging queue (and later, distributed processing on messages)
Storm - distributed processing
Spark - distributed, parallelized computation on HDFS data
Cassandra - scalable database
HBase - database operating on top of HDFS
Zookeeper - highly reliable distributed coordination (maintain config info, naming, synchronization, and multiple services)
ElephantDB - like a NoSQL Hadoop store - key/value data in Hadoop

Multi-tenancy:

Independent applications on a single cluster
Topologies should not affect each other
Topologies should have adequate resources (Apache Mesos)
When submitting a job, specify resources needed

Data engineering vs data science:

Data engineers have well-defined problems
Data scientists need specialized statistical skills
Data engineers deal with a larger scope - not just analytics

Open source:

Important for recruitment
Strong developers want to work where they can be involved in open source
Popular open-source projects give access to better engineers
identify good recruits, learn best practices, get help - not "free work"

Ideal data engineers:

Strong software engineering skills
Abstraction
Testing
Version control
Refactoring
Strong software engineering skills
Strong algorithm skills
Digging into open source code
Stress testing

Finding strong data engineers:

Standard "code on a whiteboard" interviews are useless
Take-home projects to gauge general abilities
Best: see projects requiring data engineering

Data Engineering Example: Twitter

Real time data:

online queries for a web request
offline computations with very low latency
latency and throughput equally important
Hadoop is too high-latency!

Four data problems at twitter:

Tweets
Timelines
Social graphs
Search indices

Tweets

Definitions of data:

Tweet is primary key id, user id, text, timestamp (and replies)
Row storage
Initially: single table vertically scaled
Initially: master-worker replication (writes to master, replication to workers)
Initially: Memmcached for reads (rails reads the real database, populates memcached instances)

Issues:

Scaling isk space: disk arrays > 800 GB problematic
At 3 trillion tweets, disk space 90% utilized

Solutions:

Partition: partition by primary key (one cluster holds one set of keys, another holds different)
Partition: tweet IDs partitioned by user
Partition: tweets by time
Try each partition in order, until enough data accumulated

Locality of databases:

Memcached: primary key lookup is 1 ms
MySQL: primary key lookup is < 10 ms
Exploit locality speed by organizing tweets by timestamp (usually only 1 partition checked)

Problems:

write throughput
Deadlocks in MySQL (if tweet volume gets crazy)
Temporal shard creation is manual process

Timelines

Definitions:

Timeline is series of tweet ids
Query pattern: organized by user ID
Operations: append, merge, truncate
high velocity bounded vector
Space-based

Primitive approach: SQL query:

Use the following SQL query (below)
Bingo, you have your subquery of followers who are tweeting at anyone, passed into the search for tweets
BUT - what happens if you have lots of friends, or if the number of source tweet IDs cant fit in RAM?

SELECT * FROM tweets WHERE user_id IN (SELECT source_id FROM followers WHERE destination_id = ?) ORDER BY created_at DESC LIMIT 20

offline vs online computation:

Sequences can be stored in memcached (individual timelines)
You pass a status to Fanout
Fanout is offline, but has a low-latency SLlA
Truncate at random intervals, ensuring bounded length
What to do on a cache miss?
Merge the user timelines

Stats:

2008: 30 TPS, 120 TPS peak
2010: 700 TPS, 2,000 TPS peak
1.2M deliveries per second

Memory hierarchy:

Possibilities:
Fanout to disk (Lots of IOPS required, even iwth buffering; cost of rebuilding data from other data stores is reasonable; fanout to memory)

Principles:

Offline vs online computation
Some problems can be pre-computed (if amt of work bounded, and queyr pattern limtied)
Must keep memory hierarchy in mind
Efficiency of system includes cost of generating data from another source times probability of needing to

Social Graphs

Who follows you? Who do you follow? Who have you blocked, etc.

Operations:

Enumerate by time
Set operations: intersection, difference, union
Inclusion, cardinality

Spam problems:

Need mass delete ability

Temporal enumeration:

Who you followed, listed by when you followed them
Inclusion: do they follow you too?
Cardinality: How many followers do they have? How many people are following them?

If person A tweets at Person B:

Want to deliver tweet to people who follow both person A and person B
Original implementation: single table, source_id and destination_id, and each contains ID of source/destination
Single table, master-worker replication

What problems did this cause?

Write throughput problem
Inputs could not be kept in RAM

Solution:

Partition by User ID
Edges stored in BOTH forward AND backward direction (same tweet stored twice)
Indexed by time
Indexed by element (for doing set algebra)
Partitioned by user: source_id of one and destination_id of other are identical

Challenges:

Consistency in the presence of failures
Write operations: idempotent (retry until success)
Last-write wins for edges
Commutative strategies for mass writes
Low latency - ms time scale

Principles:

Impossible to precompute set algebra queries
Simple distributed coordination techniques
Partition, replicate, and index
Many efficient scalability problems are solved this way: Partition, Replicate, Index

Search Indices

Results of searching for, e.g., "mountain dew cheetos"

Search index:

Find all tweets with these words in it
Posting list
Boolean
Queries
Complex queries
Ad hoc queries
Relevancy is recency (this ignores the non-real-time component to search...)

Searching for, e.g., mountain dew cheetos is the intersection of three posting lists

Original implementation: single table, vertically scaled
One column: term ID
Another column: document ID
Master-worker replication for read throughput

Problems: index cannot be kept in memory

Current implementation:

Partitioned by TIME first
Then partitioned by term ID...
Use delayed key-write

What problems does the solution create:

Write throughput issues
Queries for rare terms have to search MANY partitions
Space efficiency and recall
MySQL requires loooots of memory

Data Engineering Scenarios

In line with the data-engineering-scenarios Github organization that I created (https://github.com/data-engineering-scenarios), this page will contain notes on different scenarios - both finished and planned.

These scenarios focus on different technologies available via Google Cloud or Amazon Web Services. Roughly, they can be grouped as follows:

Dataproc

Dataproc Technologies

This is the "classic" big data technology - distributed computing on clusters.

Google Cloud product:

Dataproc - allocate clusters, run jobs

Amazon product:

Amazon EC2 - allocate clusters, run jobs

Hadoop ecosystem:

Hadoop - the big data technology that started it all; processing data in parallel on nodes using MapReduce framework
Pig - works with Hadoop; higher-level scripting language that shortens Hadoop jobs
Hive - data warehouse that sits on Hadoop (or Pig); gives SQL-like interface to query data. (SQL queries are implemented in MapReduce)
HBase - Java software for non-relational databases, analogous to Google's BigTable; runs on Hadoop, can serve as source/sink for MapReduce queries, is a column-based key store; no SQL queries - MapReduce only
Phoenix - turns HBase (non-relational, non-SQL database) into an SQL-like data store
Parquet - column-based table storage that sits on Hadoop

Spark technologies:

Spark - similar to Hadoop, but more focused on efficient computation
PySpark - Python bindings for Spark (Java)
SparkSQL - allows SQL queries in Spark programs, e.g., running an SQL query on Hive, and passing the results to Spark computations

Dataproc Scenario

The scenario here is dataproc-spark-kmeans-images-bigquery

Link: https://github.com/data-engineering-scenarios/dataproc-spark-kmeans-images-bigquery

This gets a Dataproc cluster, and runs a Spark job on the cluster that downloads images, extracts k mean color clusters from the image, and pushes the results to BigQuery.

Dataflow

Dataflow Technologies

Google Cloud product:

Dataflow - building data processing pipelines for transforming streams, with sources/sinks
PubSub - (unordered) streaming events and messaging
Difference - PubSub is a messaging service that provides JUST ONE OF MANY sources/sinks for Dataflow

Amazon product:

Kinesis - streaming events? messaging?

Apache projects:

Kafka - publishing and subscribing to message streams, stream-processing, and storage of messages in fault-tolerant clusters
Avro - a data serialization service; turns rich data structures into streams of binary data that can be easily passed around; uses dynamic typing (no code generated - based on schema); smaller serialization size (info about scheme doesn't travel with the data - but data is stored alongside its schema.)
Thrift - provides cross-talk language for programs in different languages to pass data between them (data and service interfaces)

Dataflow Scenarios

Scenario:

Docker pod - generating messages and publishing them to a pipeline
Docker container running a collector (unstructured/nosql)
Docker container running a dashboard to visualize the collector database

Query

Query Technologies

Google Cloud products:

BigQuery - petabyte-scale datasets
BigTable - large, non-relational databases
CloudSQL - elastic, scalable SQL databases in the cloud

Query Scenarios

Scenario 1: BigQuery examples (working out assembling SQL queries) for open data sets on BigQuery

Link: https://github.com/charlesreid1/sabermetrics-bigquery

Scenario 2: Docker-containerized SQL database, jupyter notebook, for neural network training

Link: https://github.com/data-engineering-scenarios/kaggle-sql-jupyter-keras

Scenario 3: BigQuery as source/sink for images in dataproc-spark-kmeans-images-bigquery

Link: https://github.com/data-engineering-scenarios/dataproc-spark-kmeans-images-bigquery

Machine Learning

Machine Learning Technologies

Scikit:

scikit-learn
sklearn-pandas

Supporting py-data libraries:

Pandas - join, merge, groupby, shift, time series analysis, SQL to dataframe
SQLAlchemy - SQL data into Python
Seaborn - linear regression, basic models, essential plot types
OpenCV - object and face detection

Classic Machine Learning Scenarios

Scenario ideas:

Time series for messaging services - logs and traffic, outlier detection, publishing messages when anomalies detected
Web frontend for OpenCV - bounding boxes where objects found

Neural Network Machine Learning

Neural Network Machine Learning Technologies

Google Cloud:

Cloud ML APIs - using packaged/bundled API calls for achine learning.
Cloud ML Engine - training TensorFlow models in the cloud with elastic cluster sizes
Compute Engine - scaling workflows to large data sets "by hand"
(Integration of larger data stores, e.g., BigQuery/Cloud Storage, with ML training)

Software:

Keras
TensorFlow
Sonnet
Theano
MXNet
etc etc etc

Goals?

Predictive analytics
Creating business value from unstructured/very large/unanalyzed data sets

Neural Network Machine Learning Scenarios

Scenario 1: SQL data in a Docker container, training a Keras neural network model