From charlesreid1

Line 107: Line 107:
==Data Engineering Example: Twitter==
==Data Engineering Example: Twitter==


Real time data:
[[Data Engineering/Twitter Example]]
* online queries for a web request
* offline computations with very low latency
* latency and throughput equally important
* Hadoop is too high-latency!
 
Four data problems at twitter:
* Tweets
* Timelines
* Social graphs
* Search indices
 
===Tweets===
 
Definitions of data:
* Tweet is primary key id, user id, text, timestamp (and replies)
* Row storage
* Initially: single table vertically scaled
* Initially: master-worker replication (writes to master, replication to workers)
* Initially: Memmcached for reads (rails reads the real database, populates memcached instances)
 
Issues:
* Scaling isk space: disk arrays > 800 GB problematic
* At 3 trillion tweets, disk space 90% utilized
 
Solutions:
* Partition: partition by primary key (one cluster holds one set of keys, another holds different)
* Partition: tweet IDs partitioned by user
* Partition: tweets by time
* Try each partition in order, until enough data accumulated
 
Locality of databases:
* Memcached: primary key lookup is 1 ms
* MySQL: primary key lookup is < 10 ms
* Exploit locality speed by organizing tweets by timestamp (usually only 1 partition checked)
 
Problems:
* write throughput
* Deadlocks in MySQL (if tweet volume gets crazy)
* Temporal shard creation is manual process
 
More solutions:
* Cassandra (NoSQL, non-relational)
* primary key partition (tweet ID)
* but also, secondary key on user ID
 
===Timelines===
 
Definitions:
* Timeline is series of tweet ids
* Query pattern: organized by user ID
* Operations: append, merge, truncate
* high velocity bounded vector
* Space-based
 
Primitive approach: SQL query:
* Use the following SQL query (below)
* Bingo, you have your subquery of followers who are tweeting at anyone, passed into the search for tweets
* BUT - what happens if you have lots of friends, or if the number of source tweet IDs cant fit in RAM?
 
<pre>
SELECT * FROM tweets WHERE user_id IN (SELECT source_id FROM followers WHERE destination_id = ?) ORDER BY created_at DESC LIMIT 20
</pre>
 
offline vs online computation:
* Sequences can be stored in memcached (individual timelines)
* You pass a status to Fanout
* Fanout is offline, but has a low-latency SLlA
* Truncate at random intervals, ensuring bounded length
* What to do on a cache miss?
* Merge the user timelines
 
Stats:
* 2008: 30 TPS, 120 TPS peak
* 2010: 700 TPS, 2,000 TPS peak
* 1.2M deliveries per second
 
Memory hierarchy:
* Possibilities:
* Fanout to disk (Lots of IOPS required, even iwth buffering; cost of rebuilding data from other data stores is reasonable; fanout to memory)
 
Principles:
* Offline vs online computation
* Some problems can be pre-computed (if amt of work bounded, and queyr pattern limtied)
* Must keep memory hierarchy in mind
* Efficiency of system includes cost of generating data from another source times probability of needing to
 
===Social Graphs===
 
Who follows you? Who do you follow? Who have you blocked, etc.
 
Operations:
* Enumerate by time
* Set operations: intersection, difference, union
* Inclusion, cardinality
 
Spam problems:
* Need mass delete ability
 
Temporal enumeration:
* Who you followed, listed by when you followed them
* Inclusion: do they follow you too?
* Cardinality: How many followers do they have? How many people are following them?
 
If person A tweets at Person B:
* Want to deliver tweet to people who follow both person A and person B
* Original implementation: single table, source_id and destination_id, and each contains ID of source/destination
* Single table, master-worker replication
 
What problems did this cause?
* Write throughput problem
* Inputs could not be kept in RAM
 
Solution:
* Partition by User ID
* Edges stored in BOTH forward AND backward direction (same tweet stored twice)
* Indexed by time
* Indexed by element (for doing set algebra)
* Partitioned by user: source_id of one and destination_id of other are identical
 
Challenges:
* Consistency in the presence of failures
* Write operations: idempotent (retry until success)
* Last-write wins for edges
* Commutative strategies for mass writes
* Low latency - ms time scale
 
Principles:
* Impossible to precompute set algebra queries
* Simple distributed coordination techniques
* Partition, replicate, and index
* Many efficient scalability problems are solved this way: Partition, Replicate, Index
 
===Search Indices===
 
Results of searching for, e.g., "mountain dew cheetos"
 
Search index:
* Find all tweets with these words in it
* Posting list
* Boolean
* Queries
* Complex queries
* Ad hoc queries
* Relevancy is recency (this ignores the non-real-time component to search...)
 
Searching for, e.g., mountain dew cheetos is the intersection of three posting lists
* Original implementation: single table, vertically scaled
* One column: term ID
* Another column: document ID
* Master-worker replication for read throughput
 
Problems: index cannot be kept in memory
 
Current implementation:
* Partitioned by TIME first
* Then partitioned by term ID...
* Use delayed key-write
 
What problems does the solution create:
* Write throughput issues
* Queries for rare terms have to search MANY partitions
* Space efficiency and recall
* MySQL requires loooots of memory


=Data Engineering Scenarios=
=Data Engineering Scenarios=

Revision as of 08:21, 20 October 2017

Overview

Data engineering - software engineering with an emphasis on dealing with large amounts of data

What is Data Engineering

Enable others to answer questions using datasets, within latency constraints

Components:

  • Distributed systems
  • Parallel processing
  • Databases
  • Queuing

Purpose?

  • Human fault tolerance
  • Metrics
  • Monitoring
  • Multi-tenancy

Example of where you start:

  • Searches by keyword/user only
  • Basic statistics only
  • Using someone else's search engine

Example stack:

  • Custom crawlers ingesting data (Gearman)
  • Passing data off to custom workers
  • Dumping data to MySQL/Sphinx/etc

Problems:

  • Inflexibility
  • Corruption is highly probable
  • High burden on operations
  • No scalability
  • No fault tolerance

Alternative stack:

  • Many collectors dumping to Amazon S3
  • Analysis with Hadoop
  • ElephantDB
  • Low latency (but lead time of several hours)
  • More advanced statistics (influencer of, influenced by)

Data pipeline example:

  • Tweets go to S3
  • URLS are normalized
  • Each hour, new compute bucket
  • Sum by hour and by url
  • Emit ElephantDB indices

Another data pipeline example:

  • Tweets go to Kafka
  • URLs are normalized
  • Each hour, new compute bucket
  • Update hour/url bucket
  • Send data to Cassandra

Clojure example:

  • tweet reactor/tweet reaction/tweet reshare/now-secs/interaction/interaction-scores
  • serialization of data using thrift

Infrastructure components:

  • HDFS - distributed in-memory big data filesystem
  • MapReduce - operations on HDFS data
  • Kafka - messaging queue (and later, distributed processing on messages)
  • Storm - distributed processing
  • Spark - distributed, parallelized computation on HDFS data
  • Cassandra - scalable database
  • HBase - database operating on top of HDFS
  • Zookeeper - highly reliable distributed coordination (maintain config info, naming, synchronization, and multiple services)
  • ElephantDB - like a NoSQL Hadoop store - key/value data in Hadoop

Multi-tenancy:

  • Independent applications on a single cluster
  • Topologies should not affect each other
  • Topologies should have adequate resources (Apache Mesos)
  • When submitting a job, specify resources needed

Data engineering vs data science:

  • Data engineers have well-defined problems
  • Data scientists need specialized statistical skills
  • Data engineers deal with a larger scope - not just analytics

Open source:

  • Important for recruitment
  • Strong developers want to work where they can be involved in open source
  • Popular open-source projects give access to better engineers
  • identify good recruits, learn best practices, get help - not "free work"

Ideal data engineers:

  • Strong software engineering skills
  • Abstraction
  • Testing
  • Version control
  • Refactoring
  • Strong software engineering skills
  • Strong algorithm skills
  • Digging into open source code
  • Stress testing

Finding strong data engineers:

  • Standard "code on a whiteboard" interviews are useless
  • Take-home projects to gauge general abilities
  • Best: see projects requiring data engineering

Data Engineering Example: Twitter

Data Engineering/Twitter Example

Data Engineering Scenarios

In line with the data-engineering-scenarios Github organization that I created (https://github.com/data-engineering-scenarios), this page will contain notes on different scenarios - both finished and planned.

These scenarios focus on different technologies available via Google Cloud or Amazon Web Services. Roughly, they can be grouped as follows:

Dataproc

Dataproc Technologies

This is the "classic" big data technology - distributed computing on clusters.

Google Cloud product:

  • Dataproc - allocate clusters, run jobs

Amazon product:

  • Amazon EC2 - allocate clusters, run jobs

Hadoop ecosystem:

  • Hadoop - the big data technology that started it all; processing data in parallel on nodes using MapReduce framework
  • Pig - works with Hadoop; higher-level scripting language that shortens Hadoop jobs
  • Hive - data warehouse that sits on Hadoop (or Pig); gives SQL-like interface to query data. (SQL queries are implemented in MapReduce)
  • HBase - Java software for non-relational databases, analogous to Google's BigTable; runs on Hadoop, can serve as source/sink for MapReduce queries, is a column-based key store; no SQL queries - MapReduce only
  • Phoenix - turns HBase (non-relational, non-SQL database) into an SQL-like data store
  • Parquet - column-based table storage that sits on Hadoop

Spark technologies:

  • Spark - similar to Hadoop, but more focused on efficient computation
  • PySpark - Python bindings for Spark (Java)
  • SparkSQL - allows SQL queries in Spark programs, e.g., running an SQL query on Hive, and passing the results to Spark computations

Dataproc Scenario

The scenario here is dataproc-spark-kmeans-images-bigquery

Link: https://github.com/data-engineering-scenarios/dataproc-spark-kmeans-images-bigquery

This gets a Dataproc cluster, and runs a Spark job on the cluster that downloads images, extracts k mean color clusters from the image, and pushes the results to BigQuery.

Dataflow

Dataflow Technologies

Google Cloud product:

  • Dataflow - building data processing pipelines for transforming streams, with sources/sinks
  • PubSub - (unordered) streaming events and messaging
  • Difference - PubSub is a messaging service that provides JUST ONE OF MANY sources/sinks for Dataflow

Amazon product:

  • Kinesis - streaming events? messaging?

Apache projects:

  • Kafka - publishing and subscribing to message streams, stream-processing, and storage of messages in fault-tolerant clusters
  • Avro - a data serialization service; turns rich data structures into streams of binary data that can be easily passed around; uses dynamic typing (no code generated - based on schema); smaller serialization size (info about scheme doesn't travel with the data - but data is stored alongside its schema.)
  • Thrift - provides cross-talk language for programs in different languages to pass data between them (data and service interfaces)

Dataflow Scenarios

Scenario:

  • Docker pod - generating messages and publishing them to a pipeline
  • Docker container running a collector (unstructured/nosql)
  • Docker container running a dashboard to visualize the collector database

Query

Query Technologies

Google Cloud products:

  • BigQuery - petabyte-scale datasets
  • BigTable - large, non-relational databases
  • CloudSQL - elastic, scalable SQL databases in the cloud

Query Scenarios

Scenario 1: BigQuery examples (working out assembling SQL queries) for open data sets on BigQuery

Link: https://github.com/charlesreid1/sabermetrics-bigquery

Scenario 2: Docker-containerized SQL database, jupyter notebook, for neural network training

Link: https://github.com/data-engineering-scenarios/kaggle-sql-jupyter-keras

Scenario 3: BigQuery as source/sink for images in dataproc-spark-kmeans-images-bigquery

Link: https://github.com/data-engineering-scenarios/dataproc-spark-kmeans-images-bigquery

Machine Learning

Machine Learning Technologies

Scikit:

  • scikit-learn
  • sklearn-pandas

Supporting py-data libraries:

  • Pandas - join, merge, groupby, shift, time series analysis, SQL to dataframe
  • SQLAlchemy - SQL data into Python
  • Seaborn - linear regression, basic models, essential plot types
  • OpenCV - object and face detection

Classic Machine Learning Scenarios

Scenario ideas:

  • Time series for messaging services - logs and traffic, outlier detection, publishing messages when anomalies detected
  • Web frontend for OpenCV - bounding boxes where objects found

Neural Network Machine Learning

Neural Network Machine Learning Technologies

Google Cloud:

  • Cloud ML APIs - using packaged/bundled API calls for achine learning.
  • Cloud ML Engine - training TensorFlow models in the cloud with elastic cluster sizes
  • Compute Engine - scaling workflows to large data sets "by hand"
  • (Integration of larger data stores, e.g., BigQuery/Cloud Storage, with ML training)

Software:

  • Keras
  • TensorFlow
  • Sonnet
  • Theano
  • MXNet
  • etc etc etc

Goals?

  • Predictive analytics
  • Creating business value from unstructured/very large/unanalyzed data sets

Neural Network Machine Learning Scenarios

Scenario 1: SQL data in a Docker container, training a Keras neural network model

Link: https://github.com/data-engineering-scenarios/kaggle-sql-jupyter-keras

Scenario notes:

  • Don't reinvent the wheel, use pre-trained models and APIs
  • Cover different challenges (OOM and large training sets), fuel/kerosene and helper libraries, HDF5 compression/storage, sparse events or large feature sets
  • Scenario template: JS frontend, Flask glue, Keras/other Python backend

Scenario ideas:

  • Pre-trained image recognition model, prediction of type of object, wrap front-end with graphs to show data, objects detected, etc.
  • Trained face differences, upload two faces, give prediction.

GCDEC

Working through the Google Cloud Data Engineer certification course... See GCDEC for pages related to that.


Flags