From charlesreid1

Revision as of 20:50, 13 October 2017 by Admin

Data Engineering Scenarios

In line with the data-engineering-scenarios GitHub organization that I created (https://github.com/data-engineering-scenarios), this page contains notes on different scenarios, both finished and planned.

These scenarios focus on different technologies available via Google Cloud or Amazon Web Services. Roughly, they can be grouped as follows:

Dataproc

This is the "classic" big data technology - distributed computing on clusters.

Google Cloud product:

  • Dataproc - allocate clusters, run jobs

Amazon product:

  • Amazon EMR (Elastic MapReduce) - allocate clusters, run jobs; the clusters themselves run on Amazon EC2 instances

Hadoop ecosystem:

  • Hadoop - the big data technology that started it all; processes data in parallel across cluster nodes using the MapReduce framework
  • Pig - works with Hadoop; higher-level scripting language (Pig Latin) that compiles to MapReduce jobs, which shortens the code needed for a Hadoop job
  • Hive - data warehouse that sits on Hadoop; gives an SQL-like interface (HiveQL) to query data. (Queries are classically compiled into MapReduce jobs)
  • HBase - Java software for a non-relational database, modeled on Google's Bigtable; runs on top of HDFS, can serve as source/sink for MapReduce jobs, is a column-oriented key-value store; no native SQL queries - data is accessed through get/put/scan calls or MapReduce jobs
  • Phoenix - SQL layer on top of HBase; lets you query HBase (a non-relational, non-SQL database) with SQL
  • Parquet - column-oriented file format for storing tables, widely used across the Hadoop ecosystem
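To make the MapReduce idea behind Hadoop, Pig, and Hive more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases for a word count. It is plain Python rather than Hadoop code; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for illustration, and on a real cluster each phase would be spread across nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data on clusters", "Big clusters run big jobs"]
    print(word_count(sample))
```

Tools like Pig and Hive exist largely so that this kind of job can be written as a short script or query instead of hand-written mapper and reducer code.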

Spark technologies:

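  • Spark - similar to Hadoop MapReduce, but keeps intermediate data in memory where possible, which makes iterative and interactive computation more efficient
  • PySpark - Python bindings for Spark (Spark itself is written in Scala and runs on the JVM)
  • Spark SQL - allows SQL queries in Spark programs, e.g., running an SQL query on a Hive table and passing the results to Spark computations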

Dataflow

Google Cloud product:

  • Dataflow - building data processing pipelines that transform batch or streaming data, with sources/sinks
  • Pub/Sub - streaming events and messaging (delivery order is not guaranteed by default)
  • Difference - Pub/Sub is a messaging service; it is JUST ONE OF MANY possible sources/sinks for a Dataflow pipeline
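The distinction between the messaging service and the pipeline can be sketched in a few lines of plain Python. This is not the Dataflow or Pub/Sub API: InMemoryTopic, read_from_list, and run_pipeline are hypothetical names invented for this illustration. The point is only that the pipeline runner does not care which source feeds it.

```python
from collections import deque


class InMemoryTopic:
    """Hypothetical stand-in for a messaging topic (not the Pub/Sub API)."""

    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        self._messages.append(message)

    def pull(self):
        """Yield and remove every message currently waiting on the topic."""
        while self._messages:
            yield self._messages.popleft()


def read_from_list(records):
    """A second kind of source: a fixed batch of records."""
    yield from records


def run_pipeline(source, transforms, sink):
    """Push each element from the source through every transform, then into the sink."""
    for element in source:
        for transform in transforms:
            element = transform(element)
        sink.append(element)
    return sink


if __name__ == "__main__":
    topic = InMemoryTopic()
    topic.publish(" click ")
    topic.publish("view")
    # Same pipeline, two different sources.
    print(run_pipeline(topic.pull(), [str.strip, str.upper], []))
    print(run_pipeline(read_from_list(["scroll"]), [str.strip, str.upper], []))
```

Swapping topic.pull() for read_from_list(...) changes the source without touching the transforms or the sink, which is the sense in which a messaging service is only one of many sources/sinks.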

Amazon product:

  • Kinesis - streaming event ingestion and messaging (roughly the Amazon counterpart to Pub/Sub)

Apache projects:

  • Kafka - publishing and subscribing to message streams, stream-processing, and storage of messages in fault-tolerant clusters
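The publish/subscribe-plus-storage idea can be sketched as an append-only log where each subscriber tracks its own read position. This is a toy in-memory model, not Kafka's client API: TopicLog and its methods are hypothetical names, and it leaves out partitions, replication, and the fault tolerance that a real cluster provides.

```python
class TopicLog:
    """Hypothetical in-memory append-only log; illustrates the idea, not Kafka's API."""

    def __init__(self):
        self._log = []      # messages are stored and are not deleted when read
        self._offsets = {}  # next position to read, tracked per subscriber name

    def publish(self, message):
        """Append a message and return its offset in the log."""
        self._log.append(message)
        return len(self._log) - 1

    def poll(self, subscriber, max_messages=10):
        """Return the next unread messages for this subscriber and advance its offset."""
        start = self._offsets.get(subscriber, 0)
        batch = self._log[start:start + max_messages]
        self._offsets[subscriber] = start + len(batch)
        return batch


if __name__ == "__main__":
    log = TopicLog()
    log.publish("order-created")
    log.publish("order-paid")
    print(log.poll("billing"))    # billing reads from the start of the log
    print(log.poll("analytics"))  # analytics independently reads the same messages
```

Because messages stay in the log after being read, a new subscriber can start from the beginning and replay everything, which is what separates this model from a queue that discards messages once delivered.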

GCDEC

I am working through the Google Cloud Data Engineer certification course; see GCDEC for pages related to that.

