Revision as of 20:36, 13 October 2017

Data Engineering Scenarios

In line with the data-engineering-scenarios GitHub organization that I created (https://github.com/data-engineering-scenarios), this page will contain notes on different scenarios - both finished and planned.

These scenarios focus on different technologies available via Google Cloud or Amazon Web Services. Roughly, they can be grouped as follows:

Dataproc

This is the "classic" big data technology - distributed computing on clusters.

Dataproc - Google Cloud's managed Hadoop/Spark service; allocate a cluster and run jobs through it

Hadoop - the big data technology that started it all; processing data in parallel on nodes using the MapReduce framework

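Spark - similar to Hadoop, but more focused on efficient computation: intermediate results are kept in memory rather than written to disk between stages
PySpark - Python API for Spark (Spark itself is written in Scala and runs on the JVM)
Spark SQL - allows SQL queries in Spark programs, e.g., running an SQL query on Hive, and passing the results to Spark computations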

Pig - works with Hadoop; higher-level scripting language (Pig Latin) whose scripts are compiled into Hadoop jobs, so they are much shorter than hand-written MapReduce code

Hive - data warehouse that sits on Hadoop; gives an SQL-like interface (HiveQL) to query data. (Queries are compiled into MapReduce jobs)
HBase - Java software for non-relational databases, analogous to Google's Bigtable; runs on Hadoop, can serve as source/sink for MapReduce jobs, is a column-family-oriented key-value store; no native SQL queries - data is accessed through the client API (get/put/scan) or MapReduce
Phoenix - adds an SQL layer on top of HBase (non-relational, non-SQL database), so it can be queried like an SQL data store
Parquet - columnar file format for storing tables, commonly used on Hadoop
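
Since most of the tools above either run on or compile down to MapReduce, here is a minimal sketch of the map / shuffle / reduce pattern as a word count. It is plain Python running in a single process, not Hadoop code: the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample input are made up for illustration, and a real cluster would spread each phase across nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key.

    On a cluster the framework does this between map and reduce;
    here it is just an in-memory dictionary.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse the list of values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical two-line input, standing in for files split across nodes.
    sample = ["big data on clusters", "data in parallel on nodes"]
    print(word_count(sample))
```

The mapper and reducer are the only parts a job author normally writes; the grouping step is what the framework provides, which is why higher-level tools like Pig and Hive can generate these jobs automatically.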

GCDEC

Working through the Google Cloud Data Engineer certification course... See GCDEC for pages related to that.

Flags


Data Engineering: Difference between revisions

From charlesreid1
