Data Engineering: Difference between revisions
From charlesreid1
Revision as of 20:36, 13 October 2017
Data Engineering Scenarios
In line with the data-engineering-scenarios GitHub organization that I created (https://github.com/data-engineering-scenarios), this page will contain notes on different scenarios - both finished and planned.
These scenarios focus on different technologies available via Google Cloud or Amazon Web Services. Roughly, they can be grouped as follows:
Dataproc
This is the "classic" big data technology - distributed computing on clusters.
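- Dataproc - Google Cloud's managed service for Hadoop and Spark; allocate a cluster and run jobs through it
- Hadoop - the big data technology that started it all; processes data in parallel across cluster nodes using the MapReduce framework
- Spark - similar in purpose to Hadoop MapReduce, but keeps intermediate data in memory where possible, which makes iterative and interactive computations faster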
- PySpark - Python API for Spark (Spark itself is written in Scala and runs on the JVM)
- SparkSQL - allows SQL queries in Spark programs, e.g., running an SQL query on Hive, and passing the results to Spark computations
- Pig - works with Hadoop; higher-level scripting language (Pig Latin) whose scripts compile to MapReduce jobs, so they are much shorter than hand-written jobs
- Hive - data warehouse that sits on Hadoop; gives SQL-like interface (HiveQL) to query data. (Queries are compiled into MapReduce jobs; newer versions can also run them on Tez or Spark)
- HBase - Java software for non-relational databases, analogous to Google's BigTable; runs on Hadoop (HDFS), can serve as source/sink for MapReduce jobs, is a column-oriented key-value store; no native SQL queries - data is accessed through the HBase client API (get/put/scan) or MapReduce
- Phoenix - SQL layer on top of HBase; lets you query the non-relational, non-SQL store with SQL
- Parquet - column-oriented file format for tabular data, widely used across the Hadoop ecosystem (typically stored on HDFS or cloud storage)
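Since most of the tools above either run MapReduce or compile down to it, here is a minimal sketch of the MapReduce idea in plain Python: a word count split into map, shuffle, and reduce steps. It runs in a single process and does not use Hadoop at all; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample input are made up for illustration. On a real cluster the map and reduce steps would run in parallel on different nodes and the framework would handle the shuffle.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: collapse each key's list of values into a single sum."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Run map -> shuffle -> reduce over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    # Hypothetical sample input, just to show the shape of the result.
    sample = ["spark and hadoop", "hadoop and hive"]
    print(word_count(sample))
```

Tools like Pig and Hive exist largely so that you do not have to write these three steps by hand for every query.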
GCDEC
Working through the Google Cloud Data Engineer certification course... See GCDEC for pages related to that.