GCDEC/Dataflow/Notes: Difference between revisions
From charlesreid1
Revision as of 23:15, 17 October 2017
Serverless Data Analysis with Dataflow
Module 2: Data Processing Pipelines with Dataflow
What Is Dataflow
Dataflow:
- Way to execute data processing pipelines on the cloud
- Flexible sources/sinks (e.g., read from BigQuery and write to Cloud Storage)
- Steps - transforms - are elastic, can be scaled to more machines as needed
- Code is written using open source API (Apache Beam)
- Cloud Dataflow is the Apache Beam "pipeline service"
- Other Apache Beam pipeline services: Flink, Spark
- Example: read from GCS, perform filtering, perform grouping, perform transform, then write results to GCS
Each step: user-defined code (Java or Python classes)
ParDo ("parallel do") - runs a user-defined transform on each element, in parallel
Why Dataflow?
- Batch or streaming
- Cloud Storage - batch data (e.g., historical data) source
- Cloud PubSub - streaming source
- Can use the SAME PIPELINE for both scenarios
- Can have Dataflow write to various sinks
- BigQuery - batch results storage sink
- Cloud Storage - batch results storage sink
- PubSub - streaming results sink
For streaming cases:
- Define a sliding window for streaming data
- Change input and output to read from an UNBOUNDED source
- Then define a window, e.g., 60 minutes
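To make the sliding-window idea concrete, here is a toy sketch in plain Python (not the Beam API) of how a single event timestamp lands in several overlapping windows. The 60-minute window size comes from the notes above; the 15-minute period and the function name sliding_windows are assumptions made for illustration.

```python
def sliding_windows(timestamp, size, period):
    """Return the [start, end) windows of length `size`, starting every
    `period` seconds, that contain `timestamp` (all values in seconds)."""
    # Start of the latest window that begins at or before the timestamp.
    start = timestamp - (timestamp % period)
    windows = []
    # Walk backwards while the window still covers the timestamp.
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


# An event 70 minutes in, with 60-minute windows that slide every 15 minutes.
for window in sliding_windows(70 * 60, size=60 * 60, period=15 * 60):
    print(window)
```

Each event belongs to size / period windows (four here), which is why a sliding window produces overlapping results.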
Data Pipelines
Can write pipelines in Java or Python
Concepts:
- Pipeline - set of steps (transforms)
- The pipeline is executed on the cloud by a runner
- Apache Beam code forms the pipeline, Dataflow is the runner
- Each step is elastically scaled
- Source - where the input data comes from
- Sink - where the transformed data goes
PCollection:
- Each transform in the pipeline takes a parallel collection (PCollection) as an input
- PCollection - a list or map of items that is not bounded by the size of one machine and does not need to fit into memory
Pipeline:
- Directed graph of steps
- Read in data, transform it, write data out
- Example Java pipeline:
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public static void main(String[] args) {
    // Create the pipeline, parameterized with the command-line args
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(TextIO.Read.from("gs://..."))   // Read the input
     .apply(new CountWords())               // Count (process) the text; CountWords is user-defined
     .apply(TextIO.Write.to("gs://..."));   // Write output to GCS
    // Now run the pipeline
    p.run();
}
p.run() hands the pipeline "graph" to the runner, which executes it
Direct runner - runs the pipeline locally, on a single machine
Dataflow runner - graph gets launched on the cloud
Python API: similar feel...
import sys
import apache_beam as beam

if __name__ == "__main__":
    # Create the pipeline, parameterized on the command-line args
    p = beam.Pipeline(argv=sys.argv)
    (p
     | beam.io.ReadFromText("gs://...")                # Read input
     | beam.FlatMap(lambda line: count_words(line))    # Process; count_words is user-defined
     | beam.io.WriteToText("gs://...")                 # Write output
    )
    p.run()  # Run the pipeline
Python uses the pipe operator to carry out transforms in sequence.
Step 1: create graph
Step 2: run it
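The two steps can be illustrated with a toy model in plain Python. This is a simplified sketch, not how Beam is implemented: the names ToyPipeline and Step are hypothetical, and in real Beam the | operator returns a PCollection rather than the pipeline itself. The point is only that applying | records a step, and nothing is computed until run() is called.

```python
class Step:
    """A transform: wraps a function that maps a list to a list."""
    def __init__(self, fn):
        self.fn = fn


class ToyPipeline:
    """Records steps when `|` is applied; nothing executes until run()."""
    def __init__(self, source):
        self.source = source
        self.steps = []

    def __or__(self, step):
        # Step 1: building the graph only appends to the list of steps.
        self.steps.append(step)
        return self

    def run(self):
        # Step 2: execute the recorded steps in order.
        data = list(self.source)
        for step in self.steps:
            data = step.fn(data)
        return data


p = ToyPipeline(["a b", "c"])
(p
 | Step(lambda lines: [w for line in lines for w in line.split()])
 | Step(lambda words: [w.upper() for w in words]))
print(len(p.steps))  # 2 steps recorded, nothing computed yet
result = p.run()
print(result)        # ['A', 'B', 'C']
```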
PCollections
Input to transform: PCollection
Output from transform: PCollection
All data in a pipeline is represented with a PCollection
In Java:
PCollection<String> lines = p.apply(...);
We can also define a transform to happen within a ParDo context, which will parallelize the transform, by defining a DoFn
Above, we define a collection of Strings called lines.
Below, we perform a transform for each line (each String in the collection called lines)
PCollection<Integer> sizes =
    lines.apply("Length",
        ParDo.of(new DoFn<String, Integer>() {
            @ProcessElement
            public void processElement(ProcessContext c) throws Exception {
                String line = c.element();   // The current input element
                c.output(line.length());     // Emit its length
            }
        }
    ));
Above - an anonymous class that extends DoFn and defines a processElement() method; the result is a collection of integers that you can then transform with the next step
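The element-by-element contract of the Java DoFn above can be mimicked in plain Python. This is a toy sketch, not the Beam SDK: LengthFn and par_do are hypothetical names, and a real runner would spread the process() calls across many workers instead of looping on one machine.

```python
class LengthFn:
    """Hypothetical stand-in for a DoFn: process() handles one element
    and may emit any number of outputs."""
    def process(self, element):
        yield len(element)


def par_do(fn, collection):
    """Toy ParDo: call fn.process on every element and flatten the outputs."""
    for element in collection:
        yield from fn.process(element)


lines = ["hello", "", "dataflow"]
sizes = list(par_do(LengthFn(), lines))
print(sizes)  # [5, 0, 8]
```

Because each call sees only one element, the work can be split across machines without the function needing to know about it.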
In Python:
lines = p | ...
Now, for every line that comes in, return the length of the line:
sizes = lines | "Length" >> beam.Map(lambda line: len(line))
The name ("Length") is important - it shows up in the monitoring console
Dataflow allows you to replace parts in a pipeline, WITHOUT ANY LOSS OF DATA (any data not processed by old pipeline will be processed by new pipeline)
But for that exchange of transforms to work, they need to have unique names
In Python, the >> operator is overloaded so that "Length" >> beam.Map(...) means "call this step Length and have it perform a Map operation"
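A minimal sketch of how this kind of overloading can work in plain Python. The Map class below is a toy written for illustration, not beam.Map; it relies on the standard __rrshift__ hook, which Python calls for "label" >> obj because str does not define >> itself.

```python
class Map:
    """Toy transform (not beam.Map): applies fn to every element."""
    def __init__(self, fn):
        self.fn = fn
        self.label = None

    def __rrshift__(self, label):
        # Invoked for `label >> transform`: attach the name, return the transform.
        self.label = label
        return self

    def apply(self, items):
        return [self.fn(x) for x in items]


step = "Length" >> Map(lambda line: len(line))
print(step.label)                 # Length
print(step.apply(["ab", "cde"]))  # [2, 3]
```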
Executing Pipelines
To process data with pipelines, need to read data into pipelines
Can read input data from GCS, BigQuery, PubSub
Will look at a few example pipelines, first in Java, then in Python.
Java Pipelines
Example of reading text into a PCollection of Strings:
// Note the wildcard syntax
PCollection<String> lines = p.apply(TextIO.Read.from("gs://.../input-*.csv.gz"));
Example of reading data from PubSub:
PCollection<String> lines = p.apply(PubsubIO.Read.from("input_topic"));
Example of running a BigQuery query and returning a table row:
String javaQuery = "SELECT x, y, z FROM [project:dataset.tablename]";
PCollection<TableRow> javaContent = p.apply(BigQueryIO.Read.fromQuery(javaQuery));
The last example returns a PCollection of TableRow objects
Likewise with sinks - whatever you can read, you can also write
Write data to file system, GCS, BigQuery, or PubSub
lines.apply(TextIO.Write.to("/data/output").withSuffix(".txt"));
If the output files are very small, and you don't want to deal with the hassle of sharding, can also say:
.apply(TextIO.Write.to("/data/output").withSuffix(".csv").withoutSharding());
(Note that this requires all I/O to happen on a single machine)
TextIO only accepts Strings, so elements of any other type must be transformed to String before they are written out
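The conversion to String is usually just one more per-element step before the write. A small sketch in plain Python, assuming the elements are (word, count) pairs; the record shape and the name to_csv_line are hypothetical:

```python
def to_csv_line(record):
    """Turn one (word, count) pair into a text line ready for a text sink."""
    word, count = record
    return "{},{}".format(word, count)


counts = [("beam", 3), ("dataflow", 5)]
csv_lines = [to_csv_line(r) for r in counts]
print(csv_lines)  # ['beam,3', 'dataflow,5']
```

In a pipeline, a function like this would be applied as a map step immediately before the text sink.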
To execute your pipeline, two options:
- Option 1: run Java, pass it the classpath, give it the name of the file/class with the main method
- Option 2: run using maven
Option 2 looks like:
mvn compile -e exec:java -Dexec.mainClass=${MAIN}
And to run in the cloud, use mvn to submit the job to Dataflow
mvn compile -e exec:java \
  -Dexec.mainClass=$MAIN \
  -Dexec.args="--project=$PROJECT \
    --stagingLocation=gs://$BUCKET/staging/ \
    --tempLocation=gs://$BUCKET/staging/ \
    --runner=DataflowRunner"
If using Java version of Dataflow, use Maven
- Default is to use a local runner
- Can add arguments to specify project (b/c this controls billing), staging location (where to stage code), temporary location (optional), and runner to use
Python Pipelines
Python pipeline execution:
Running locally just requires running the program without args:
python ./my-grep.py
To run in the cloud, specify parameters:
python ./my-grep.py \
  --project=$PROJECT \
  --job_name=myjob \
  --staging_location=gs://$BUCKET/staging/ \
  --temp_location=gs://$BUCKET/staging \
  --runner=DataflowRunner
Only difference is that Python requires a job name...
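The notes do not show how my-grep.py reads these flags. One common pattern, sketched here with the standard argparse module, is to parse the script's own flags and leave everything else untouched for the pipeline. The --input flag and the project value are hypothetical, chosen only for illustration.

```python
import argparse


def parse_args(argv):
    """Split the script's own flags from the ones meant for the pipeline."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", default="input.txt")  # hypothetical flag
    # Unrecognized flags are returned as-is instead of raising an error.
    known, pipeline_args = parser.parse_known_args(argv)
    return known, pipeline_args


known, pipeline_args = parse_args([
    "--project=my-project",
    "--job_name=myjob",
    "--runner=DataflowRunner",
])
print(known.input)    # input.txt
print(pipeline_args)  # the three cloud flags, left for the pipeline
```

With no cloud flags at all, pipeline_args is empty, which matches the local case of running the program without args.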
So far, we looked at what Dataflow is and some simple Dataflow concepts
Now we will write/implement a Dataflow pipeline