Revision as of 23:13, 17 October 2017

Serverless Data Analysis with Dataflow

Module 2: Data Processing Pipelines with Dataflow

What Is Dataflow

Dataflow:

Way to execute data processing pipelines on the cloud
Flexible sources/sinks (e.g., read from BigQuery and write to Cloud Storage)
Steps - transforms - are elastic, can be scaled to more machines as needed
Code is written using open source API (Apache Beam)
Cloud Dataflow is the Apache Beam "pipeline service"
Other Apache Beam pipeline services: Flink, Spark
Example: read from GCS, perform filtering, perform grouping, perform transform, then write results to GCS

Each step: user-defined code (Java or Python classes)

ParDo - can run a particular transform in the context of a parallel do

Why Dataflow?

Batch or streaming
Cloud Storage - batch data (e.g., historical data) source
Cloud PubSub - streaming source
Can use the SAME PIPELINE for both scenarios
Can have Dataflow write to various sinks
BigQuery - batch results storage sink
Cloud Storage - batch results storage sink
PubSub - streaming results sink

For streaming cases:

Define a sliding window for streaming data
Change input and output to read from an UNBOUNDED source
Then define a window, e.g., 60 minutes

Writing Data Pipelines

Can write pipelines in Java or Python

Concepts:

Pipeline - set of steps (transforms)
The pipeline is executed on the cloud by a runner
Apache Beam code forms the pipeline, Dataflow is the runner
Each step is elastically scaled
Source - where the input data comes from
Sink - where the transformed data goes

Pcollection:

Each transform on the pipeline takes a parallel collection (Pcollection) as an input
Pcollection - a list or map of items that does not need to be bounded by the size of the machine, does not need to fit into memory

Pipeline:

Directed graph of steps
Read in data, transform it, write data out
Example Java pipeline:

import org.apache.beam.sdk.Pipeline;

public static void main(String[] args) {
    // Create pipeline 
    // Parameterize with input args
    Pipeline p = Pipeline.create(PipelineOptionsFactoryfromArgs(arg));

    p.apply(TextIO.Read.from("gs://..."))   // Read the input
     .apply(new CountWords())               // Count (process) the text
     .apply(TextIO.Write.to("gs://..."));   // Write output to GCS

    // Now run the pipeline
    p.run();
}

p.run() executes the pipeline "graph" on the runner that will execute the pipeline

Direct runner - runs the pipeline on a single instance of the local machine

Dataflow runner - graph gets launched on the cloud

Python API: similar feel...

import apache_beam as beam

if __name__=="__main__":
    # Create pipeline 
    # Parameterize on input args
    p = beam.Pipeline(argv = sys.argv)

    (p
        | beam.io.ReadFromText("gs://...")              # Read input
        | beam.FlatMap(labda line: count_words(line))   # Process
        | beam.io.WriteToText("gs://...")               # Write output
    )

    p.run()     # Run the pipeline

Python uses the pipe operator to carry out transforms in sequence.

Step 1: create graph

Step 2: run it

Pcollections

Input to transform: Pcollection

Output from transform: Pcollection

All data in pipeline is represented with a Pcollection

# Java:
PCollection<String> lines = p.apply(...)

We can also define a transform to happen within a ParDo context, which will parallelize the transform, by defining a DoFn

Above, we define a collection of Strings called lines.

Below, we perform a transform for each line (each String in the collection called lines)

PCollection<Integer> sizes = 
    lines.apply("Length",
                parDo.of(new DoFn<String, Integer>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) throws Exception{
                        String line = c.element();
                        c.output(line.length());
                    }
                }
            ));

Above - anonymous function that inherits from DoFn, defines processElement() method, result is a collection of integers that you can then transform with the next step

In Python:

lines = p | ...

Now, for every line that comes in, return the length of the line:

sizes = lines | "Length" >> beam.Map( lambda line : len(line))

This name is important - shows up in the monitoring console

Dataflow allows you to replace parts in a pipeline, WITHOUT ANY LOSS OF DATA (any data not processed by old pipeline will be processed by new pipeline)

But for that exchange of transforms to work, they need to have unique names

In Python, overloading >> so that "Length" >> beam.Map(...) means "call this map Length and have it perform a Map operation"

Data Pipelines

Data Pipelines Lab

MapReduce with Dataflow

MapReduce Lab

Side Inputs

Side Inputs Lab

Streaming Data into Dataflow

Streaming Lab

Resources

Flags

@@ Line 31: / Line 31: @@
 * Define a sliding window for streaming data
 * Change input and output to read from an UNBOUNDED source
 * Then define a window, e.g., 60 minutes
 ===Writing Data Pipelines===

GCDEC/Dataflow/Notes: Difference between revisions

From charlesreid1