Leveraging Unstructured Data with Cloud Dataproc

Module 1: Introduction to Cloud Dataproc

Overview of Unstructured Data=

In industry:

Unstructured data is often the hardest to deal with
Four sources of data: (data you have, and analyze) (data you don't have, but wish you had) (data you have, but don't analyze) (data that you could easily acquire)
Focus on data you have, but don't analyze
Several reasons why you wouldn't analyze it - volume, quality, velocity, and lack of structure
Most common difficulty is lack of structure - we don't have the kinds of tools to analyze unstructured data that we have to analyze structured data
Example: emails from customers. Newsgroups. Photographs taken in the field (inspections).
If we don't have tools for dealing with images, or with free-form text, we will ignore it ("too hard")

Example:

Google Street View imagery collected
Huge collection of unstructured imagery
Collected, used, and shown to users of Google Maps
Initially, no technology in place to utilize it any further
Once deep learning came along, went back and looked at that imagery
Could analyze the imagery, extract information, use it to enhance the maps

Extrapolating:

You may have a lot of unstructured data laying around in the company, serving one specific purpose
Typically, it is given to a human user to analyze, and that's the end - because there is no automated system, no automated tools, for dealing with that unstructured data
Google Cloud provides machine learning tools to analyze unstructured data (freeform text, images, etc.)
You don't have to start from scratch - using the ML APIs, can take advantage of pre-built machine learning models to extract information that is useful

Counting Problems

Consider MapReduce tasks: these can usually be boiled down to counting problems

Example: delays in payment processing

Not just counting, but essentially, lots of counting problems
Counting how long it took you, counting number of occurrences, counting a mean
MapReduce is full of easy counting problems - large set of data, apply specific operations, count occurrences

Harder counting problems:

How often are programmers checking in low-quality code?
This is still a counting problem, but now it's hard to actually determine what to count
How do you determine what "low quality code" means?
How can you extract information like "low quality" or "high quality" from code?
Many analytical tools are about counting problems - but some counting problems are easier than others

Why Dataproc

What is a petabyte?

27 years to download a petabyte over 4G
100 libraries of congress
Every tweet ever tweeted......... TIMES 50
2 micrograms of DNA
1 day of videos uploaded to YouTube

Scaling up: bigger machine

Scaling out: more machines

MapReduce approach - split the data, compute nodes process data local to it

Big data open source stack:

Hadoop HDFS
On top of that, Pig/Hive/HBase/Spark

Challenge with clusters:

If you're using it 24/7, you need a bigger cluster
If you're not using it 24/7, you are using your cluster inefficiently
Need a shared resource that can be provisioned when you need it, not used when you don't

YARN = Yet Another Resource Negotiator (MapReduce 2.0)

Dataproc cluster allocated through menu:

We don't have to do anything (like connecting to worker nodes) except specify how many nodes we need
Bucket allocated for the cluster - this is how the worker nodes interact with the master nodes

Creating Dataproc Clusters

Philosophy: one cluster for one job

Select zone closest to your data (e.g., Cloud SQL)

Data coming into the data center will not cost you
Egress data (data leaving the data center) does cost you
Having Dataproc cluster in a different zone from your data will lead to more costs

Note: HDFS by default creates 3 copies of the data

We want our Hadoop jobs to directly use Cloud Storage, in place of HDFS - don't want to use HDFS to store input/output data

Preemptible worker nodes - we can provision worker nodes that can, potentially, get kicked off our cluster if a higher-priority job comes along

Want a core of nodes that are high-priority, but can also allocate a large number of pre-emptible worker nodes that may potentially "go away" (Hadoop is resilient to machines going down, so no problemo)

If we allocate preemptible compute nodes, these will match the types (CPU and memory) of regular worker nodes

They also won't have any primary disk associated with them, since they can't be part of HDFS (we don't want them to hold any data permanently, since they may disappear)

Versions of Dataproc:

Spark, Hadoop, Pig, Hive versions
Google Cloud Storage-Hadoop connector
BigQuery-Hadoop connector

To create a Dataproc cluster: can use web console, or can use gcloud SDK (command line)

To create these resources from a command line:

gcloud dataproc clusters create my-creative-cluster-name \
--zone us-central1-a \
--master-machine-type n1-standard-1 \
--master-boot-disk-size 50 \
--num-workers 2 \
--worker-machine-type n1-standard-1 \
--worker-boot-disk-size 50

To get some help on this:

$ gcloud dataproc --help
$ gcloud dataproc clusters --help 
$ gcloud dataproc clusters create --help

Custom machine types:

Control the types of machines in the cluster
Cloud Network for networking
Cloud IAM for security
Master node running on compute engine
Persistent workers running on compute engine
HDFS and Google Cloud Storage as a shared filesystem between nodes

Machine types:

Standard-4 means 4 cpus, 1 GB per cpu
Highmem instances double the amount of memory per core
Custom type: first dash number specifies number of CPUs, second dash number specifies amount of memory in MB

Example:

gcloud dataproc clusters create test-cluster \
--worker-machine-type custom-6-30720 \
--master-machine-type custom-6-23040

Custom instance: 6 CPUs, 5 GB per core:

That's 30 GB total
30 * 1024 = 30720 MB
Hence, the flag custom-6-30720

Note that you can also set up a custom cluster, and get the REST command that has the parameters you want, to set it up on the command line.

Creating Dataproc Clusters Lab

Lab goals:

Create console
SSH to cluster
Get to the browser management for Hadoop (this requires a firewall rule change to allow your local machine to reach the page)
Create, manage, delete Dataproc clusters from CLI

Create a Dataproc cluster:

Standard size: 1 cpu, 2 nodes
Disk space: 10 GB each
Once cluster is up, go to Compute Engine and find the master cluster. Click SSH.

If you want to SSH in, you can just use a browser window.

To enable SSH to the cluster from an arbitrary client:

Edit the master node in the Compute Engine section of the console
Scroll down to the SSH Keys section
Add the public key from the machine that will be connecting
Click save
Take note of the username assigned, just to the left of where you copy and paste your key
Now you can SSH in

Once you're in:

$ python --version
Python 2.7.9

$ java -version
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-1~bpo8+1-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)

$ scala -version
Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL

$ pyspark --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_131
Branch dataproc-branch-1.2
Compiled by user  on 2017-08-09T20:35:29Z
Revision 002ee20ea40d88d051e0a7c5bd4a7c07721dbddc
Url https://bigdataoss-internal.googlesource.com/third_party/apache/bigtop
Type --help for more information.

$ pig --version
Apache Pig version 0.16.0 (r: unknown)
compiled Aug 09 2017, 20:25:20

$ hive --version
Hive 2.1.1
Subversion git://build-dataproc-1-2/mnt/ram/bigtop/bigtop/output/hive/hive-2.1.1 -r 002ee20ea40d88d051e0a7c5bd4a7c07721dbddc
Compiled by bigtop on Wed Aug 9 19:46:33 UTC 2017
From source with checksum a7e70a0bdc3bd77d45c8738c41ac59eb

$ sudo su

# apt-get install inetutils-tools

# ifconfig
eth0      Link encap:Ethernet  HWaddr 42:01:0a:8a:00:03
          inet addr:10.138.0.3  Bcast:10.138.0.3  Mask:255.255.255.255
          UP BROADCAST RUNNING MULTICAST  MTU:1460  Metric:1
          RX packets:9088 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6308 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2222892 (2.1 MiB)  TX bytes:1654499 (1.5 MiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:4322 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4322 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:521771 (509.5 KiB)  TX bytes:521771 (509.5 KiB)

To restrict access via SSH to your machine and your machine only:

Side menu > Networking > VPC Networking > Firewall Rules
Create a new firewall rule
Targets: all instances in network
Source IP ranges: enter your IP address, followed by /32
Protocols and ports: allow specific protocols: tcp:8088;tcp:50070;tcp:8080
Create the rule

The purpose of this was to allow us to access port 8088/8080/5007 via browser

Copy IP address of master node from the Compute Engine console, and go to http://35.197.115.38:8088/

Port 50070: no such luck. Not open.

$ nmap -Pn -p 8088 35.197.115.38/32

Starting Nmap 7.60 ( https://nmap.org ) at 2017-09-24 14:54 PDT
Nmap scan report for 38.115.197.35.bc.googleusercontent.com (35.197.115.38)
Host is up (0.075s latency).

PORT     STATE SERVICE
8088/tcp open  radan-http

Nmap done: 1 IP address (1 host up) scanned in 0.26 seconds

$ nmap -Pn -p 50070 35.197.115.38/32

Starting Nmap 7.60 ( https://nmap.org ) at 2017-09-24 14:54 PDT
Nmap scan report for 38.115.197.35.bc.googleusercontent.com (35.197.115.38)
Host is up (0.070s latency).

PORT      STATE  SERVICE
50070/tcp closed unknown

Nmap done: 1 IP address (1 host up) scanned in 0.16 seconds

Side note: a bit confused about SSH firewall rules. Once I have my public key incorporated into the Compute Engine instance, there was no need for a firewall rule. But the "Connecting to a Linux Instance" documentation states:

"Note: Your Google Cloud Platform VPC network must have one or more firewall rules that allow SSH connections on port 22. The firewall rules must allow SSH connections for the IP ranges or specific IP addresses from which you want to connect."

Link: https://cloud.google.com/compute/docs/instances/connecting-to-instance#standardssh

Also looks like you can have a VPN network that connects to the Google Cloud Platform VPC. That makes the Google Compute nodes into internal IP addresses on the network.

Bastion instances: these are special nodes that have an internal and an external IP address, so that you can use them as a gateway for connecting to a Google Compute Engine network

Link: https://cloud.google.com/solutions/connecting-securely#bastion

Creating Dataproc Clusters from Command Line Lab

Start by installing gcloud command line tool:

$ brew install caskroom/cask/google-cloud-sdk

Add this to bashrc:

source '/usr/local/Caskroom/google-cloud-sdk/latest/google-cloud-sdk/path.bash.inc'
source '/usr/local/Caskroom/google-cloud-sdk/latest/google-cloud-sdk/completion.bash.inc'

Now authorize google cloud. If you try and allocate compute nodes before authenticating, you'll see:

$ gcloud dataproc clusters create my-fancy-cluster \
 --zone us-west1-a \
 --master-machine-type n1-standard-1 \
 --master-boot-disk-size 20 \
 --num-workers 2 \
 --worker-machine-type n1-standard-1 \
 --worker-boot-disk-size 20
ERROR: (gcloud.dataproc.clusters.create) PERMISSION_DENIED: Permission denied on resource project quiet-era-180418 (#0)
- '@type': type.googleapis.com/google.rpc.Help
  links:
  - description: Google developers console
    url: https://console.developers.google.com

To authenticate:

$ gcloud init

Link: https://cloud.google.com/sdk/docs/authorizing

Cloud Dataproc Remote Conections

Secure connections to VM: https://cloud.google.com/solutions/connecting-securely#bastion

How to protect services on machines with external IPs
How to connect to machines that do not have external IPs
Firewall rules

Connecting to Linux instances: https://cloud.google.com/compute/docs/instances/connecting-to-instance#standardssh

How to connect to compute engine instances

Module 2: Running Dataproc jobs

References

Flags

GCDEC/Unstructured Data/Notes

From charlesreid1

Contents