GCDEC/Unstructured Data/Notes
From charlesreid1
Leveraging Unstructured Data with Cloud Dataproc
Module 1: Introduction to Cloud Dataproc
Overview of Unstructured Data
In industry:
- Unstructured data is often the hardest to deal with
- Four categories of data: data you have and analyze; data you don't have but wish you had; data you have but don't analyze; data that you could easily acquire
- Focus on data you have, but don't analyze
- Several reasons why you wouldn't analyze it - volume, quality, velocity, and lack of structure
- Most common difficulty is lack of structure - we don't have the kinds of tools to analyze unstructured data that we have to analyze structured data
- Example: emails from customers. Newsgroups. Photographs taken in the field (inspections).
- If we don't have tools for dealing with images, or with free-form text, we will ignore it ("too hard")
Example:
- Google Street View imagery collected
- Huge collection of unstructured imagery
- Collected, used, and shown to users of Google Maps
- Initially, no technology in place to utilize it any further
- Once deep learning came along, went back and looked at that imagery
- Could analyze the imagery, extract information, use it to enhance the maps
Extrapolating:
- You may have a lot of unstructured data lying around in the company, serving one specific purpose
- Typically, it is given to a human user to analyze, and that's the end - because there is no automated system, no automated tools, for dealing with that unstructured data
- Google Cloud provides machine learning tools to analyze unstructured data (freeform text, images, etc.)
- You don't have to start from scratch - using the ML APIs, can take advantage of pre-built machine learning models to extract information that is useful
Counting Problems
Consider MapReduce tasks: these can usually be boiled down to counting problems
Example: delays in payment processing
- Not just counting, but essentially, lots of counting problems
- Counting how long it took you, counting number of occurrences, counting a mean
- MapReduce is full of easy counting problems - large set of data, apply specific operations, count occurrences
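The "apply an operation, then count occurrences" pattern can be sketched on a single machine with a shell pipeline, where each stage loosely plays the role of a MapReduce phase. This is only an analogy, not how Hadoop runs a job; the sample payment records and the function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical payment log, one "<payment-id> <status>" record per line
# (sample data invented for this sketch).
sample_payments() {
  printf '%s\n' \
    'p1 late' \
    'p2 ontime' \
    'p3 late' \
    'p4 ontime' \
    'p5 late'
}

# map:     emit the key (the status field) for every record
# shuffle: sort brings identical keys next to each other
# reduce:  uniq -c counts each run of identical keys
count_by_status() {
  awk '{ print $2 }' | sort | uniq -c | awk '{ print $2, $1 }'
}

sample_payments | count_by_status
```

The harder counting problems are the ones where the "map" step (deciding what the key is, e.g. "low quality") has no one-line answer.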
Harder counting problems:
- How often are programmers checking in low-quality code?
- This is still a counting problem, but now it's hard to actually determine what to count
- How do you determine what "low quality code" means?
- How can you extract information like "low quality" or "high quality" from code?
- Many analytical tools are about counting problems - but some counting problems are easier than others
Why Dataproc
What is a petabyte?
- 27 years to download a petabyte over 4G
- 100 libraries of congress
- Every tweet ever tweeted, times 50
- 2 micrograms of DNA
- 1 day of videos uploaded to YouTube
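To get a feel for the first figure, here is a rough back-of-the-envelope check of what sustained bandwidth "27 years per petabyte" implies. It assumes 1 PB = 10^15 bytes and 365-day years; both are assumptions of this sketch, as is the function name.

```shell
#!/usr/bin/env bash
# Implied sustained bandwidth (in megabits per second) if one petabyte
# takes the given number of years to download.
# Assumes 1 PB = 10^15 bytes and 365-day years.
implied_mbps() {
  awk -v years="$1" 'BEGIN {
    bits = 8 * 1e15
    seconds = years * 365 * 24 * 3600
    printf "%.1f\n", bits / seconds / 1e6
  }'
}

implied_mbps 27
```

Under those assumptions the result is a little under 10 Mbps, which is in the range of a typical 4G connection.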
Scaling up: bigger machine
Scaling out: more machines
MapReduce approach - split the data, and have each compute node process the data local to it
Big data open source stack:
- Hadoop HDFS
- On top of that, Pig/Hive/HBase/Spark
Challenge with clusters:
- If you're using it 24/7, you need a bigger cluster
- If you're not using it 24/7, you are using your cluster inefficiently
- Need a shared resource that can be provisioned when you need it, not used when you don't
YARN = Yet Another Resource Negotiator (MapReduce 2.0)
Dataproc cluster allocated through menu:
- We don't have to do anything (like connecting to worker nodes) except specify how many nodes we need
- Bucket allocated for the cluster - this is how the worker nodes interact with the master nodes
Creating Dataproc Clusters
Philosophy: one cluster for one job
Select zone closest to your data (e.g., Cloud SQL)
- Data coming into the data center will not cost you
- Egress data (data leaving the data center) does cost you
- Having Dataproc cluster in a different zone from your data will lead to more costs
Note: HDFS by default creates 3 copies of the data
We want our Hadoop jobs to directly use Cloud Storage, in place of HDFS - don't want to use HDFS to store input/output data
Preemptible worker nodes - we can provision worker nodes that can, potentially, get kicked off our cluster if a higher-priority job comes along
Want a core of nodes that are high-priority, but can also allocate a large number of preemptible worker nodes that may potentially "go away" (Hadoop is resilient to machines going down, so no problemo)
If we allocate preemptible compute nodes, these will match the types (CPU and memory) of regular worker nodes
They also won't have any primary disk associated with them, since they can't be part of HDFS (we don't want them to hold any data permanently, since they may disappear)
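A sketch of how a cluster with a small core of regular workers plus a larger pool of preemptible workers might be requested from the command line. The `--num-preemptible-workers` flag is the one gcloud offered around the time of these notes (newer releases call it `--num-secondary-workers`); the function name, cluster name, and node counts are placeholders. The function only prints the command, so running the block provisions nothing.

```shell
#!/usr/bin/env bash
# Print (do not run) a cluster-creation command with both regular and
# preemptible workers. No machine type is given for the preemptible nodes
# because they take the same type as the regular workers.
preemptible_cluster_cmd() {
  local name="$1" workers="$2" preemptible="$3"
  echo gcloud dataproc clusters create "$name" \
    --zone us-central1-a \
    --num-workers "$workers" \
    --num-preemptible-workers "$preemptible"
}

preemptible_cluster_cmd my-creative-cluster-name 2 8
```

Paste the printed line into a shell with an authenticated gcloud to actually create the cluster.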
Versions of Dataproc:
- Spark, Hadoop, Pig, Hive versions
- Google Cloud Storage-Hadoop connector
- BigQuery-Hadoop connector
To create a Dataproc cluster: can use web console, or can use gcloud SDK (command line)
To create these resources from a command line:
$ gcloud dataproc clusters create my-creative-cluster-name \
--zone us-central1-a \
--master-machine-type n1-standard-1 \
--master-boot-disk-size 50 \
--num-workers 2 \
--worker-machine-type n1-standard-1 \
--worker-boot-disk-size 50
To get some help on this:
$ gcloud dataproc --help
$ gcloud dataproc clusters --help
$ gcloud dataproc clusters create --help
Custom machine types:
- Control the types of machines in the cluster
- Cloud Network for networking
- Cloud IAM for security
- Master node running on compute engine
- Persistent workers running on compute engine
- HDFS and Google Cloud Storage as a shared filesystem between nodes
Machine types:
- n1-standard-4 means 4 vCPUs with 3.75 GB of memory per vCPU (15 GB total)
- Highmem instances provide more memory per vCPU (6.5 GB per vCPU for n1-highmem)
- Custom type: first dash number specifies number of CPUs, second dash number specifies amount of memory in MB
Example:
$ gcloud dataproc clusters create test-cluster \
--worker-machine-type custom-6-30720 \
--master-machine-type custom-6-23040
Custom instance: 6 CPUs, 5 GB per core:
- That's 30 GB total
- 30 * 1024 = 30720 MB
- Hence, the flag custom-6-30720
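The arithmetic above can be wrapped in a small helper. The function name is made up for this sketch, and it only handles a whole number of GB per CPU (the master's custom-6-23040, which is 22.5 GB in total, would need the MB value worked out by hand).

```shell
#!/usr/bin/env bash
# Build a custom machine type name from a CPU count and a whole number
# of GB per CPU: custom-<cpus>-<total memory in MB>.
custom_machine_type() {
  local cpus="$1" gb_per_cpu="$2"
  echo "custom-${cpus}-$(( cpus * gb_per_cpu * 1024 ))"
}

custom_machine_type 6 5   # prints custom-6-30720
```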
Note that you can also set up a custom cluster, and get the REST command that has the parameters you want, to set it up on the command line.
Creating Dataproc Clusters Lab
Lab goals:
- Create a cluster from the console
- SSH to cluster
- Get to the browser management for Hadoop (this requires a firewall rule change to allow your local machine to reach the page)
- Create, manage, delete Dataproc clusters from CLI
Create a Dataproc cluster:
- Standard size: 1 cpu, 2 nodes
- Disk space: 10 GB each
- Once the cluster is up, go to Compute Engine and find the master node. Click SSH.
If you want to SSH in, you can just use a browser window.
To enable SSH to the cluster from an arbitrary client:
- Edit the master node in the Compute Engine section of the console
- Scroll down to the SSH Keys section
- Add the public key from the machine that will be connecting
- Click save
- Take note of the username assigned, just to the left of where you copy and paste your key
- Now you can SSH in
Once you're in:
$ python --version
Python 2.7.9
$ java -version
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-1~bpo8+1-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
$ scala -version
Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL
$ pyspark --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/
Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_131
Branch dataproc-branch-1.2
Compiled by user on 2017-08-09T20:35:29Z
Revision 002ee20ea40d88d051e0a7c5bd4a7c07721dbddc
Url https://bigdataoss-internal.googlesource.com/third_party/apache/bigtop
Type --help for more information.
$ pig --version
Apache Pig version 0.16.0 (r: unknown)
compiled Aug 09 2017, 20:25:20
$ hive --version
Hive 2.1.1
Subversion git://build-dataproc-1-2/mnt/ram/bigtop/bigtop/output/hive/hive-2.1.1 -r 002ee20ea40d88d051e0a7c5bd4a7c07721dbddc
Compiled by bigtop on Wed Aug 9 19:46:33 UTC 2017
From source with checksum a7e70a0bdc3bd77d45c8738c41ac59eb
$ sudo su
# apt-get install inetutils-tools
# ifconfig
eth0 Link encap:Ethernet HWaddr 42:01:0a:8a:00:03
inet addr:10.138.0.3 Bcast:10.138.0.3 Mask:255.255.255.255
UP BROADCAST RUNNING MULTICAST MTU:1460 Metric:1
RX packets:9088 errors:0 dropped:0 overruns:0 frame:0
TX packets:6308 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2222892 (2.1 MiB) TX bytes:1654499 (1.5 MiB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:4322 errors:0 dropped:0 overruns:0 frame:0
TX packets:4322 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:521771 (509.5 KiB) TX bytes:521771 (509.5 KiB)
To restrict access to the cluster's web interfaces to your machine and your machine only:
- Side menu > Networking > VPC Networking > Firewall Rules
- Create a new firewall rule
- Targets: all instances in network
- Source IP ranges: enter your IP address, followed by /32
- Protocols and ports: allow specific protocols: tcp:8088;tcp:50070;tcp:8080
- Create the rule
The purpose of this was to allow us to access ports 8088, 8080, and 50070 via browser
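The same rule could probably be created from the command line instead of the console. The sketch below only prints the command; the rule name `allow-hadoop-ui`, the `default` network, and the example address (taken from a documentation-reserved range) are all assumptions to be replaced with your own values.

```shell
#!/usr/bin/env bash
# Print (do not run) a firewall rule that opens the Hadoop web UI ports
# to a single client address. The /32 suffix limits the source range to
# exactly one IP.
firewall_rule_cmd() {
  local my_ip="$1"
  echo gcloud compute firewall-rules create allow-hadoop-ui \
    --network default \
    --allow tcp:8088,tcp:50070,tcp:8080 \
    --source-ranges "${my_ip}/32"
}

firewall_rule_cmd 203.0.113.7
```

With no target tags given, the rule applies to all instances in the network, matching the "Targets" choice in the console steps.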
Copy IP address of master node from the Compute Engine console, and go to http://35.197.115.38:8088/
Port 50070: no such luck. Not open.
$ nmap -Pn -p 8088 35.197.115.38/32
Starting Nmap 7.60 ( https://nmap.org ) at 2017-09-24 14:54 PDT
Nmap scan report for 38.115.197.35.bc.googleusercontent.com (35.197.115.38)
Host is up (0.075s latency).
PORT STATE SERVICE
8088/tcp open radan-http
Nmap done: 1 IP address (1 host up) scanned in 0.26 seconds
$ nmap -Pn -p 50070 35.197.115.38/32
Starting Nmap 7.60 ( https://nmap.org ) at 2017-09-24 14:54 PDT
Nmap scan report for 38.115.197.35.bc.googleusercontent.com (35.197.115.38)
Host is up (0.070s latency).
PORT STATE SERVICE
50070/tcp closed unknown
Nmap done: 1 IP address (1 host up) scanned in 0.16 seconds
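When checking several ports at once, it can help to boil the nmap report down to one line per port. A minimal sketch, assuming nmap's normal (human-readable) output format; the function name is made up, and the sample lines are copied from the scans in these notes.

```shell
#!/usr/bin/env bash
# Read nmap's normal output on stdin and print "<port>/<proto> <state>"
# for every scanned port, dropping the banner and summary lines.
port_states() {
  awk '$1 ~ /^[0-9]+\/(tcp|udp)$/ { print $1, $2 }'
}

# Sample report lines taken from the two scans in these notes.
printf '%s\n' \
  'PORT STATE SERVICE' \
  '8088/tcp open radan-http' \
  '50070/tcp closed unknown' | port_states
```

Against a live cluster this would be used as `nmap -Pn -p 8088,50070,8080 <master-ip> | port_states`.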
Side note: a bit confused about SSH firewall rules. Once I had added my public key to the Compute Engine instance, I did not need to create a firewall rule (possibly because the default network already includes a rule allowing SSH on port 22). But the "Connecting to a Linux Instance" documentation states:
"Note: Your Google Cloud Platform VPC network must have one or more firewall rules that allow SSH connections on port 22. The firewall rules must allow SSH connections for the IP ranges or specific IP addresses from which you want to connect."
Link: https://cloud.google.com/compute/docs/instances/connecting-to-instance#standardssh
Also looks like you can have a VPN that connects to the Google Cloud Platform VPC network. That makes the Compute Engine nodes reachable by their internal IP addresses.
Bastion instances: these are special nodes that have an internal and an external IP address, so that you can use them as a gateway for connecting to a Google Compute Engine network
Link: https://cloud.google.com/solutions/connecting-securely#bastion
Creating Dataproc Clusters from Command Line Lab
Start by installing gcloud command line tool:
$ brew install caskroom/cask/google-cloud-sdk
Add this to bashrc:
source '/usr/local/Caskroom/google-cloud-sdk/latest/google-cloud-sdk/path.bash.inc'
source '/usr/local/Caskroom/google-cloud-sdk/latest/google-cloud-sdk/completion.bash.inc'
Now authorize the Google Cloud SDK. If you try to allocate compute nodes before authenticating, you'll see:
$ gcloud dataproc clusters create my-fancy-cluster \
--zone us-west1-a \
--master-machine-type n1-standard-1 \
--master-boot-disk-size 20 \
--num-workers 2 \
--worker-machine-type n1-standard-1 \
--worker-boot-disk-size 20
ERROR: (gcloud.dataproc.clusters.create) PERMISSION_DENIED: Permission denied on resource project quiet-era-180418 (#0)
- '@type': type.googleapis.com/google.rpc.Help
  links:
  - description: Google developers console
    url: https://console.developers.google.com
To authenticate:
$ gcloud init
Link: https://cloud.google.com/sdk/docs/authorizing
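Once authenticated, the lab's remaining goal (create, manage, delete clusters from the CLI) might look like the sequence below. This is a sketch: the `run` wrapper and the `DRY_RUN` variable are conventions invented here so that the block prints the commands by default instead of touching a real project, and newer gcloud releases may additionally require a `--region` flag on these commands.

```shell
#!/usr/bin/env bash
# DRY_RUN defaults to 1, so each command is printed rather than executed.
# Set DRY_RUN=0 to run the commands against a real project.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# Create a small cluster, list the clusters in the project, then delete it.
cluster_lifecycle() {
  local name="$1" zone="$2"
  run gcloud dataproc clusters create "$name" --zone "$zone" --num-workers 2
  run gcloud dataproc clusters list
  run gcloud dataproc clusters delete "$name"
}

cluster_lifecycle my-fancy-cluster us-west1-a
```

Deleting the cluster as soon as the job finishes fits the "one cluster for one job" philosophy noted earlier.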
Cloud Dataproc Remote Connections
Secure connections to VM: https://cloud.google.com/solutions/connecting-securely#bastion
- How to protect services on machines with external IPs
- How to connect to machines that do not have external IPs
- Firewall rules
Connecting to Linux instances: https://cloud.google.com/compute/docs/instances/connecting-to-instance#standardssh
- How to connect to compute engine instances