Big Data and Machine Learning Fundamentals

Module 1

Overview of fundamentals course

Interesting question: why would Google be in the business of cloud computing?

Mission statement: to organize the world's information and make it accessible

Reason for being in cloud computing: organizing the world's information and making it accessible requires a massive amount of infrastructure

1 out of every 5 CPUs that is produced in the world is bought by Google

Organizing information:

GFS and Hadoop:

GFS (2002) was originally an idea for organizing lots of files/information across large clusters; it in turn led to Hadoop HDFS (which is based on GFS)
MapReduce came out of Google around 2004
But, by 2006, Google was no longer writing any MapReduce programs
Why?
MapReduce and HDFS require sharding - distributing your data set across a cluster - which means that the size of your data sets and the size of your cluster are intimately linked

Note: link to all papers is here: https://research.google.com/pubs/papers.html

Google Data Technologies:

GFS - 2002 (basis for HDFS)
MapReduce - 2004 (basis for Hadoop - abandoned)
BigTable - 2006
Dremel - 2008 (replaced MapReduce, available in GCP as BigQuery)
Colossus - 2009 (replacement for GFS)
Flume - 2010 (replaced MapReduce, available in GCP as DataFlow)
Megastore - 2011
Spanner - 2012
Millwheel - 2013 (also part of DataFlow)
PubSub - 2013 (available in GCP as itself)
F1 - 2014
TensorFlow - 2015 (available in GCP as CloudML)

Various innovations coming out of Google are being released into Google Cloud

Elastic computing concept - you should be able to "instantaneously" scale out to as many machines as you need

Purpose of switching to the cloud:

Uptime, keeping hardware up and running
Making teams more efficient and effective
Having the entire Google data stack available to leverage the best software available

Big Data products

Spotify uses two products: PubSub and Dataflow

PubSub is a messaging system, Dataflow is a data pipeline tool

Using GCP big data products helps companies:

pay less per operation
be more efficient (better tooling)
be more innovative and powerful (big stack of data tools)

BigQuery: reducing 2.2 BILLION items to 20K items in <1 min (transformational promise of the cloud)

A functional view:

Foundation
- Compute Engine, Cloud Storage
Databases
- Datastore, Cloud SQL, Cloud BigTable
Analytics and Machine Learning
- BigQuery, Cloud Datalab, Translate API etc.
Data-Handling Frameworks
- Cloud PubSub, Cloud Dataflow, Cloud Dataproc

Why the forked approach?

Google is trying to solve SEVERAL DIFFERENT problems

Changing where people are computing

Keep doing the same things you're doing already, but changing where you're doing them
Each tool addresses different things that people are already doing on-premises (and would not require a change in CODE, just a change in LOCATION)
Cloud databases - (migrating DBs) Cloud SQL (relational databases), Cloud Datastore and Cloud BigTable (NoSQL databases)
Storage platform - (migrating storage) Cloud Storage Standard, Durable Reduced Availability
Managed Hadoop/Spark/Pig/Hive - (migrating data processing) Cloud Dataproc

Providing speed, scalability, and reliability:

Want to provide scalable and reliable services (like Spotify)
Need to be able to justify using hundreds of machines for a few minutes, rather than a smaller number of machines that take much, much longer
Messaging - Cloud PubSub
Data Processing - Cloud Dataflow, Cloud Dataproc

Changing how computation is done:

Utilizing tools provided by Google to do new things, analyze more data, analyze in a different way, build better models
Examples: analyze customer behavior, analyze factory floors
There are basically three use-cases that typically play out
Data exploration and business intelligence - Cloud Datalab, Cloud Data Studio
Data Warehouse for large-scale analytics - Google BigQuery
Machine learning - Cloud Machine Learning, Vision API, Speech API, Translate API

Summary: three principal use-cases for GCP

(Based on what Google sees in their professional services organization)
Migrations - changing where they compute
Scale up and reliability - making a service more scalable/reliable
Transforming business - adding new ways to deal with more data

Usage Scenarios

Google Cloud platform usage scenarios: review

Change where you compute (migration to the cloud)
Scalability and reliability (flexible platform that can scale)
Change how you compute (explore, analyze, extract information differently)

Usage scenarios:

Changing where you compute: Movie company using cloud platform for scaling up rendering (can requisition more machines)
Scalability and reliability: Finance company performing consolidated audit (data repository of all equities, options, orders, quotes, events on stock market) - 6 TB per hour, 100 BILLION market events (a HUMONGOUS amount of data that needs to be processed at scale, and none of it can be lost)
Changing how you compute: Rooms to Go (furniture retailer) combined CRM database and website, BigQuery analysis, redesign room packages

Spend less on ops/admin

Incorporate real-time data into apps/architecture

Apply machine learning broadly

Create citizen data scientists (putting tools into hands of domain experts)

This means your company can become a data-driven organization - decision-makers (domain experts) are no longer waiting for the data, they can deal with and see the data themselves to make the decisions and move forward

Labs

List of code labs: https://codelabs.developers.google.com/cpb100

Signing up for free trial (req. CC): https://console.developers.google.com/freetrial

Note they specifically say, you get $300 in credit over 60 days, and will not be charged.

Need to "activate" compute engine.

Quiz

Goals:

~~Get compute instance fired up, figure out how to use the control panel~~ (the control panel is a bit overwhelming at first, but once you've gone through the process of creating the compute engine, you get the hang of it. Between this and the theoretical coverage of which products do what, the myriad options become a lot more manageable.)
Load a computational combustion data set into a graph database
Load a Docker image into a Google Cloud compute instance
Utilize Apache Giraph to perform a graph analysis

Module 2

Foundations

Three components of computing systems:

Computing
Storage
Networking

The foundations of Google Cloud are the computing and storage:

Compute engine
Cloud storage

(Network layer is mostly transparent)

GCP can be thought of as an earth-scale computer

CPUs provided by compute engine virtual machines

Hard drive/storage is provided by cloud storage

Network connections are provided by the global private network (invisible layer)

Design is based on the scalable, no-ops idea

Custom machine types: https://cloud.google.com/custom-machine-types/

Compute engine pricing: https://cloud.google.com/compute/pricing

Can use preconfigured machine types (set price), or can use custom machine types (custom cores/memory, variable price)

Think about it abstractly: "I want a virtual machine that has 8 CPUs and 30 GB RAM"

GCP figures out how to requisition the necessary hardware
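
A minimal sketch of how that "8 CPUs and 30 GB RAM" request can be written down as a custom machine type name. It assumes the N1-style naming scheme custom-CPUS-MEMORY_IN_MB; the function name custom_machine_type is made up for illustration and is not part of any Google tool.

```shell
#!/bin/sh
# Build an N1-style custom machine type name from a vCPU count and memory in GB.
# Assumption: the name has the form custom-<vCPUs>-<memory in MB>.
custom_machine_type() {
    cpus=$1
    memory_mb=$(($2 * 1024))   # the name encodes memory in MB, not GB
    printf 'custom-%s-%s\n' "$cpus" "$memory_mb"
}

# The example from the notes: 8 CPUs and 30 GB RAM.
custom_machine_type 8 30
```

The resulting name should be usable as the machine type when creating an instance from the command line, though the allowed CPU/memory combinations are restricted, so check the custom machine types page linked above before relying on a particular pairing.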

Using a node for long periods of time leads to steeper discounts

Preemptible virtual machine: https://cloud.google.com/preemptible-vms/

Get an 80% discount if you agree to give the VM up whenever someone else is willing to pay full price for it

Why do this? Hadoop jobs are fault tolerant (if a machine goes down, the data is redistributed)

Example: Dataproc cluster (Dataproc is the Google Cloud version of Hadoop)

Can use 10 standard VMs as your "backbone", and then use 30 preemptible VMs

If the preemptible VMs go down, no problem - Hadoop is designed to be robust to hardware going down

This makes it roughly 4x faster than 10 VMs alone (40 VMs instead of 10, assuming the job scales linearly), and you get the 80% discount on 30 of those VMs
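
The arithmetic behind that claim, as a rough sketch. It assumes perfectly linear scaling and expresses prices as a percentage of one standard VM-hour so that integer shell arithmetic is enough; the variable names are illustrative.

```shell
#!/bin/sh
# Rough cost/speed comparison for a 10 standard + 30 preemptible VM cluster.
# Assumptions: linear scaling; prices in "percent of one standard VM-hour".
standard_vms=10
preemptible_vms=30
standard_price=100      # one standard VM-hour = 100 units
preemptible_price=20    # an 80% discount leaves 20 units

total_vms=$((standard_vms + preemptible_vms))
speedup=$((total_vms / standard_vms))
mixed_cost=$((standard_vms * standard_price + preemptible_vms * preemptible_price))
full_price_cost=$((total_vms * standard_price))

echo "speed-up vs. ${standard_vms} VMs alone: ${speedup}x"
echo "hourly cost: $((mixed_cost / 100)) standard VM-hours (vs. $((full_price_cost / 100)) at full price)"
```

Under the same assumptions, a job that takes one hour on the 10 standard VMs (10 VM-hours) would take about 15 minutes on the mixed cluster and cost about 4 standard VM-hours.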

Lab: Starting Compute Engine

Clicking "Compute Engine" in the side menu of Google Cloud control panel automatically activates Compute Engine

Lists VM Instances

Create a new instance called my-first-instance

Link to info on free compute nodes: https://cloud.google.com/compute/pricing

Extensive list of options when creating a new node:

Machine type - up to 64 cores, custom amount of memory, can even choose CPU architecture (Skylake/Broadwell), GPUs
Boot disk and OS - several options, Debian, Ubuntu, CentOS, CoreOS, SUSE, Windows Server; can also ask for several different disk sizes
Identity and API access - drop-down to select different service accounts; can turn on/off access for different Cloud APIs (BigQuery, BigTable Admin, BigTable Data, Cloud Datastore, PubSub, Cloud SQL, etc.)
Management - can set startup scripts, set labels (arbitrary key-value pairs to help organize instances, e.g., production/staging/development, environments, services), set metadata (arbitrary key-value pairs too...?), set whether instance is preemptible (24 hrs max)
Disks - can set disk encryption, encryption keys, add additional disks
Networking - add additional networking devices
SSH - can copy and paste an SSH key for passwordless access (you copy a public key from the computer that will be SSHing into the compute instance)

More info on the SSH key thing: https://cloud.google.com/compute/docs/instances/adding-removing-ssh-keys?hl=en_US#instance-only

When you add an SSH public key to the metadata of an instance, it allows the person that corresponds to that public key to access the machine. (In other words, any public key you add to the instance's SSH keys goes into the SSH list of authorized keys)

$ cat ~/.ssh/id_rsa.pub

Then copy and paste the output of this into a new SSH key. Note that this will automatically populate a username that corresponds with that SSH key, based on the username/contact details. You need to use this username, not root/other.

Example startup script to install and run an apache server: https://cloud.google.com/compute/docs/startupscript?hl=en_US

I set a simple startup script on this instance to install git and cowsay:

#!/bin/sh
# Refresh the package index first so the install can find the packages
apt-get update
apt-get install -y git cowsay

Initially I tried to SSH into the compute instance using the username root:

$ ssh root@<ip-address-of-instance>
permission denied (publickey)

This failed because the username I was using did not match the username corresponding to the public key I had initialized the compute instance with.

Tried changing the SSH public key while the compute instance was running - this is possible to do and pretty easy:

Went into GC control panel for Compute Engine Virtual Machines
Found my-first-instance
Clicked Edit
Looked pretty much exactly like the setup options page
Added my RSA public key from Cronus
Clicked Save

No dice, still not working. (The issue, as I discovered later, was not using the correct username that corresponded to the SSH key.)

Google Cloud takes care of the SSH keys if you connect using the web panel, or using gcloud command line tool. You have to manage SSH keys manually if you're connecting via SSH manually. I wanted to make sure I could SSH into the compute instance manually.

This page describes the ssh command syntax to specify which private key to use: https://cloud.google.com/compute/docs/instances/connecting-to-instance

$ ssh -i /path/to/private/key <user>@<ip-of-compute-instance>

(This is convenient if you want to create a new, separate public key specific to different compute instances.)

This page describes where to find your public/private key pair: https://cloud.google.com/compute/docs/instances/adding-removing-ssh-keys#locatesshkeys

(All fine so far, no new information. Still not working. ssh -vv doesn't reveal any obvious problems.)

Solution: the issue was with the username I was using. When I copied and pasted my public SSH key into the list of SSH keys on the compute instance, in the Google Cloud control panel, it automatically populated the username with "charlesreid1" based on the email address associated with the public key.

All I had to do was SSH with that username:

$ ssh -v -i .ssh/id_rsa charlesreid1@<ip-of-compute-instance>

and voila!
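
To guess ahead of time which username will be attached to a key, one can look at the comment field at the end of the public key. This is only a sketch: it assumes the comment looks like user@host or an email address, and that the control panel derives the username from it (the notes above only establish that the username came from the email address associated with the key). The helper name key_username and the sample key are made up.

```shell
#!/bin/sh
# Print the username implied by the comment field of an OpenSSH public key line.
# Assumption: the comment (third field) looks like user@host or an email address.
key_username() {
    # Field 1 is the key type, field 2 the base64 key body, field 3 the comment.
    comment=$(printf '%s\n' "$1" | awk '{print $3}')
    # Keep everything before the first "@".
    printf '%s\n' "${comment%%@*}"
}

# Hypothetical key line for illustration only (the key body is not a real key).
sample='ssh-rsa AAAAB3NzaC1yc2EXAMPLE alice@example.com'
key_username "$sample"
```

For a real key, pass the contents of the public key file as the argument, e.g. key_username "$(cat ~/.ssh/id_rsa.pub)", and compare the result against the username shown in the control panel.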

Adding cowsay to the list of software that's initially installed, since I'm not sure if git is automatically installed...

Hit "reset" to reset the instance from scratch...

Everything worked like a charm.

Global Filesystem

We're talking about using data in Cloud SQL, BigQuery, and Dataproc (Google's versions of large scale big data stuff like Hadoop, Hive, Sqoop, Pig, etc etc)

We want to get data from "out there in the world" into the cloud. The problem is, when you allocate a compute engine, you allocate a disk associated with that compute engine - and when the compute engine goes away, the disk goes away too. Plus, persistent disk space is expensive anyway.

Instead, store your data in "Cloud Storage". This stores raw data and stages it for other products. This storage is durable, persistent storage that can be easily replicated across other nodes and utilized in other GCP products (Cloud SQL, BigQuery, Dataproc).

Your first step in the journey of doing big data in the cloud is to get your data into cloud storage. To do that, use gsutil.

Interacting with Cloud Storage

Main article: Gsutil

The simplest way to interact with Cloud Storage is the gsutil command line utility (install it using the Google-provided installer).

Note: you can also use a programming language, the GCP Console, or the REST API.

Now, can use gsutil command line and utilities like cp, rm, mv, ls, mb, rb, rsync, acl...

To copy to Google Storage (GS), run a command like:

$ gsutil cp sales*.csv gs://acme-sales/data/

This copies data into GS buckets. The folder structure is purely convenience.

Buckets are like a domain name in your GCP project. Bucket name must be unique. Typically related to business/company domain name (GCP will require you to prove you own the domain name). Or, you can use a unique "dummy" bucket name.

This will be a recurring pattern: anything you can do from gsutil command line just invokes a REST API. Anything that can be done with a REST API can also be done from any language that speaks HTTP (just about any).
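
A minimal sketch of that pattern for listing a bucket, roughly what gsutil ls does. It assumes the object-listing endpoint of the Cloud Storage JSON API and an access token obtained from gcloud auth print-access-token; it only prints the request rather than sending it. The function names are made up for illustration, and acme-sales is the example bucket from above.

```shell
#!/bin/sh
# URL of the Cloud Storage JSON API endpoint that lists the objects in a bucket.
list_objects_url() {
    printf 'https://storage.googleapis.com/storage/v1/b/%s/o\n' "$1"
}

# Print (do not run) a curl command roughly equivalent to: gsutil ls gs://BUCKET
# Assumption: gcloud is installed and authenticated so it can supply a token.
show_list_request() {
    printf 'curl -H "Authorization: Bearer $(gcloud auth print-access-token)" "%s"\n' \
        "$(list_objects_url "$1")"
}

show_list_request acme-sales
```

Printing the command instead of running it keeps the sketch safe to try anywhere; pasting the printed line into a shell with working credentials would send the actual request.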

Data Handling

Transfer services: useful for ingestion of data from data center, local system, AWS buckets, other sources. Can be one-time or recurring.

Cloud storage as staging area: useful for importing data into analysis tools and databases. Also useful for staging to disk for fast access.

Bucket access control: project-level (only editors of projects can add/remove files from a bucket), bucket-level, and object-level access control. Can control who is responsible for paying. Can make buckets publicly accessible (take advantage of reliability/caching/speed of Google data centers to create a content distribution network).

Zones and Regions

Can control zones and regions where data is located

If speed important, can choose closest zone and region to increase speed

If reliability important, can choose to distribute data across zones of a particular region in case one center has interruption in service

If global access important, can distribute apps across multiple regions

Lab: Interacting with Cloud Storage

Running the lab "Interact with Cloud Storage"

Multi-step process:

Ingesting data into a compute engine instance
Transforming data on the compute engine instance
Storing data in persistent cloud storage
Publishing data to the web via cloud storage

Use git to clone repository with instructions/data/scripts

The behind-the-scenes procedure is as follows:

download earthquake data using wget (ingest.sh)
install extra Python goodies (install_extras.sh)
transform data using a Python script (creates a basemap projection, then plots lat/long locations of earthquakes using dots scaled to magnitude of earthquake, colors indicate "class" of magnitudes 1-3, 3-5, 5+) (transform.py)

This results in an image file. There is already an HTML file that will serve up the image on a web page.
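
The magnitude "classes" used for coloring can be sketched as a small shell function. This is not the lab's transform.py, just an illustration of the binning; where the boundaries fall (3 and 5 counted in the higher class, anything under 3 lumped into 1-3) is an assumption, and the name magnitude_class is made up.

```shell
#!/bin/sh
# Map an earthquake magnitude to one of the three classes: 1-3, 3-5, 5+.
# Assumption: boundary values (3 and 5) belong to the higher class.
magnitude_class() {
    # awk handles the floating-point comparison that plain sh cannot do.
    awk -v m="$1" 'BEGIN {
        if (m >= 5)      print "5+";
        else if (m >= 3) print "3-5";
        else             print "1-3";
    }'
}

for magnitude in 2.0 4.2 6.1; do
    echo "$magnitude -> $(magnitude_class "$magnitude")"
done
```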

Now back to the GCP console - create a storage bucket by going to storage in LHS menu.

Create bucket, and pick zone/region.

Called it mah-bukkit

Now it takes you to an interface where you can actually upload files directly from the browser (looks almost like Dropbox).

Finally, we can use gsutil to copy the image file and HTML file to the bucket. (gsutil is automatically installed on all Google Cloud compute instances.)

$ gsutil cp earthquakes* gs://mah-bukkit/
Copying file://earthquakes.csv [Content-Type=text/csv]...
AccessDeniedException: 403 Insufficient OAuth2 scope to perform this operation.
Acceptable scopes: https://www.googleapis.com/auth/cloud-platform

Turns out, when you create your virtual machine instance, you have to specify permissions for each API. This was something I left as default originally.

The first option is to set the service account - this is what allows you to control access to different buckets for different people.

The second option relates to API access. I left it as "default access," which does not allow the compute instance to access very many APIs. I changed this to "Allow full access to all Cloud APIs" (can also set access for each API individually).

Note that this can't be changed while running, you have to shut down the instance to change the API access for a compute instance.

Note that this may still fail with the same 403 mentioned above. If so, it's because gsutil is using crusty credentials. Reset them via:

$ rm -rf ~/.gsutil

then try again:

$ gsutil cp earthquakes* gs://mah-bukkit/
Copying file://earthquakes.csv [Content-Type=text/csv]...
Copying file://earthquakes.htm [Content-Type=text/html]...
Copying file://earthquakes.png [Content-Type=image/png]...
- [3 files][660.4 KiB/660.4 KiB]
Operation completed over 3 objects/660.4 KiB.
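
The 403 above comes down to which OAuth scopes the instance was created with. A small sketch of a check, assuming a list of scope URLs (one per line) on standard input; the sample list is illustrative, not captured output, and the function name can_write_storage is made up.

```shell
#!/bin/sh
# Succeed if the scope list on stdin (one URL per line) permits writing to Cloud Storage.
can_write_storage() {
    grep -Eq '/auth/(cloud-platform|devstorage\.(read_write|full_control))$'
}

# Scopes resembling a default-access instance (illustrative sample only).
printf '%s\n' \
    'https://www.googleapis.com/auth/devstorage.read_only' \
    'https://www.googleapis.com/auth/logging.write' |
    can_write_storage && echo "write allowed" || echo "write denied"
```

On a running instance, the real scope list should be available from the metadata server (a request to http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/scopes with the header Metadata-Flavor: Google), and its output can be piped into the same function.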

Now go to the cloud storage console, click the bucket, check "share publicly", and get the link.

https://storage.googleapis.com/mah-bukkit/earthquakes.htm

Bingo!
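
The public link follows directly from the bucket and object name, so it can be derived from the gs:// path. A minimal sketch, with a made-up function name; it does not URL-encode special characters, so it assumes plain object names like the ones used here.

```shell
#!/bin/sh
# Turn a gs://bucket/object path into the matching public storage.googleapis.com URL.
# Assumption: the object name needs no URL-encoding.
public_url() {
    path=${1#gs://}    # strip the gs:// scheme, leaving bucket/object
    printf 'https://storage.googleapis.com/%s\n' "$path"
}

public_url gs://mah-bukkit/earthquakes.htm
```

The link only works once the object has been shared publicly, as in the step above.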

Cloud Shell

Because what we were doing here was relatively simple, and just involved shuffling some scripts around and copying data to cloud storage, it is overkill to allocate an entire compute instance to do that, and have to wait for it to start up and shut down, etc.

Instead, we could use the cloud shell - this is a serverless instance that can be used to do minor tasks.

Here's how this works: this is like a head node on a cluster, where you get "free" cycles to do minor tasks. Here's what you get:

MicroVM
Single 2.2 GHz Intel Xeon CPU
5 GB persistent storage in your home directory (place to save files) - stuff is already present! You get your "cloud home directory" (commonly-used scripts, repos, code, etc.)
Access to basic tools like gsutil, cloud/app engine sdks, docker, git, build tools, etc.
Access to languages: Python, Java, Go, and Node

The shell works like a Lish shell, opens within a split screen in the browser.

Can use this to launch serverless operations, requisition nodes, perform gsutil tasks, etc.

Resources

Module 1 Resources

Code labs for this course: https://codelabs.developers.google.com/cpb100

About google data centers: https://www.google.com/about/datacenters/

Whitepaper on Google's security practices (i.e., why you can trust Google to handle your cloud stuff): https://cloud.google.com/security/whitepaper

Module 2 Resources

Compute engine: https://cloud.google.com/compute

Storage: https://cloud.google.com/storage

Pricing calculator: https://cloud.google.com/pricing

Cloud launcher: https://cloud.google.com/launcher

YouTube video on Compute Instance vs Container Engine vs App Engine vs Cloud Functions: https://www.youtube.com/watch?v=g0dN8Hkh5H8

Cloud launcher:

Shortcut for getting a compute engine VM with preconfigured software ready to go
Google click to deploy is a Google-maintained VM image with the software already installed and ready to go

GCDEC/Fundamentals/Notes

From charlesreid1
