GCDEC/Fundamentals/Notes: Difference between revisions
From charlesreid1
Revision as of 22:42, 19 September 2017
Notes
Module 1
What is the course about
Interesting question: why would Google be in the business of cloud computing?
Mission statement: to organize the world's information and make it accessible
Reason for being in cloud computing: organizing information and making it accessible requires a massive amount of infrastructure
1 out of every 5 CPUs that is produced in the world is bought by Google
Organizing information:
GFS and Hadoop:
- GFS (2002) was originally an idea for organizing lots of files/information across large clusters; it in turn led to Hadoop HDFS (which is based on GFS)
- MapReduce came out of Google around 2004
- But, by 2006, Google was no longer writing any MapReduce programs
- Why?
- MapReduce and HDFS require sharding - distributing your data set across a cluster - which means that the size of your data sets and the size of your cluster are intimately linked
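To make the sharding point concrete, here is a minimal sketch, assuming a naive hash-modulo placement scheme. This is an illustration only, not how HDFS or GFS actually place blocks. The names shard_for and shard_records and the record keys are hypothetical. It shows that when the number of nodes changes, most records land on a different shard, which is one way data layout and cluster size end up tied together.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a record key to a shard index using a stable hash (illustrative scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def shard_records(keys, num_shards):
    """Group keys by the shard they are assigned to."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


# Hypothetical data set of 1000 records.
keys = [f"record-{i}" for i in range(1000)]

# Assignment on a 4-node cluster versus a 5-node cluster.
before = {k: shard_for(k, 4) for k in keys}
after = {k: shard_for(k, 5) for k in keys}

# Count how many records would have to move if one node were added.
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} records change shard when going from 4 to 5 nodes")
```

Under this naive scheme, resizing the cluster forces a large reshuffle of the data, so compute capacity cannot be scaled independently of where the data lives.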
Note: link to all papers is here: https://research.google.com/pubs/papers.html
Google Data Technologies:
- GFS - 2002 (basis for HDFS)
- MapReduce - 2004 (basis for Hadoop - abandoned)
- BigTable - 2006
- Dremel - 2008 (replaced MapReduce, available in GCP as BigQuery)
- Colossus - 2009 (replacement for GFS)
- Flume - 2010 (replaced MapReduce, available in GCP as DataFlow)
- Megastore - 2011
- Spanner - 2012
- Millwheel - 2013 (also part of DataFlow)
- PubSub - 2013 (available in GCP as itself)
- F1 - 2014
- TensorFlow - 2015 (available in GCP as CloudML)
Various innovations coming out of Google are being released into Google Cloud
Elastic computing concept - you should be able to "instantaneously" scale out to as many machines as you need
Purpose of switching to the cloud:
- Uptime, keeping hardware up and running
- Making teams more efficient and effective
- Having the entire Google data stack available to leverage the best software available
Big Data products
Spotify uses two products: PubSub and Dataflow
PubSub is a messaging system, Dataflow is a data pipeline tool
Using GCP big data products helps companies:
- pay less per operation
- be more efficient (better tooling)
- be more innovative and powerful (big stack of data tools)
BigQuery: reducing 2.2 BILLION items to 20K items in <1 min (transformational promise of the cloud)
A functional view:
- Foundation
- Compute Engine, Cloud Storage
- Databases
- Datastore, Cloud SQL, Cloud BigTable
- Analytics and Machine Learning
- BigQuery, Cloud Datalab, Translate API etc.
- Data-Handling Frameworks
- Cloud PubSub, Cloud Dataflow, Cloud Dataproc
Why the forked approach?
Google is trying to solve SEVERAL DIFFERENT problems
Changing where people are computing
- Keep doing the same things you're doing already, but changing where you're doing them
- Each tool addresses different things that people are already doing on-premises (and would not require a change in CODE, just a change in LOCATION)
- Cloud databases - (migrating DBs) Cloud SQL (relational databases), Cloud Datastore (NoSQL document database), Cloud BigTable (NoSQL wide-column/key-value database)
- Storage platform - (migrating storage) Cloud Storage Standard, Durable Reduced Availability
- Managed Hadoop/Spark/Pig/Hive - (migrating data processing) Cloud Dataproc
Providing speed, scalability, and reliability:
- Want to provide scalable and reliable services (like Spotify)
- Need to be able to justify using hundreds of machines for a few minutes, rather than a smaller number of machines that take much, much longer
- Messaging - Cloud PubSub
- Data Processing - Cloud Dataflow, Cloud Dataproc
Changing how computation is done:
- Utilizing tools provided by Google to do new things, analyze more data, analyze in a different way, build better models
- Examples: analyze customer behavior, analyze factory floors
- There are basically three use-cases that typically play out
- Data exploration and business intelligence - Cloud Datalab, Cloud Data Studio
- Data Warehouse for large-scale analytics - Google BigQuery
- Machine learning - Cloud Machine Learning, Vision API, Speech API, Translate API
Summary: three principal use-cases for GCP
- (Based on what Google sees in their professional services organization)
- Migrations - changing where they compute
- Scale up and reliability - making a service more scalable/reliable
- Transforming business - adding new ways to deal with more data
Usage Scenarios
Google Cloud platform usage scenarios: review
- Change where you compute (migration to the cloud)
- Scalability and reliability (flexible platform that can scale)
- Change how you compute (explore, analyze, extract information differently)
Usage scenarios:
- Changing where you compute: Movie company using cloud platform for scaling up rendering (can requisition more machines)
- Scalability and reliability: Finance company performing consolidated audit (data repository of all equities, options, orders, quotes, events on stock market) - 6 TB per hour, 100 BILLION market events (a HUMONGOUS amount of data that needs to be processed at scale, and none of it can be lost)
- Changing how you compute: Rooms to Go (furniture retailer) combined CRM database and website, BigQuery analysis, redesign room packages
Spend less on ops/admin
Incorporate real-time data into apps/architecture
Apply machine learning broadly
Create citizen data scientists (putting tools into hands of domain experts)
This means your company can become a data-driven organization - decision-makers (domain experts) are no longer waiting for the data, they can deal with and see the data themselves to make the decisions and move forward
Labs
List of code labs: https://codelabs.developers.google.com/cpb100
Signing up for free trial (req. CC): https://console.developers.google.com/freetrial
Note: they specifically say that you get $300 in credit over 60 days and will not be charged.
Need to "activate" compute engine.
Quiz
Goals:
- Get compute instance fired up, figure out how to use the control panel (the control panel is a bit overwhelming at first, but once you've gone through the process of creating the compute engine, you get the hang of it. Between this and the theoretical coverage of which products do what, the myriad options become a lot more manageable.)
- Load a computational combustion data set into a graph database
- Load a Docker image into a Google Cloud compute instance
- Utilize Apache Giraph to perform a graph analysis
Module 2
Three components of computing systems:
- Computing
- Storage
- Networking
The foundations of Google Cloud are the computing and storage:
- Compute engine
- Cloud storage
(Network layer is mostly transparent)
GCP can be thought of as an earth-scale computer
CPUs provided by compute engine virtual machines
Hard drive/storage is provided by cloud storage
Network connections are provided by the global private network (invisible layer)
Design is based on the scalable, no-ops idea
Custom machine types: https://cloud.google.com/custom-machine-types/
Compute engine pricing: https://cloud.google.com/compute/pricing
Can use preconfigured machine types (set price), or can use custom machine types (custom cores/memory, variable price)
Think about it abstractly: "I want a virtual machine that has 8 CPUs and 30 GB RAM"
GCP figures out how to requisition the necessary hardware
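As a sketch of what the "8 CPUs and 30 GB RAM" request looks like in practice: to my understanding, Compute Engine custom machine types are named custom-VCPUS-MEMORY_MB, with the vCPU count being 1 or an even number and memory a multiple of 256 MB. Those constraints are assumptions taken from the custom machine type documentation as I recall it, so check the current docs before relying on them. The helper name custom_machine_type is hypothetical.

```python
def custom_machine_type(vcpus: int, memory_gb: float) -> str:
    """Build a custom machine type name of the form custom-VCPUS-MEMORY_MB.

    The validation rules below are assumptions based on the documented
    constraints for custom machine types; verify against current docs.
    """
    memory_mb = round(memory_gb * 1024)
    if vcpus < 1 or (vcpus != 1 and vcpus % 2 != 0):
        raise ValueError("vCPU count must be 1 or an even number")
    if memory_mb % 256 != 0:
        raise ValueError("memory must be a multiple of 256 MB")
    return f"custom-{vcpus}-{memory_mb}"


# The example from the notes: 8 CPUs and 30 GB RAM.
print(custom_machine_type(8, 30))  # custom-8-30720
```

The point is that you describe the shape of the machine you want, and the platform takes care of finding hardware that fits.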
Using a node for long periods of time leads to steeper discounts
Preemptible virtual machine: https://cloud.google.com/preemptible-vms/
Get an 80% discount if you agree to give the machine up whenever someone else pays full price for it
Why do this? Hadoop jobs are fault tolerant (if a machine goes down, the data is redistributed)
Example: Dataproc cluster (Dataproc is the Google Cloud version of Hadoop)
Can use 10 standard VMs as your "backbone", and then use 30 preemptible VMs
If the preemptible VMs go down, no problem - Hadoop is designed to be robust to hardware going down
This makes the job roughly 4x faster than 10 VMs alone (40 VMs instead of 10, assuming the work scales linearly), and you get the 80% discount on the 30 preemptible VMs
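The arithmetic behind this example can be sketched as follows. This is a back-of-the-envelope estimate under stated assumptions: perfect linear scaling, no preemptions during the job, and a made-up price of 1 unit per VM-hour. The function name cluster_estimate and its parameters are hypothetical.

```python
def cluster_estimate(standard_vms, preemptible_vms, discount=0.80,
                     hourly_price=1.0, baseline_hours=1.0):
    """Estimate speedup and job cost of a mixed cluster versus standard VMs alone.

    Assumes linear scaling and no preemptions (idealized, for illustration).
    """
    total_vms = standard_vms + preemptible_vms
    speedup = total_vms / standard_vms       # 40 / 10 = 4x
    hours = baseline_hours / speedup         # the job finishes in 1/4 of the time
    # Preemptible VMs are billed at (1 - discount) of the full price.
    hourly_cost = hourly_price * (standard_vms + preemptible_vms * (1 - discount))
    return {
        "speedup": speedup,
        "hours": hours,
        "hourly_cost": hourly_cost,
        "job_cost": hourly_cost * hours,
        "baseline_cost": standard_vms * hourly_price * baseline_hours,
    }


result = cluster_estimate(standard_vms=10, preemptible_vms=30)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the mixed cluster costs about 16 price units per hour instead of 10, but it finishes in a quarter of the time, so the whole job costs about 4 units instead of 10. Real jobs will scale less than linearly, and preemptions add rework, so treat these numbers as an upper bound on the benefit.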
Resources
Module 1
Code labs for this course: https://codelabs.developers.google.com/cpb100
About google data centers: https://www.google.com/about/datacenters/
Whitepaper on Google's security practices (i.e., why you can trust Google to handle your cloud stuff): https://cloud.google.com/security/whitepaper