From charlesreid1

Line 105: Line 105:
==Searching for Correlations==
==Searching for Correlations==


=Flags=
==Flags==


[[Category:Google Cloud]]
[[Category:Google Cloud]]

Revision as of 22:56, 8 January 2018

Main link: https://google.qwiklabs.com/focuses/5689?locale=en

Qwiklab

Summary

This lab combines public data sets from NOAA and 311 in NYC. The end goal is to create a model for correlations between number of calls and weather.

Aspects of Google Cloud highlighted for scientists: serverless, easy to use, scalable, sharable

Steps

Overview of the basic steps:

  • Set up a Google Cloud account (follow qwiklabs instructions, or use own)
  • Explore weather data and 311 data in BigQuery
  • Find correlation between weather and complaints

Setting Up Google Cloud Account

Won't cover this here. Use a free trial or use the Qwiklab-provided account.

Exploring Data

To enable the NOAA weather data set, navigate to the BigQuery Console and pick your project.

Now on the left, where the project name is listed, there is a triangle drop-down button next to it.

Click that and pick "Switch to project" and "Display project..." to display the weather data set.

Before running SQL queries, check to make sure you are running SQL queries in legacy mode.

Weather Data

Here is a sample SQL query:

SELECT
  -- Create a timestamp from the date components.
  stn,
  TIMESTAMP(CONCAT(year,"-",mo,"-",da)) AS timestamp,
  -- Replace numerical null values with actual null
  AVG(IF (temp=9999.9,
      null,
      temp)) AS temperature,
  AVG(IF (wdsp="999.9",
      null,
      CAST(wdsp AS Float64))) AS wind_speed,
  AVG(IF (prcp=99.99,
      0,
      prcp)) AS precipitation
FROM
  `bigquery-public-data.noaa_gsod.gsod20*`
WHERE
  CAST(YEAR AS INT64) > 2010
  AND CAST(MO AS INT64) = 6
  AND CAST(DA AS INT64) = 12
  AND (stn="725030" OR  -- La Guardia
    stn="744860")    -- JFK
GROUP BY
  stn,
  timestamp
ORDER BY
  timestamp DESC,
  stn ASC

Let's break it down to understand what it is doing...

  • First, we are selecting a timestamp, plus averages of 3 fields, for each station that is returned by our query. That's the bulk of the SELECT statement.
  • The FROM part of the query indicates which table we're taking data from - the range of our historical data.
  • We want to limit the data to a particular date, so we limit to each June 12 after 2010.
  • Finally, we limit the results to the LaGuardia and JFK airport stations.

So, what does this query do? It plots weather conditions at NYC's two airports for June 12, year over year, starting in 2011


311 Data

To query the NYC 311 data, which is what the original qwiklab was analyzing, we can use the following query:

SELECT
  EXTRACT(YEAR
  FROM
    created_date) AS year,
  complaint_type,
  COUNT(1) AS num_complaints
FROM
  `bigquery-public-data.new_york.311_service_requests`
GROUP BY
  year,
  complaint_type
ORDER BY
  num_complaints DESC

What does this query do? It is asking for information about 311 complaints - specifically, it is counting the number of complaints per year, grouped by complaint type.

  • The SELECT statement extracts the year from the timestamp when the complaint was created
  • The FROM statement specifies that we want to use NYC 311 data
  • GROUP BY means we group the results according to these two dimensions - year and complaint_type
  • We order the results by number of complaints, largest to smallest.


Searching for Correlations

Flags