From charlesreid1

Main link: https://google.qwiklabs.com/focuses/5689?locale=en

Qwiklab

Summary

This lab combines public data sets from NOAA and 311 in NYC. The end goal is to create a model for correlations between number of calls and weather.

Aspects of Google Cloud highlighted for scientists: serverless, easy to use, scalable, sharable

Steps

Overview of the basic steps:

  • Set up a Google Cloud account (follow qwiklabs instructions, or use own)
  • Explore weather data and 311 data in BigQuery
  • Find correlation between weather and complaints

Setting Up Google Cloud Account

Won't cover this here. Use a free trial or use the Qwiklab-provided account.

Exploring Data

To enable the NOAA weather data set, navigate to the BigQuery Console and pick your project.

Now on the left, where the project name is listed, there is a triangle drop-down button next to it.

Click that and pick "Switch to project" and "Display project..." to display the weather data set.

Before running SQL queries, check to make sure you are running SQL queries in legacy mode.

Weather Data

Here is a sample SQL query:

SELECT
  -- Create a timestamp from the date components.
  stn,
  TIMESTAMP(CONCAT(year,"-",mo,"-",da)) AS timestamp,
  -- Replace numerical null values with actual null
  AVG(IF (temp=9999.9,
      null,
      temp)) AS temperature,
  AVG(IF (wdsp="999.9",
      null,
      CAST(wdsp AS Float64))) AS wind_speed,
  AVG(IF (prcp=99.99,
      0,
      prcp)) AS precipitation
FROM
  `bigquery-public-data.noaa_gsod.gsod20*`
WHERE
  CAST(YEAR AS INT64) > 2010
  AND CAST(MO AS INT64) = 6
  AND CAST(DA AS INT64) = 12
  AND (stn="725030" OR  -- La Guardia
    stn="744860")    -- JFK
GROUP BY
  stn,
  timestamp
ORDER BY
  timestamp DESC,
  stn ASC

Let's break it down to understand what it is doing...

  • First, we are selecting a timestamp, plus averages of 3 fields, for each station that is returned by our query. That's the bulk of the SELECT statement.
  • The FROM part of the query indicates which table we're taking data from - the range of our historical data.
  • We want to limit the data to a particular date, so we limit to each June 12 after 2010.
  • Finally, we limit the results to the LaGuardia and JFK airport stations.

So, what does this query do? It plots weather conditions at NYC's two airports for June 12, year over year, starting in 2011


311 Data

To query the NYC 311 data, which is what the original qwiklab was analyzing, we can use the following query:

SELECT
  EXTRACT(YEAR
  FROM
    created_date) AS year,
  complaint_type,
  COUNT(1) AS num_complaints
FROM
  `bigquery-public-data.new_york.311_service_requests`
GROUP BY
  year,
  complaint_type
ORDER BY
  num_complaints DESC

What does this query do? It is asking for information about 311 complaints - specifically, it is counting the number of complaints per year, grouped by complaint type.

  • The SELECT statement extracts the year from the timestamp when the complaint was created
  • The FROM statement specifies that we want to use NYC 311 data
  • GROUP BY means we group the results according to these two dimensions - year and complaint_type
  • We order the results by number of complaints, largest to smallest.


Searching for Correlations

For the second part of the lab, create a new table to store results.

Click the triangular drop down arrow next to project name, and pick "Create new dataset".

Click "Compose New Query". Can call it nyc_weather or something.

Step 1: Construct Weather Data Table

Step 1 will be constructing a lookup table of dates and weather characteristics. We will construct a query of the NOAA weather data, and the results of this query will populate the table nyc_weather (through the magic of BigQuery).

SELECT
  -- Create a timestamp from the date components.
  timestamp(concat(year,"-",mo,"-",da)) as timestamp,
  -- Replace numerical null values with actual nulls
  AVG(IF (temp=9999.9, null, temp)) AS temperature,
  AVG(IF (visib=999.9, null, visib)) AS visibility,
  AVG(IF (wdsp="999.9", null, CAST(wdsp AS Float64))) AS wind_speed,
  AVG(IF (gust=999.9, null, gust)) AS wind_gust,
  AVG(IF (prcp=99.99, null, prcp)) AS precipitation,
  AVG(IF (sndp=999.9, null, sndp)) AS snow_depth

FROM
  `bigquery-public-data.noaa_gsod.gsod20*`

WHERE
  CAST(YEAR AS INT64) > 2008
  AND (stn="725030" OR  -- La Guardia
       stn="744860")    -- JFK
GROUP BY timestamp

Again, we're looking at the two airport weather stations.

  • The SELECT statement extracts the year, month, and day as our timestamp. It also asks for the average of several wind, precipitation, and temperature variables over observations for that day.
  • The FROM statement just indicates we are looking at weather data
  • The WHERE forces the query to look at data more recent than 2008 and from JFK or LaGuardia airport weather stations only.

Now the nyc_weather table has a more convenient intermediate query result, which is a list of timestamps and the corresponding weather that day.

Now we can cross-reference this from 311 results, and do so more easily than before.

Flags