Google Cloud/Weather Data in BigQery
From charlesreid1
Main link: https://google.qwiklabs.com/focuses/5689?locale=en
Qwiklab
Summary
This lab combines public data sets from NOAA and 311 in NYC. The end goal is to create a model for correlations between number of calls and weather.
Aspects of Google Cloud highlighted for scientists: serverless, easy to use, scalable, sharable
Steps
Overview of the basic steps:
- Set up a Google Cloud account (follow qwiklabs instructions, or use own)
- Explore weather data and 311 data in BigQuery
- Find correlation between weather and complaints
Setting Up Google Cloud Account
Won't cover this here. Use a free trial or use the Qwiklab-provided account.
Exploring Data
To enable the NOAA weather data set, navigate to the BigQuery Console and pick your project.
Now on the left, where the project name is listed, there is a triangle drop-down button next to it.
Click that and pick "Switch to project" and "Display project..." to display the weather data set.
Before running SQL queries, check to make sure you are running SQL queries in legacy mode.
Weather Data
Here is a sample SQL query:
SELECT
-- Create a timestamp from the date components.
stn,
TIMESTAMP(CONCAT(year,"-",mo,"-",da)) AS timestamp,
-- Replace numerical null values with actual null
AVG(IF (temp=9999.9,
null,
temp)) AS temperature,
AVG(IF (wdsp="999.9",
null,
CAST(wdsp AS Float64))) AS wind_speed,
AVG(IF (prcp=99.99,
0,
prcp)) AS precipitation
FROM
`bigquery-public-data.noaa_gsod.gsod20*`
WHERE
CAST(YEAR AS INT64) > 2010
AND CAST(MO AS INT64) = 6
AND CAST(DA AS INT64) = 12
AND (stn="725030" OR -- La Guardia
stn="744860") -- JFK
GROUP BY
stn,
timestamp
ORDER BY
timestamp DESC,
stn ASC
Let's break it down to understand what it is doing...
- First, we are selecting a timestamp, plus averages of 3 fields, for each station that is returned by our query. That's the bulk of the SELECT statement.
- The FROM part of the query indicates which table we're taking data from - the range of our historical data.
- We want to limit the data to a particular date, so we limit to each June 12 after 2010.
- Finally, we limit the results to the LaGuardia and JFK airport stations.
So, what does this query do? It plots weather conditions at NYC's two airports for June 12, year over year, starting in 2011
311 Data
To query the NYC 311 data, which is what the original qwiklab was analyzing, we can use the following query:
SELECT
EXTRACT(YEAR
FROM
created_date) AS year,
complaint_type,
COUNT(1) AS num_complaints
FROM
`bigquery-public-data.new_york.311_service_requests`
GROUP BY
year,
complaint_type
ORDER BY
num_complaints DESC
What does this query do? It is asking for information about 311 complaints - specifically, it is counting the number of complaints per year, grouped by complaint type.
- The SELECT statement extracts the year from the timestamp when the complaint was created
- The FROM statement specifies that we want to use NYC 311 data
- GROUP BY means we group the results according to these two dimensions - year and complaint_type
- We order the results by number of complaints, largest to smallest.