Assignment 9
Learning Objectives
- design a relational data storage structure
- implement ETL for data
- create a SQLite database
- connect to SQLite from R
- create tables using SQL scripts
- add data to tables via CSV loads
- retrieve data into R data frame using SQL
Data Files & Tools
Tasks
The objective of this assignment is to learn how to store data in a relational database and retrieve it for processing in R. Revisit the Bird Strike data set that we previously worked with. The data file contains the following data fields (ignore any additional fields in the file and limit your database to these). Note that not all fields have values. If a field does not have values, set it to empty. The name is parenthesis is the field name you will use in the database in Question (1).
Questions 1 to 3 must be in a single R Notebook that is separate from the R Notebook for Question 5.
- Aircraft Type (AircraftType)
- Airport Name (DepAirport)
- Effect: Impact to Flight (Impact)
- Flight Date (Date)
- Record ID (ID)
- Airline/Operator (Airline)
- Origin State (OriginState)
- Feet above Ground (Ft)
- New Field: Altitude
Questions 1 to 3 must be in a single R Notebook that is separate from the R Notebook for Question 5.
- (10 Points) Create a separate notebook file for this question. This is the extraction, loading, and transform (ETL) phase. Load the full data file into R and then extract the proper columns. After that, deal with missing values.
- (10 Points) Perform a min-max normalization on the 'Feet Above Ground' field and place the value into the field Altitude.
- (10 Points) Save the new data frame to a CSV file.
- (35 Points) After installing SQLite (or MySQL), write a SQL script to create a table for your database using the fields described above. Create the database using the script and then load the data from the CSV file from (3) file into the table using a batch load directly into the database. Note that different versions of SQLite have different commands for loading an existing, i.e., previously created, database into a current session. On some versions the command is .open while on others it is .load. To find out what commands are available, use .help or browse the documentation on sqlite.org.
- (35 Points) In a separate notebook, connect to the database created in (4) and then build SQL SELECT statements for the following queries and execute them from within R to show that the data is loaded properly into the database (display the returned data after each query):
a) (5 Points) What is the total number of birdstrike incidents?
b) (10 Points) List the number of birdstrike incidents by airline. c) (5 Points) How many birdstrike incidents occurred on departures (origin airport) from Boston? d) (5 Points) How many birdstrike incidents occurred above 10,000 feet? e) (10 Points) How many birdstrike incidents occurred each month for a given year? |
Deliverables & Submission Instructions
Submit the SQL scripts and the two .Rmd plus your .nb.html file generated by the R Notebooks combined into a zip file. Upload the zip file to Blackboard.
Scoring
Total Number of Earnable Points: 100
Approximate Time to Complete: 4-6 hours
Due Date: see Calendar or Blackboard
Approximate Time to Complete: 4-6 hours
Due Date: see Calendar or Blackboard