Extra Credit Assignment 2
Learning Objectives
- read and parse JSON
- retrieve data from JSON
- learn to sample data for testing
Data Files
Tasks
Before diving into the programming problems, study the data file that is provided for the assignment:
If the size of the file or the number of records is overwhelming, build a smaller sample dataset for testing that loads faster and is representative of the file (for example, a random sample of rows). This is a common technique when building data loaders, but it is for debugging purposes only: your submitted assignment must work against the complete data set.
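As a rough illustration of the sampling idea, the sketch below draws a reproducible random sample of lines from a text file. The helper name `sample_lines`, the seed, and the throwaway demo file are all assumptions made for this example; they are not part of the assignment data, and the real file may need a different reading strategy.

```r
# Hypothetical helper: return a random sample of n lines from a text file.
# A fixed seed keeps the sample the same between debugging runs.
sample_lines <- function(path, n, seed = 42) {
  lines <- readLines(path, warn = FALSE)
  set.seed(seed)
  # Sample line positions without replacement and keep the original order.
  keep <- sort(sample.int(length(lines), min(n, length(lines))))
  lines[keep]
}

# Demo on a throwaway file with made-up records (not the assignment data).
demo_path <- tempfile(fileext = ".txt")
writeLines(sprintf("record %d", 1:1000), demo_path)

sample_rows <- sample_lines(demo_path, n = 50)
length(sample_rows)
```

Writing the sampled lines to a separate small file with `writeLines()` lets you rerun the loader quickly while debugging, then switch back to the full file before submitting.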
- (30 points) Load the IMDb movie listings from the file linked above. Note that the file is compressed, so you need to figure out how to decompress it in R. Inspect the file and determine how best to load it. This is not an XML file and requires custom string parsing.
- (20 points) Parse the data. You should identify all the fields and their meanings within the file. Place the data into a data frame suitable for further analysis.
- (10 points) Comment your code where you identify the movie rows that are part of your result set.
- (20 points) Your result set should only contain the movie title and movie release year. Your result set should NOT include rows for TV shows. You can identify the movies within the data file (look for a special marking field or some other indication). Make any other assumptions you need, but comment your assumptions. For a cleaner result set, look for duplicate titles and remove the duplicates.
- (20 points) Use correct syntax, a consistent coding style, and a readable code format.
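The general shape of the tasks above (read compressed text, split it into fields, keep only movies, drop duplicates) can be sketched on made-up data. Everything specific here is an assumption for illustration: the gzip compression, the `|` delimiter, the `type` marker column, and the sample records are invented, and the real file's format has to be determined by inspecting it.

```r
# Write a tiny gzip-compressed demo file (invented records, not the real data).
demo_gz <- tempfile(fileext = ".gz")
con <- gzfile(demo_gz, open = "wt")
writeLines(c("movie|Alpha|1999",
             "tv|Beta|2005",
             "movie|Alpha|1999",
             "movie|Gamma|2010"), con)
close(con)

# Read the compressed text line by line through a gzfile connection.
con <- gzfile(demo_gz, open = "rt")
raw_lines <- readLines(con)
close(con)

# Split each line on the assumed delimiter and build a data frame.
fields <- strsplit(raw_lines, "|", fixed = TRUE)
records <- data.frame(
  type  = vapply(fields, `[`, character(1), 1),
  title = vapply(fields, `[`, character(1), 2),
  year  = as.integer(vapply(fields, `[`, character(1), 3)),
  stringsAsFactors = FALSE
)

# Assumption: rows marked "movie" are movies; everything else is excluded.
movies <- records[records$type == "movie", c("title", "year")]

# Remove rows that repeat the same title and year.
movies <- movies[!duplicated(movies), ]
rownames(movies) <- NULL
movies
```

Here duplicates are defined as identical title and year pairs. Deduplicating on title alone would also merge different films that share a name, so whichever rule you choose should be stated as an assumption in your comments.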
Deliverables & Submission Instructions
Submit a file with the .R extension. State all assumptions and give explanations as comments in the .R file wherever needed to help us assess your submission. Name the submission file LAST_FirstInitial_EC2.R; for example, John Smith’s file should be named Smith_J_EC2.R. Make sure that whatever you submit works; no credit will be given for code that does not work. Upload the submission to Blackboard, and follow the R Programming Style Guide.
Scoring
Total Number of Earnable Points: +100
Approximate Time to Complete: 6-10 hours
Due Date: see Calendar or Blackboard