Lecture Notes: Web Scraping
Data is not always neatly available as a downloadable CSV (or similar) file. Much third-party and local government data is available only by viewing a web page. A data scientist might first check whether the site offers a web API, but many such sites do not provide one either. However, any content that can be viewed in a browser can be "scraped" from the page programmatically. Numerous automated scraping platforms are available (e.g., Kimono, import.io, and a scraping add-on for Google Chrome offered through the Chrome Web Store), but they are often insufficient or inconvenient to use. Therefore, a data scientist must know how to programmatically retrieve data from a static web page by parsing its HTML and extracting the content. Once retrieved, the content can be saved in another format, such as CSV, for archival purposes.
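The retrieve, parse, and save steps described above can be sketched with nothing but the Python standard library. This is a minimal illustration, not a definitive implementation: the names TableScraper, fetch_html, scrape_table, and rows_to_csv are hypothetical, the sample markup is invented for the example, and the sketch assumes the data of interest sits in a plain HTML table on a static page encoded as UTF-8.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # row currently being read, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
            self._cell = None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def fetch_html(url):
    """Download a page and return its markup as text (assumes UTF-8)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_table(html):
    """Parse HTML text and return its table rows as lists of strings."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Serialize the scraped rows as CSV text for archiving."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Invented sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Permit</th><th>Status</th></tr>
  <tr><td>A-101</td><td>Approved</td></tr>
  <tr><td>B-202</td><td>Pending</td></tr>
</table>
"""

rows = scrape_table(SAMPLE_HTML)
csv_text = rows_to_csv(rows)
print(csv_text)
```

For a live page, the result of fetch_html(url) would be passed to scrape_table in place of SAMPLE_HTML. Real pages are often messier than this sample (nested tables, content filled in by JavaScript), and in those cases a dedicated parsing library may be a better fit than this sketch.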