Units & Lessons
Unit 1 - Essential Concepts of Data Science
This lesson describes the role of data and the data scientist in decision making and explains the overarching principles of collecting, integrating, and analyzing data. It explains the general concept of "big data" and the common challenges faced when working with large data sets. It also introduces the 6 V's of big data and the CRISP-DM framework for data mining.
Unit 2 - Programming in R for Data Science
R is one of the most powerful statistical programming environments and is often the first choice for data science. Programming in R requires the R toolkit, while development and scripting in R are facilitated through the RStudio IDE.
This lesson introduces basic programming concepts in the R programming environment. In particular, it teaches how to create, display, and manipulate data sets and data frames -- two of the most important object types in R. In addition, functions will be used to organize code. Next, if statements and loops are presented with many examples. Lastly, the lesson shows how to import and export data in text files.
Unit 3 - Data Collection & Integration
This lesson looks more closely at the R programming constructs needed to manipulate data. In particular, the lesson introduces the basic R functions necessary for cleaning data sets and converting them to storable structures ready for statistical analysis or predictive modeling.
In this lesson you will be introduced to the R functions and packages required to import data from a variety of file formats, including comma-separated value (CSV) files, tab-delimited files, Excel files, R object files (.RData), and foreign statistical package files (SPSS, Stata, and SAS). In addition, importing XML-encoded objects using XML parsing is covered.
Data is not always neatly available in CSV, Excel, or text files. A lot of interesting data is published on websites that do not make their data available for download. In this lesson you will learn how to retrieve data from web pages through a process known as "web scraping".
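As an illustration of the idea only, the sketch below pulls the cells of an HTML table out of a page using Python's standard html.parser module. The page fragment, the class name TableCellParser, and the helper scrape_table are made up for this example; a real scraper would first download the page and would usually need to cope with messier markup.

```python
from html.parser import HTMLParser

# Hypothetical page fragment with made-up values; a real scraper would
# download the HTML from a website first.
SAMPLE_HTML = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Springfield</td><td>30000</td></tr>
  <tr><td>Shelbyville</td><td>25000</td></tr>
</table>
"""


class TableCellParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell strings per <tr>
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:   # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._in_cell:
            self.rows[-1][-1] += data


def scrape_table(html):
    """Return the table in `html` as a list of rows of cell text."""
    parser = TableCellParser()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = scrape_table(SAMPLE_HTML)
for row in rows:
    print(row)
```

The first row holds the column headers and the remaining rows hold the data, so the result can be written out as a CSV file or loaded into a data frame.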
Many websites make their data available through APIs, which allow programmers to retrieve specific pieces of data rather than the entire website content. This lesson explains how to retrieve data from web services through an interface known as a "web API".
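A rough sketch of what working with a web API involves: build a request URL from query parameters, then parse the structured (often JSON) response. The endpoint, parameter names, and response layout below are all hypothetical, and the response is a canned string so that the example runs without network access.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; it does not refer to a real service.
BASE_URL = "https://api.example.com/v1/observations"


def build_request_url(base_url, **params):
    """Append query parameters (sorted by name) to the base URL."""
    return f"{base_url}?{urlencode(sorted(params.items()))}"


# Canned, made-up response standing in for what such an API might return.
SAMPLE_RESPONSE = """
{
  "station": "ABC",
  "observations": [
    {"date": "2020-01-01", "temp_c": 3.5},
    {"date": "2020-01-02", "temp_c": 1.0}
  ]
}
"""


def extract_temperatures(payload_text):
    """Parse the JSON payload and return (date, temperature) pairs."""
    payload = json.loads(payload_text)
    return [(obs["date"], obs["temp_c"]) for obs in payload["observations"]]


url = build_request_url(BASE_URL, station="ABC", limit=2)
temperatures = extract_temperatures(SAMPLE_RESPONSE)
print(url)
print(temperatures)
```

Against a real API, the response text would come from an HTTP request to the constructed URL (for example with urllib.request.urlopen), typically with an API key supplied as the provider's documentation specifies.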
Unit 4 - Data Storage
This lesson introduces the relational model for data storage. Relational databases store data in interconnected tables (called relations) that are accessed through the declarative query language SQL. Relational (or SQL) databases are the most common form of data storage other than text files.
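To make the idea of interconnected tables concrete, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory database. The customers and orders tables and their contents are invented for this illustration. The query states what result is wanted (total order amount per customer) and leaves the how to the database engine.

```python
import sqlite3

# An in-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Two related tables: each order refers to a customer through customer_id.
# Table names and rows are made up for this example.
conn.executescript("""
CREATE TABLE customers (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    amount      REAL NOT NULL
);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 5.5), (3, 2, 12.0);
""")

# Join the tables on the shared key and aggregate per customer.
totals = conn.execute("""
SELECT c.name, SUM(o.amount) AS total
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.name
ORDER BY c.name
""").fetchall()

conn.close()
print(totals)
```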
While relational database management systems (RDBMSs) are the workhorses of the data storage world, they are based on a set-based model that is not appropriate for certain kinds of data storage needs. Recently, non-relational databases that use query languages other than SQL have become more common. This lesson looks at key-value, columnar, document, and graph databases as alternatives to relational databases.
Unit 5 - Data Analytics
Descriptive analytics is a preliminary stage of data processing that creates a summary of historical data to yield useful information and possibly prepare the data for further analysis. Data aggregation and data mining methods organize the data and make it possible to identify patterns and relationships in it that would not otherwise be visible. Descriptive analytics provides information about what happened, while predictive analytics attempts to determine what is likely to happen in the future.
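A small sketch of the kind of summary that descriptive analytics produces: grouping historical records and computing counts, means, medians, and ranges per group. The sales records and the summarize helper are invented for this example, and only Python's standard library is used.

```python
import statistics
from collections import defaultdict

# Made-up historical records: (region, sale amount).
sales = [
    ("north", 120.0),
    ("south", 80.0),
    ("north", 100.0),
    ("south", 95.0),
    ("north", 140.0),
]


def summarize(records):
    """Aggregate amounts by region and describe each group."""
    groups = defaultdict(list)
    for region, amount in records:
        groups[region].append(amount)
    return {
        region: {
            "count": len(amounts),
            "mean": statistics.mean(amounts),
            "median": statistics.median(amounts),
            "min": min(amounts),
            "max": max(amounts),
        }
        for region, amounts in groups.items()
    }


summary = summarize(sales)
for region, stats in sorted(summary.items()):
    print(region, stats)
```

The summary describes only what has already happened; it makes no statement about future sales.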
Predictive analytics is the branch of data mining concerned with the prediction of future probabilities and trends. The central element of predictive analytics is the predictor, a variable that can be measured for an individual or other entity to predict future behavior. This lesson explains several predictive models for forecasting.
Results and insights from data are often best communicated with charts and graphs. This lesson explains how to build visualizations that clearly show analysis results.
This lesson shows how to combine visualizations and narratives to explain the results of an analytics effort. It focuses on report organization, writing styles, and clear communication. It also addresses ethical issues in reporting.
Machine learning is a method of data analysis that automates analytical model building. Using algorithms that iteratively learn from data, machine learning allows computers to find hidden insights without being explicitly programmed where to look. This lesson introduces some key algorithms of machine learning, including classification and clustering, through visual examples.
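As one example of an algorithm that iteratively learns from data, the sketch below implements a bare-bones k-means clustering loop in plain Python. It is an illustration rather than the lesson's own material: the sample points are made up, the first k points serve as starting centroids so that the run is repeatable, and practical work would normally rely on a library implementation with a better initialization. It requires Python 3.8 or later for math.dist.

```python
import math


def kmeans(points, k, iterations=10):
    """Cluster `points` (tuples of floats) into k groups.

    Returns (labels, centroids). The first k points are used as the
    initial centroids, a simplification chosen for repeatability.
    """
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


# Made-up two-dimensional points forming two visibly separate groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.9), (0.9, 1.1), (4.8, 5.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

No point carries a label in the input; the grouping emerges from the repeated assign and update steps, which is what distinguishes clustering from classification, where labeled examples are given.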
Unit 6 - Python Programming for Data Science
Python is an important programming language for data science. This lesson introduces the Python programming language with a focus on how the language is used to shape and analyze data.
Unit 7 - Data Quality & Governance
Data governance (DG) refers to the overall management of the availability, usability, integrity, and security of the data employed in an enterprise and used in data analytics efforts. This lesson shows how a sound data governance program includes a governing body or council, a defined set of procedures, and a plan to execute those procedures.
To be useful and yield proper analytical results, data must be of sufficient quality. As data volume increases, the question of internal consistency within data becomes significant, regardless of fitness for use for any particular external purpose. This lesson investigates the key dimensions of data quality and data quality assurance.
Data science is applied in a number of industries and domains with varying success and varying ethical conundrums. This lesson takes a look at the profession of data science and its impact on business.
Skills Sidebar I: Basics of Data for Analytics
Skills Sidebar II: Essential Excel Modeling
Covers essential Excel skills necessary for information and data work.