Data drives decision-making in most organizations, which makes it an important asset and a source of competitive advantage. Data comes from many sources and in many formats, but it is rarely in a form that is conducive to analysis. The data scientist must therefore "munge" and "wrangle" the data to suit the desired analytical and visualization goals. In practice that means converting file formats, filling in missing values, converting fields to appropriate data types, and storing the result in a data store suited to its intended purpose. Databases use different architectures to handle different types and volumes of data, so the data scientist must choose the database and storage architecture based on the data and how it will be used.
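As a rough illustration of what wrangling can look like in practice, the following Python sketch normalizes mixed date formats, converts text fields to numbers, and fills a missing value with the mean of the known values. The records, field names, and the choice of mean imputation are all assumptions made for this example; real projects pick a fill strategy based on the data and the analysis.

```python
from datetime import datetime
from statistics import mean

# Hypothetical raw records: mixed date formats, numbers stored as text,
# and one missing temperature reading.
raw_rows = [
    {"date": "2015-06-30", "temp_f": "71.6"},
    {"date": "07/01/2015", "temp_f": ""},
    {"date": "2015-07-02", "temp_f": "75.2"},
]

# Date formats we assume may appear in the source data.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text):
    """Try each known format and return a date; raise ValueError if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def wrangle(rows):
    """Convert fields to proper types and fill missing temperatures with the mean."""
    parsed = [
        {
            "date": parse_date(row["date"]),
            "temp_f": float(row["temp_f"]) if row["temp_f"].strip() else None,
        }
        for row in rows
    ]
    known = [row["temp_f"] for row in parsed if row["temp_f"] is not None]
    fill_value = mean(known) if known else None
    for row in parsed:
        if row["temp_f"] is None:
            row["temp_f"] = fill_value
    return parsed


clean_rows = wrangle(raw_rows)
for row in clean_rows:
    print(row["date"].isoformat(), row["temp_f"])
```

After this step every row has a real date object and a numeric temperature, so the data can be loaded into whichever data store fits its intended use.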
Big Data is a relatively new term that describes data sets that are so large and complex that traditional methods of storing and processing them are not sufficient. The exact size above which a data set is "big" is not clearly defined and depends on the domain, industry, and analytical goals. While there is not a single definition, the general consensus is that Big Data is the integration of large amounts of multiple types of structured and unstructured data into a single data set that can be analyzed to gain insight and new understanding.
Big Data can be understood through the six V's of volume, variety, velocity, veracity, validity, and volatility:
- Volume: enormous amounts of structured and unstructured data
- Variety: multiple data types including documents, images, videos, and time series
- Velocity: flow of data is continuous and increasing
- Veracity: data contains biases, mistakes, noise, and abnormalities
- Validity: data may not be appropriate for intended use
- Volatility: data changes over time and may become stale or invalid
Carrying out a "Big Data" project requires thoughtful planning. The project must have clearly defined objectives and "questions" that need to be answered through analysis. The project plan must also address where the data will come from, the processes for collecting, cleaning, and loading the data, and the infrastructure used to house the data. Finally, the project plan must state how the data is expected to be analyzed, how identifying attributes will be removed from the data, and how personal data will be kept confidential.
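One possible way to handle the confidentiality part of such a plan is to drop direct identifiers and replace record keys with a keyed hash before the data is loaded for analysis. The sketch below assumes hypothetical field names (`name`, `email`, `customer_id`) and a placeholder secret key; it illustrates pseudonymization only, which reduces exposure but is not by itself full anonymization.

```python
import hashlib
import hmac

# Assumption: the real key is kept outside the data set, for example in a
# secrets manager. This placeholder value is for illustration only.
SECRET_KEY = b"replace-with-a-real-secret"

# Hypothetical field names chosen for this example.
DIRECT_IDENTIFIERS = {"name", "email"}  # removed outright
PSEUDONYMIZED_FIELDS = {"customer_id"}  # replaced by a keyed hash


def pseudonymize(value, key=SECRET_KEY):
    """Return a stable HMAC-SHA256 token for a value."""
    return hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(record):
    """Drop direct identifiers and pseudonymize key fields in one record."""
    cleaned = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue
        if field in PSEUDONYMIZED_FIELDS:
            cleaned[field] = pseudonymize(value)
        else:
            cleaned[field] = value
    return cleaned


record = {
    "customer_id": 1042,
    "name": "Ada Example",
    "email": "ada@example.com",
    "purchase_total": 59.90,
}
safe_record = deidentify(record)
print(sorted(safe_record))
```

Because the hash is keyed and deterministic, the same customer maps to the same token across tables, so records can still be joined without exposing the original identifier.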
Data Science is more than just analysis and data visualization. Much preparatory work goes into creating a data set from which new insights can be gained. Data Scientists are often as much Data Engineer as Data Analyst, using different tools and programming languages to wrangle data and obtain new insights. Even once a data project has concluded, models must be continually monitored and refined to ensure the continued validity of the analysis and of the conclusions drawn from it.
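Monitoring can take many forms. A minimal sketch, assuming a regression model whose error was measured when the project concluded, is to compare the error on recent data against that baseline and flag the model for refitting when it degrades. The function names, the sample numbers, and the 1.5x tolerance are illustrative assumptions, not a standard.

```python
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def needs_refit(baseline_mae, actual, predicted, tolerance=1.5):
    """Return (flag, recent_mae); flag is True when recent error exceeds
    the baseline error multiplied by the tolerance factor."""
    recent_mae = mean_absolute_error(actual, predicted)
    return recent_mae > tolerance * baseline_mae, recent_mae


# Hypothetical numbers: error measured at project close, then a recent batch.
baseline_mae = 2.0
recent_actual = [10.0, 12.0, 15.0, 11.0]
recent_predicted = [14.0, 8.0, 19.0, 15.0]

flag, recent_mae = needs_refit(baseline_mae, recent_actual, recent_predicted)
print(flag, recent_mae)
```

In this example every recent prediction is off by 4.0, which is above the allowed 3.0, so the model would be flagged for review.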
Data Science is an evolving discipline, so keeping up to date with trends is important. The web has many useful blogs and sites for aspiring and practicing data scientists. There are also numerous websites that make useful data sets available and review the latest tools. Here is a short list of resources that can help you in your data science career.
- Probability Cheat Sheet - a probability cheat sheet based on a popular course at Harvard. It can be used to study for interviews, as review for courses, or as a refresher for personal enlightenment.
- Data Science FAQ and Resource List - a gigantic list of answers to popular questions, like "What does a data scientist do?"
- Resources for Machine Learning, Statistics and Computer Science
- Kaggle - data science competitions, blogs, resources, and jobs
- Numbeo - database of user contributed data and data sets
- Data Science Weekly Newsletter - a great weekly letter for data scientists on news, tools and book suggestions
- DataTau - like Hacker News, except for data science
- Dato Blog (Dato is a data science startup) - http://blog.dato.com/
- yhat - http://blog.yhathq.com/
- William Chen - datastories.quora.com (Statistics, Industry Data Science, Probability)
- Chris Olah - www.colah.github.io (Deep Learning, Neural Networks)
- Max Song’s Blog - https://medium.com/@pericarus/ (Perseverance and Grit in Data Science)
- Carl Shan’s Blog - www.carlshan.com (Data Science for Social Good)
- Kang, Martha. Exploring the 7 Different Types of Data Stories. June 15, 2015.
- Lorica, Ben. Why Data Preparation Frameworks Rely on Human-In-The-Loop Systems. July 2, 2015.
- Newman, Riley. How We Scaled Data Science to all Sides of AirBnB over 5 Years of Hyper Growth. June 30, 2015.