Data Science: Methodology

These are the notes I took on the Data Science Methodology course on cognitiveclass.ai!

MODULE 1: FROM PROBLEM TO APPROACH

Business Understanding

Spend time to seek clarification to attain a business understanding
Establish a clearly defined question
Determine a goal (e.g. reduce costs – is the goal to improve efficiency or profitability?)
Figure out objectives to support goal
Break down objectives
Structure how to tackle problem
Involve the key business sponsors so they can:
Set overall direction
Remain engaged and provide guidance
Ensure necessary support

Analytic Approach

Selecting the approach depends on the question being asked
Seek clarification
Select approach in the context of business requirements
Pick analytic approach based on type of question
Descriptive
Current status
Show relationships
Diagnostic (statistical analysis)
What happened?
Why is this happening?
Problems that require counts call for statistical analysis (yes/no answer → classification approach)
Predictive (forecasting)
What if these trends continue?
What will happen next?
Prescriptive
How do we solve it?
Machine learning
Learning without being explicitly programmed
Identifies relationships and trends in data that may not be accessible or identified otherwise
Uses clustering and association approaches
Learning about human behaviour

Case study – decision tree classification

Lab – From Problem to Approach

Business understanding stage is important because it helps clarify the goal of the entity asking the question.
Outstanding features of the data science methodology flowchart:
The flowchart is highly iterative
The flowchart never ends.
Analytic approach stage is important because it helps identify what type of patterns will be needed to address the question most effectively.

Decision trees
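
A decision tree classifier answers a yes/no question by repeatedly splitting the data on the value that best separates the classes. Below is a minimal sketch of a single split (a "stump") scored with Gini impurity, in plain Python. The data, the age feature, and the function names are made up for illustration and are not taken from the course.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_impurity) of the best 'value <= threshold' split."""
    best = (None, gini(labels))  # baseline: impurity of the unsplit node
    n = len(values)
    # Skip the largest value: splitting there would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Hypothetical data: a numeric feature and a yes/no outcome.
ages = [25, 32, 41, 58, 63, 70]
outcome = ["no", "no", "no", "yes", "yes", "yes"]

threshold, impurity = best_split(ages, outcome)
print(f"split at age <= {threshold}, weighted Gini = {impurity:.2f}")
```

A full tree would apply the same search again to each side of the split until the groups are pure enough or too small to split further.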

MODULE 2: FROM REQUIREMENTS TO SELECTION

Data Requirements

What is required, how to source or collect it, how to understand or work with it, how to prepare data to meet the desired outcome
Define data requirements for decision tree classification:
Identify the necessary data content, formats, and sources for initial data collection
Content, formats, representations suitable for decision tree classifier
Think ahead and consider future stages, as requirements may affect preparation

Data Collection

After initial data collection is performed, an assessment by the data scientist takes place to determine if requirements are met (have what is needed)
Data requirements are revised and decisions are made re: collect more or less data
Descriptive statistics and visualization can be applied to the data set to assess the content, quality, and initial insights about data
Gaps will be identified → plans to fill gaps or make substitutions
Data collection requires knowledge of the source or where to find needed data elements
It is acceptable to defer decisions about unavailable data and attempt to acquire it at a later stage (e.g. after getting intermediate results from predictive modeling)
DBAs and programmers work together to extract data from various sources and merge it, removing redundant data
Move on to data understanding
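
The merge and gap-check steps above can be sketched in plain Python. This is only an illustration: the field names, the two sources, and the helper functions are hypothetical, and a real project would more likely do this in SQL or with a data-frame library.

```python
REQUIRED_FIELDS = ["patient_id", "age", "diagnosis"]  # hypothetical data requirements

def merge_sources(*sources, key):
    """Merge records from several sources, keeping the first record seen per key."""
    merged = {}
    for source in sources:
        for record in source:
            merged.setdefault(record[key], record)
    return list(merged.values())

def find_gaps(records, required_fields):
    """Count, per required field, how many records lack a usable value."""
    gaps = {field: 0 for field in required_fields}
    for record in records:
        for field in required_fields:
            if record.get(field) in (None, ""):
                gaps[field] += 1
    return gaps

# Two hypothetical extracts that overlap on patient_id 2.
clinic = [
    {"patient_id": 1, "age": 54, "diagnosis": "A"},
    {"patient_id": 2, "age": None, "diagnosis": "B"},
]
billing = [
    {"patient_id": 2, "age": 61, "diagnosis": "B"},
    {"patient_id": 3, "age": 47, "diagnosis": ""},
]

records = merge_sources(clinic, billing, key="patient_id")
gaps = find_gaps(records, REQUIRED_FIELDS)
print(len(records), "records after merging")
print("missing values per field:", gaps)
```

Keeping the first record per key is the simplest rule for removing redundant data, but it is a design choice: here it keeps the clinic row for patient 2 even though the billing row has the age filled in. The gap report is what prompts the decision to fill gaps, substitute, or defer.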

Case study – from requirements to collection in Python and R

Once data collection is complete, descriptive statistics and visualization techniques are used to better understand data. Data is explored to assess its content and quality and to gain initial insights.

MODULE 3: FROM UNDERSTANDING TO PREPARATION

Data Understanding

Encompasses all activities related to constructing the data set
Answers the question: is the collected data representative of the problem to be solved?
Prepare/clean the data
Run statistics against the data columns (variables in this model)
Stats include univariate statistics on each variable (mean, median, min, max, stdev)
Pairwise correlations used
How closely certain variables are related
If highly correlated, then redundant
Histograms of variables examined to understand the distributions
Determine which sort of data preparation may be needed to make the variable more useful in a model (e.g. consolidation)
Data quality
Missing values
Invalid or misleading values
Use stats, univariates, histograms
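
A small sketch of the univariate statistics, pairwise correlation, and histogram checks listed above, using only the Python standard library. The two columns are hypothetical and were chosen so that one is an exact rescaling of the other, which is the "highly correlated, then redundant" case.

```python
import math
import statistics

def univariate_summary(values):
    """Basic statistics for one numeric variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length variables."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of its bin."""
    counts = {}
    for v in values:
        edge = (v // bin_width) * bin_width
        counts[edge] = counts.get(edge, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical columns: height in cm and the same height in inches (redundant).
height_cm = [150, 160, 165, 170, 180, 185]
height_in = [h / 2.54 for h in height_cm]

print(univariate_summary(height_cm))
print("correlation:", round(pearson(height_cm, height_in), 3))
print("histogram:", histogram(height_cm, 10))
```

A correlation close to 1 or -1 suggests one of the two variables can be dropped; the histogram shows where values cluster and whether bins should be consolidated.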

Data Preparation

Unwanted elements are removed
Together with data collection and understanding, data preparation is the most time-consuming phase of a data science project (70-90% of project time)
Automation can reduce this to about 50% of project time
Transforming data in the preparation phase makes the data easier to work with
Must address missing or invalid values, remove duplicates, format properly
Feature engineering – use domain knowledge of the data to create features that make the machine learning algorithms work
Critical when machine learning tools are being applied to analyze the data
Feature is a characteristic that may help when solving a problem
Text analysis requires steps for coding the data so that it can be manipulated
What is needed within dataset to address the question
Ensure proper groupings
Ensure programming is not overlooking what is hidden within the data
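
The preparation steps above (format properly, remove duplicates, address missing or invalid values, engineer a feature) might look like this in plain Python. The records, the column names, and the age cutoff are all assumptions made for the example.

```python
import statistics

def prepare(records):
    """Format, deduplicate, fill missing or invalid ages, and add one feature."""
    # 1. Format properly: trim whitespace and use consistent capitalization.
    rows = [dict(r, city=r["city"].strip().title()) for r in records]

    # 2. Remove duplicates while keeping the original order.
    seen, unique = set(), []
    for r in rows:
        key = (r["id"], r["age"], r["city"])
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # 3. Treat None and negative ages as missing/invalid; fill with the median.
    valid_ages = [r["age"] for r in unique if r["age"] is not None and r["age"] >= 0]
    median_age = statistics.median(valid_ages)
    for r in unique:
        if r["age"] is None or r["age"] < 0:
            r["age"] = median_age

    # 4. Feature engineering: derive a yes/no feature (the cutoff of 65 is an assumption).
    for r in unique:
        r["is_senior"] = r["age"] >= 65
    return unique

raw = [
    {"id": 1, "age": 70, "city": " toronto "},
    {"id": 1, "age": 70, "city": " toronto "},   # duplicate row
    {"id": 2, "age": None, "city": "OTTAWA"},     # missing value
    {"id": 3, "age": -5, "city": "Montreal"},     # invalid value
    {"id": 4, "age": 40, "city": "ottawa"},
]

clean = prepare(raw)
for row in clean:
    print(row)
```

Formatting runs before deduplication so that rows differing only in whitespace or capitalization are still recognized as duplicates. Filling with the median is one common choice; dropping the rows or flagging them is equally valid depending on the question.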

MODULE 4: FROM MODELLING TO EVALUATION

Modelling

Training set is used for predictive modelling
Acts as a gauge to determine if the model needs to be calibrated
Try out different algorithms to ensure that the variables in play are actually required
Understanding the question
1. Understand the data
2. Appropriate analytical approach
3. Obtain, understand, prepare data
Data supports answering the question
Constant refinement necessary within each step
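
A rough sketch of how a training set is used to pick between candidate models, with a held-out test set kept aside for evaluation. The data is synthetic, and three simple threshold rules stand in for "different algorithms"; none of this comes from the course itself.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle rows reproducibly and split them into (training, testing) lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(threshold, rows):
    """Share of rows where 'value >= threshold' matches the yes/no label."""
    correct = sum((value >= threshold) == label for value, label in rows)
    return correct / len(rows)

# Hypothetical data: (risk score, outcome) pairs where the true cutoff is 50.
data = [(s, s >= 50) for s in range(0, 100, 5)]
train, test = train_test_split(data)

# Candidate models: threshold rules standing in for different algorithms.
candidates = [50, 30, 70]
best = max(candidates, key=lambda t: accuracy(t, train))

print("chosen threshold:", best)
print("accuracy on held-out test set:", accuracy(best, test))
```

The model is chosen using the training rows only; the test rows are touched once, at the end, so the reported accuracy is not inflated by the selection step.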

Evaluation

1. Diagnostic measures
Ensure the model is working as intended
Adjustments required?
Predictive – use decision tree to evaluate if the answer is aligned with the initial design
Descriptive – testing set with known outcomes can be applied and model is refined as needed
2. Statistical significance testing
Ensure data is properly handled and interpreted within the model
Avoid unnecessary second guessing when the answer is revealed
ROC curve – receiver operating characteristic curve

Diagnostic tool
Determine the optimal classification model
Quantifies how well a binary classification model performs
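Plots the true-positive rate against the false-positive rate, showing how well the yes and no outcomes are classified as some discrimination criterion (the decision threshold) is varied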
Optimal model at maximum separation
Confusion matrix
Summary of how well the categories are classified
Sheds light on what may be confused with a different category
Stats:
Type I error: false-positive
Type II error: false-negative
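
The confusion matrix and ROC curve can be computed by hand for a small example. A minimal sketch in plain Python, with made-up scores and outcomes; it assumes both yes and no cases are present, otherwise the rates would divide by zero.

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for boolean labels."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p and a:
            counts["TP"] += 1
        elif p and not a:
            counts["FP"] += 1   # Type I error: false positive
        elif not p and a:
            counts["FN"] += 1   # Type II error: false negative
        else:
            counts["TN"] += 1
    return counts

def roc_points(actual, scores):
    """(false-positive rate, true-positive rate) for each score threshold."""
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted "yes"
    for threshold in sorted(set(scores), reverse=True):
        cm = confusion_matrix(actual, [s >= threshold for s in scores])
        tpr = cm["TP"] / (cm["TP"] + cm["FN"])
        fpr = cm["FP"] / (cm["FP"] + cm["TN"])
        points.append((fpr, tpr))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical model scores and true outcomes.
actual = [True, True, False, True, False, False]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(confusion_matrix(actual, [s >= 0.5 for s in scores]))
print(roc_points(actual, scores))
print("AUC:", round(auc(roc_points(actual, scores)), 3))
```

Each threshold gives one confusion matrix and therefore one point on the curve. A curve that bends toward the top-left corner (high true-positive rate at a low false-positive rate) indicates better separation, and the area under it summarizes that in one number.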

MODULE 5: FROM DEPLOYMENT TO FEEDBACK

Deployment

Make answer relevant by getting the stakeholders familiar with the tool produced
Solution owner, marketing, app developers, IT admin
Once the model is evaluated, it is deployed
Option: limited test group in a test environment
Feedback and refinement over time

Feedback

Feedback helps refine the model and assess it for performance and impact
Value of the model is dependent on incorporating feedback and adjusting accordingly
Ultimate test: actual real-time use in the field
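
One simple way to act on feedback is to compare the model's accuracy on each batch of real-world results against the accuracy measured at evaluation time, and flag batches that fall short. This is only a sketch: the function name, the tolerance, and the monthly batches are hypothetical.

```python
def needs_refinement(baseline_accuracy, feedback_batches, tolerance=0.05):
    """Return indexes of batches whose accuracy fell below baseline - tolerance."""
    flagged = []
    for i, batch in enumerate(feedback_batches):
        correct = sum(predicted == actual for predicted, actual in batch)
        if correct / len(batch) < baseline_accuracy - tolerance:
            flagged.append(i)
    return flagged

# Hypothetical (predicted, actual) pairs collected after deployment, one list per month.
month_1 = [(True, True), (False, False), (True, True), (False, False)]   # 4 of 4 correct
month_2 = [(True, True), (False, True), (True, False), (False, False)]   # 2 of 4 correct

print(needs_refinement(0.9, [month_1, month_2]))
```

A flagged batch is a signal to go back to earlier stages (data understanding, preparation, modelling), which is the iterative loop the methodology describes.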