Data Science: Methodology

These are the notes I took on the Data Science Methodology course on cognitiveclass.ai!

MODULE 1: FROM PROBLEM TO APPROACH

Business Understanding

  1. Spend time to seek clarification to attain a business understanding

  2. Establish a clearly defined question

  3. Determine a goal (e.g. reduce costs – is the goal to improve efficiency or profitability?)

  4. Figure out objectives to support goal

  5. Break down objectives

  6. Structure how to tackle problem

  7. Involve the key business sponsors so they can set overall direction, remain engaged and provide guidance, and ensure necessary support

Analytic Approach

  1. Selecting the approach depends on the question being asked

  2. Seek clarification

  3. Select approach in the context of business requirements

  4. Pick analytic approach based on type of question:

  5. Descriptive: current status; show relationships

  6. Diagnostic (statistical analysis): What happened? Why is this happening?

  7. Problems that require counts (yes/no answer → classification approach)

  8. Predictive (forecasting): What if these trends continue? What will happen next?

  9. Prescriptive: How do we solve it?

  10. Machine learning: learning without being explicitly programmed

  11. Identifies relationships and trends in data that may not be accessible or identified otherwise

  12. Uses clustering and association approaches

  13. Learning about human behaviour

Case study – decision tree classification

  1. Predictive model

  2. Decision tree classification

  3. Categorical outcome

  4. Explicit decision path showing conditions leading to high risk

  5. Likelihood of classified outcome along with the predicted outcome

  6. Easy to understand and apply
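
The points above (a categorical outcome, an explicit decision path, and a likelihood reported with the prediction) can be sketched with a tiny hand-built tree. This is only an illustration: the feature names, thresholds, and leaf likelihoods below are hypothetical and are not taken from the course case study.

```python
# Hypothetical hand-built tree: features, thresholds and leaf likelihoods
# are invented for illustration only.
TREE = {
    "feature": "prior_admissions",
    "threshold": 2,
    "low": {"outcome": "low risk", "likelihood": 0.85},
    "high": {
        "feature": "age",
        "threshold": 65,
        "low": {"outcome": "low risk", "likelihood": 0.60},
        "high": {"outcome": "high risk", "likelihood": 0.90},
    },
}


def classify(record, node=TREE, path=()):
    """Walk the tree; return (outcome, likelihood, conditions followed)."""
    if "outcome" in node:  # reached a leaf
        return node["outcome"], node["likelihood"], list(path)
    feature, threshold = node["feature"], node["threshold"]
    if record[feature] >= threshold:
        step, branch = f"{feature} >= {threshold}", "high"
    else:
        step, branch = f"{feature} < {threshold}", "low"
    return classify(record, node[branch], path + (step,))


outcome, likelihood, path = classify({"prior_admissions": 3, "age": 70})
```

Here `path` holds the conditions that led to the prediction, which is what makes this kind of model easy to explain to a business sponsor.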

Lab – From Problem to Approach

  1. Business understanding stage is important because it helps clarify the goal of the entity asking the question.

  2. Outstanding features of the data science methodology diagram: the flowchart is highly iterative, and it never ends.

  3. Analytic approach stage is important because it helps identify what type of patterns will be needed to address the question most effectively.

Decision trees

  1. Built using recursive partitioning to classify data

  2. Use the most predictive feature to partition data

  3. Predictiveness is based on the decrease in entropy or impurity (i.e. the gain in information) that a split produces

  4. Tree stops growing at a node when the node is pure or nearly pure, no variables remain, or the tree has reached a (preselected) size limit
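
A minimal sketch of how "most predictive feature" can be measured, assuming Shannon entropy as the impurity measure (other measures such as Gini impurity are also common). The records and field names are hypothetical toy data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature, target):
    """Decrease in entropy of `target` after partitioning `rows` on `feature`."""
    parent = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[feature], []).append(r[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - weighted


def most_predictive_feature(rows, features, target):
    """Feature whose partition yields the largest information gain."""
    return max(features, key=lambda f: information_gain(rows, f, target))


# Hypothetical toy records, invented for illustration only.
patients = [
    {"prior_admission": "yes", "smoker": "yes", "risk": "high"},
    {"prior_admission": "yes", "smoker": "no", "risk": "high"},
    {"prior_admission": "no", "smoker": "yes", "risk": "low"},
    {"prior_admission": "no", "smoker": "no", "risk": "low"},
]

best = most_predictive_feature(patients, ["prior_admission", "smoker"], "risk")
```

In this toy data, splitting on `prior_admission` gives two pure groups, so a tree would stop growing there, while splitting on `smoker` leaves both groups as mixed as before.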

MODULE 2: FROM REQUIREMENTS TO SELECTION

Data Requirements

  1. What is required, how to source or collect it, how to understand or work with it, how to prepare data to meet the desired outcome

  2. Define data requirements for decision tree classification:

  3. Identify the necessary data content, formats, and sources for initial data collection

  4. Content, formats, representations suitable for decision tree classifier

  5. Think ahead and consider future stages, as requirements may affect preparation

Data Collection

  1. After initial data collection is performed, an assessment by the data scientist takes place to determine if requirements are met (have what is needed)

  2. Data requirements are revised and decisions are made about whether to collect more or less data

  3. Descriptive statistics and visualization can be applied to the data set to assess the content, quality, and initial insights about data

  4. Gaps will be identified → plans to fill gaps or make substitutions

  5. Data collection requires knowledge of the source or where to find needed data elements

  6. It is acceptable to defer decisions about unavailable data and attempt to acquire it at a later stage (e.g. after getting intermediate results from predictive modeling)

  7. DBAs and programmers work together to extract data from various sources, merge it, and remove redundant data

  8. Move on to data understanding
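
The merge, de-duplicate, and gap-finding steps above can be sketched in plain Python. This is an assumption-heavy illustration: the two "sources", the `patient_id` key, and the helper names `merge_sources` and `find_gaps` are all hypothetical.

```python
# Hypothetical extracts from two sources; all field names are invented.
clinic_rows = [
    {"patient_id": 1, "age": 70},
    {"patient_id": 2, "age": 54},
    {"patient_id": 2, "age": 54},  # redundant copy of the same record
]
pharmacy_rows = [
    {"patient_id": 1, "prescriptions": 4},
    {"patient_id": 3, "prescriptions": 1},
]


def merge_sources(*sources, key="patient_id"):
    """Combine rows that share `key`; exact repeats collapse into one row."""
    merged = {}
    for rows in sources:
        for row in rows:
            merged.setdefault(row[key], {}).update(row)
    return [merged[k] for k in sorted(merged)]


def find_gaps(rows, required, key="patient_id"):
    """Map each key value to the required fields its row is missing."""
    gaps = {}
    for row in rows:
        missing = [field for field in required if field not in row]
        if missing:
            gaps[row[key]] = missing
    return gaps


dataset = merge_sources(clinic_rows, pharmacy_rows)
gaps = find_gaps(dataset, required=["age", "prescriptions"])
```

The `gaps` result is the kind of list that feeds the decision to fill a gap, substitute another field, or defer until later.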

Case study – from requirements to collection in Python and R

Web scraping of online food recipes: http://yongyeol.com/papers/ahn-flavornet-2011.pdf

Once data collection is complete, descriptive statistics and visualization techniques are used to better understand data. Data is explored to:

  1. Understand its content

  2. Assess its quality

  3. Discover any interesting preliminary insights

  4. Determine the need for additional data

MODULE 3: FROM UNDERSTANDING TO PREPARATION

Data Understanding

  1. Encompasses all activities related to constructing the data set

  2. Answers the question: is the collected data representative of the problem to be solved?

  3. Prepare/clean the data

  4. Run statistics against the data columns (variables in this model)

  5. Stats include Hearst, univariates, statistics on each variable (mean, median, min, max, stdev)

  6. Pairwise correlations used

  7. How closely certain variables are related

  8. If highly correlated, then redundant

  9. Histograms of variables examined to understand the distributions

  10. Determine which sort of data preparation may be needed to make the variable more useful in a model (e.g. consolidation)

  11. Data quality

  12. Missing values

  13. Invalid or misleading values

  14. Use stats, univariates, histograms
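
A minimal sketch of the checks listed above (univariate statistics, missing-value counts, and a pairwise correlation), using only the Python standard library. The column names and values are hypothetical, and `None` is assumed to mark a missing value.

```python
import statistics

# Hypothetical columns, invented for illustration; None marks a missing value.
columns = {
    "age": [34, 51, None, 67, 45],
    "visits": [1, 3, 2, 5, 2],
    "visits_copy": [1, 3, 2, 5, 2],  # deliberately redundant variable
}


def univariate(values):
    """Basic statistics for one variable, ignoring missing values."""
    present = [v for v in values if v is not None]
    return {
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "min": min(present),
        "max": max(present),
        "stdev": statistics.stdev(present),
    }


def pearson(xs, ys):
    """Pearson correlation over rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    mx = statistics.mean([x for x, _ in pairs])
    my = statistics.mean([y for _, y in pairs])
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = sum((x - mx) ** 2 for x, _ in pairs) ** 0.5
    sy = sum((y - my) ** 2 for _, y in pairs) ** 0.5
    return cov / (sx * sy)


summary = {name: univariate(vals) for name, vals in columns.items()}
r = pearson(columns["visits"], columns["visits_copy"])
```

A correlation close to 1 between two variables, as with `visits` and `visits_copy` here, is the signal that one of them is redundant.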

Data Preparation

  1. Unwanted elements are removed

  2. Together with data collection and understanding, data preparation is the most time-consuming phase of a data science project (70-90% of project time)

  3. Automation can reduce this to about 50% of project time

  4. Transforming data in the preparation phase makes the data easier to work with

  5. Must address missing or invalid values, remove duplicates, format properly

  6. Feature engineering – use domain knowledge of the data to create features that make the machine learning algorithms work

  7. Critical when machine learning tools are being applied to analyze the data

  8. Feature is a characteristic that may help when solving a problem

  9. Text analysis steps for coding the data are required before the data can be manipulated

  10. What is needed within dataset to address the question

  11. Ensure proper groupings

  12. Ensure programming is not overlooking what is hidden within the data
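
The preparation steps above (fix formatting, handle missing or invalid values, remove duplicates, engineer a feature) can be sketched as follows. The records, the rule that a negative age is invalid, and the `frequent_visitor` feature are hypothetical choices made for illustration.

```python
# Hypothetical raw records; field names and values are invented.
raw = [
    {"name": " Alice ", "age": "34", "visits": 3},
    {"name": "alice", "age": "34", "visits": 3},  # duplicate once formatted
    {"name": "Bob", "age": "-1", "visits": 7},    # invalid age
    {"name": "Cara", "age": None, "visits": 0},   # missing age
]


def clean(record):
    """Normalise the name and turn missing or invalid ages into None."""
    age = record["age"]
    age = int(age) if age is not None and int(age) >= 0 else None
    name = record["name"].strip().title()
    return {"name": name, "age": age, "visits": record["visits"]}


def prepare(records):
    """Clean records, drop duplicates, and add an engineered feature."""
    seen, prepared = set(), []
    for rec in map(clean, records):
        key = (rec["name"], rec["age"], rec["visits"])
        if key in seen:
            continue  # duplicate of a record already kept
        seen.add(key)
        # Engineered feature (assumed threshold of 5 visits).
        rec["frequent_visitor"] = rec["visits"] >= 5
        prepared.append(rec)
    return prepared


prepared = prepare(raw)
```

Note that the duplicate only becomes visible after formatting is fixed, which is one reason the cleaning order matters.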

MODULE 4: FROM MODELLING TO EVALUATION

Modelling

  1. Descriptive or predictive models

  2. Descriptive: if a person did x, they are likely to prefer y

  3. Predictive: yields yes/no or stop/go outcomes

  4. Models are based on the analytic approach (statistically driven or ML driven)

  5. Training set is used for predictive modelling

  6. Acts as a gauge to determine if the model needs to be calibrated

  7. Try out different algorithms to ensure that the variables in play are actually required

  8. Understanding the question: (1) understand the data, (2) choose an appropriate analytical approach, (3) obtain, understand, and prepare data

  9. Data supports answering the question

  10. Constant refinement necessary within each step
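
A minimal sketch of setting aside a training set, assuming the common practice of also holding out a testing set with known outcomes for later evaluation. The function name, the 25% test fraction, and the seed are hypothetical choices.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of `rows` and split it into (training, testing) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(20))  # stand-in for prepared records
train, test = train_test_split(rows)
```

Fixing the seed keeps the split stable while different algorithms are tried, so differences in results come from the models and not from the data each one happened to see.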

Evaluation

  1. Evaluation is done during model development and before it is deployed

  2. Assess quality

  3. Ensure it meets the initial request → if not, then adjust

  4. Diagnostic measures: ensure the model is working as intended and check whether adjustments are required

  5. Predictive – use decision tree to evaluate if the answer is aligned with the initial design

  6. Descriptive – testing set with known outcomes can be applied and model is refined as needed

  7. Statistical significance testing: ensure data is properly handled and interpreted within the model

  8. Avoid unnecessary second guessing when the answer is revealed

  9. ROC curve (receiver operating characteristic curve): a diagnostic tool used to determine the optimal classification model

  10. Quantifies how well a binary classification model performs

  11. Plots the true-positive rate against the false-positive rate as some discrimination criterion (threshold) is varied

  12. Optimal model at maximum separation

  13. Confusion matrix: summary of how well the categories are classified

  14. Sheds light on what may be confused with a different category

  15. Type I error: false positive

  16. Type II error: false negative
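
A minimal sketch of a confusion matrix and a few ROC points for a binary classifier, written in plain Python. The outcomes, scores, and thresholds are hypothetical; a record is predicted positive when its score is at or above the threshold.

```python
def confusion_matrix(actual, predicted):
    """Count TP, FP (Type I), FN (Type II) and TN for boolean labels."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p and a:
            counts["TP"] += 1
        elif p and not a:
            counts["FP"] += 1  # Type I error: false positive
        elif a:
            counts["FN"] += 1  # Type II error: false negative
        else:
            counts["TN"] += 1
    return counts


def roc_points(actual, scores, thresholds):
    """(false-positive rate, true-positive rate) at each threshold."""
    points = []
    for t in thresholds:
        cm = confusion_matrix(actual, [s >= t for s in scores])
        tpr = cm["TP"] / (cm["TP"] + cm["FN"])
        fpr = cm["FP"] / (cm["FP"] + cm["TN"])
        points.append((fpr, tpr))
    return points


# Hypothetical outcomes and model scores, invented for illustration.
actual = [True, True, True, False, False, False]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

cm = confusion_matrix(actual, [s >= 0.5 for s in scores])
curve = roc_points(actual, scores, thresholds=[0.0, 0.5, 1.0])
```

Each threshold gives one point on the ROC curve; the further a point sits toward a high true-positive rate and a low false-positive rate, the better the model separates the yes and no outcomes at that threshold.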

MODULE 5: FROM DEPLOYMENT TO FEEDBACK

Deployment

  1. Make answer relevant by getting the stakeholders familiar with the tool produced

  2. Solution owner, marketing, app developers, IT admin

  3. Once the model is evaluated, it is deployed

  4. Option: limited test group in a test environment

  5. Feedback and refinement over time

Feedback

  1. Feedback helps refine the model and assess it for performance and impact

  2. Value of the model is dependent on incorporating feedback and adjusting accordingly

  3. Ultimate test: actual real-time use in the field


©2019 by busybree.