What is Data Science?
1.11 Questions in data science
Questions regarding “the data”
Data Provenance
What, literally, is “the data”?
What is its provenance?
- Who measured it?
- How did they measure it?
- Which measurement instrument?
- Which model?
- Which software and which version?
- Where did it come from?
- When was it measured?
- How precise and reproducible were the measurement devices?
- Why was it originally measured?
- Who paid for it to be collected and/or analyzed?
- Did they make any implicit or explicit assumptions?
- Are there any potentially confounding variables that merit attention?
- What have they, or others, done with the data?
- Do we agree with these results?
- Are they relevant to our goals?
- What other questions can this data reasonably answer?
- Are there suspicious and/or uninformative observations and/or variables?
- How would I know?
- Given my previous knowledge about the topic, what would I expect to see?
- Can a mathematical relationship between variables be defined?
- Does it have any real-world value? Should I care?
- Do one or more variables predict another?
- Are they actually relevant, or is it just due to chance? Do I care?
- Why would I want to do that? Does it make real-world sense?
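One way to approach the questions about whether one variable predicts another, or whether the apparent relationship is just due to chance, is a permutation test. The sketch below is only an illustration under stated assumptions: the data (`hours`, `score`) are made up, the helper names `pearson_r` and `permutation_p_value` are hypothetical, and Pearson correlation is just one possible measure of association. It shuffles one variable many times and counts how often a correlation at least as strong as the observed one appears by chance.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Estimate how often shuffled data is at least as correlated as the real data."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real link between xs and ys
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_permutations + 1)

# Made-up illustrative data, not taken from any real study.
hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
score = [52, 55, 61, 58, 66, 70, 69, 75, 80, 83]

r = pearson_r(hours, score)
p = permutation_p_value(hours, score)
print(f"r = {r:.3f}, permutation p-value = {p:.4f}")
```

A small p-value suggests the association is unlikely to be chance alone. It does not say whether the relationship makes real-world sense or is free of confounding, which is why the remaining questions in the list still apply.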
Data Validation
- Is this raw data? How could I tell if it isn’t?
- Are there any signs of data munging (processing)?
- Are there any signs of suspicious data manipulation?
- Does it matter?
- What is the best way to describe each variable?
- Are there interesting clusters in the observations?
- Is that a relevant question?
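Some of the validation questions above can be given a rough first pass in code. The following is a minimal sketch, assuming the data arrives as a list of row tuples with `None` marking missing values. The records and the helper names `describe_variable` and `count_duplicate_rows` are hypothetical. It summarizes each variable and counts exact duplicate rows, which can be one sign of earlier processing.

```python
from statistics import mean, median

def describe_variable(values):
    """Basic summary of one numeric variable; None counts as missing.

    Assumes at least one non-missing value is present.
    """
    present = [v for v in values if v is not None]
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "median": median(present),
    }

def count_duplicate_rows(rows):
    """Count rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(row)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates

# Hypothetical records: (age in years, weight in kg).
rows = [(34, 70.5), (29, 64.0), (34, 70.5), (41, None), (150, 82.3)]
ages = [row[0] for row in rows]
weights = [row[1] for row in rows]

age_summary = describe_variable(ages)
weight_summary = describe_variable(weights)
n_dupes = count_duplicate_rows(rows)

print(age_summary)
print(weight_summary)
print("duplicate rows:", n_dupes)
```

In this made-up example the summary surfaces an implausible maximum age, a missing weight, and one repeated row. Whether any of these matters depends on the provenance of the data and the goal of the analysis.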
Analytical Questions