26 Data Basics
26.1 Variable Classes
I use the term variable class to refer to two broad types of variables:
 | Independent | Dependent |
---|---|---|
Description | Fixed parameters studied by the experimenter | What you have measured. They change according to the state of the independent variables. |
Example | Genotypes, developmental stages, cell lines, growth conditions | Expression levels, blood pressure, intensity, presence/absence of a substance |
Typical axis | \(x\) | \(y\) |
Not every plot will have an independent and a dependent variable. Many plots only contain one variable, or multiple dependent variables. Although the typical orientation is y as a function of x, in which case we plot the dependent variable on the y axis, the orientation may change in specific cases (see XX).
26.2 Data Classes
Broadly speaking there are two ways in which we can describe a data’s class.1
 | Continuous | Categorical |
---|---|---|
AKA | Quantitative | Qualitative, discrete, factor |
Description | Can take any numeric value within a range | Distinct groups that differ in qualities |
Example | Weight, continuous time, gene expression | Location, genotype, time point |
A data’s class is malleable: it can change depending on how we understand it. For example, we can break up a continuous variable, like p-values for many tests, into those below and above a certain threshold.
Sometimes we convert the type when plotting. For example, we transform a continuous variable into a discrete variable with artificial categories when we apply a binning statistic for histograms.
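The two conversions described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the p-values are made-up numbers, the 0.05 threshold and the bin edges are arbitrary choices, and `bin_label` is a hypothetical helper, not part of any plotting library.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical p-values from many tests (made-up numbers for illustration).
p_values = [0.001, 0.03, 0.2, 0.049, 0.5, 0.8, 0.012]

# Continuous -> binary: split at an assumed threshold of 0.05.
threshold = 0.05
significance = ["below" if p < threshold else "above" for p in p_values]

# Continuous -> ordered bins, similar to what a histogram's binning does.
# The edges are an arbitrary choice; each bin is closed on the left: [lo, hi).
edges = [0.0, 0.25, 0.5, 0.75, 1.0]


def bin_label(value):
    """Return the label of the bin that contains value."""
    # bisect_right counts the edges <= value; subtract 1 to get the bin index.
    # min() places a value equal to the last edge into the last bin.
    i = min(bisect_right(edges, value) - 1, len(edges) - 2)
    return f"[{edges[i]}, {edges[i + 1]})"


bin_counts = Counter(bin_label(p) for p in p_values)

print(Counter(significance))
print(bin_counts)
```

After either conversion the variable behaves like a categorical one: it can be counted per group, but the exact values within each group are no longer visible.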
There are three typical scales of categorical variables: binary, nominal and ordinal. Binary is the most basic type of data we can have. The scales are defined according to the properties in the following table:
Table: Types of Categorical Scales.
Scale | Ordered | Quantitative | Number of levels/groups | Example |
---|---|---|---|---|
Binary | - | - | \(2\) | Status: Present, Absent |
Nominal | - | - | \(>2\) | Location: Berlin, Paris, London |
Ordinal | Y | - | \(\geq{2}\) | Severity: low, medium, high |
Interval | Y | Y | \(\geq{2}\) | Time: 0h, 24h, 48h, 96h |
Traditionally, interval does not refer to categorical variables. Rather, it is used to distinguish continuous variables that do not have a natural zero from those that do, which are termed ratio variables. Although this is an interesting distinction for data analysis, I find it useful, in the context of data visualization, to use the term for categorical variables that are quantitative in addition to ordered. That is, an axis may be ordinal, but does the distance between the categories, or even the size of each category, contain information as well? See example XX, below.
26.3 Descriptive Statistics for Continuous Variables
For continuous variables, the choice of descriptive statistics depends on the distribution of the data.2
26.3.1 Measurements for the location of a data set:
Metric | Equation | Description | Example |
---|---|---|---|
Mode | | The most common value of a data set | |
Median | | The second quartile, or central value, of a data set | |
Mean | \(\overline{x}=\frac{\sum_{i=1}^{n}x_{i}}{n}\) | The arithmetic mean | |
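As a minimal sketch of the three location measures in the table, the following uses Python's standard `statistics` module. The sample is made up for illustration and deliberately contains one large value, so that the measures disagree.

```python
import statistics

# Made-up sample with one large value, chosen so the three measures differ.
sample = [2, 3, 3, 4, 5, 6, 40]

mode = statistics.mode(sample)      # the most common value
median = statistics.median(sample)  # the central value (second quartile)
mean = statistics.mean(sample)      # the arithmetic mean: sum(x_i) / n

print(mode, median, mean)
```

Here the single large observation pulls the mean up to 9, while the median stays at 4 and the mode at 3.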
26.3.2 Measurements for spread:
Metric | Equation | Description | Example |
---|---|---|---|
Sample Variance (\(s^{2}\)) | \(\frac{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}{n-1}\) | ||
Sample Standard Deviation (\(s\)) | \(\sqrt{s^{2}}\) | ||
Range | \(Q_4-Q_0\) | The maximum value minus the minimum value (Q4-Q0). | |
Interquartile Range (IQR) | \(Q_3-Q_1\) | The width of the interval that contains the middle 50% of the sample. | |
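The spread measures in the table can be computed along the same lines. This is a sketch on a made-up sample. Note that several conventions exist for computing quartiles; the `"inclusive"` method of `statistics.quantiles` is used here as an assumption, and other software may return slightly different values for \(Q_1\) and \(Q_3\).

```python
import statistics

# Made-up sample for illustration.
sample = [2, 4, 4, 5, 7, 9, 11]

variance = statistics.variance(sample)  # sample variance, n - 1 denominator
sd = statistics.stdev(sample)           # square root of the sample variance
data_range = max(sample) - min(sample)  # Q4 - Q0

# Quartile cut points Q1, Q2 (the median) and Q3.
q1, q2, q3 = statistics.quantiles(sample, n=4, method="inclusive")
iqr = q3 - q1

print(variance, sd, data_range, iqr)
```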
26.4 Inferential statistics for the spread of a data set:
Metric | Equation | Description | Example |
---|---|---|---|
Standard Error of the Mean (SEM) | \(SEM_{\overline{x}}=\frac{s}{\sqrt{n}}\) | ||
95% Confidence Interval (95% CI) | \(\overline{x}\pm1.96(\frac{s}{\sqrt{n}})=\overline{x}\pm1.96(SEM_{\overline{x}})\) or ideally, \(\overline{x}\pm t_{0.975,\,n-1}(SEM_{\overline{x}})\) | | |
Inferential statistics (\(SEM\) and \(95\%\ CI\)) are popular for error bars, since they can be much smaller than the standard deviation. The \(SEM\) is, by definition, always smaller than the half-width of the \(95\%\ CI\), which multiplies it by roughly 1.96 or more. The \(95\%\ CI\) is a range constructed so that, over repeated sampling, \(95\%\) of such intervals cover the true population parameter. The \(SEM\) estimates the standard deviation of the random variable \(\bar{X}\), i.e. of many sample means, and is primarily used because it is so small, not because it is a useful measure of the spread of a single sample.
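To make the relationship between the standard deviation, the \(SEM\) and the \(95\%\ CI\) concrete, here is a small sketch on a made-up sample. It uses the normal approximation (the factor 1.96) because the standard library offers no t distribution; for a sample this small, the t quantile with \(n-1\) degrees of freedom would be larger (about 2.45 for \(n=7\)) and the interval correspondingly wider.

```python
import math
import statistics
from statistics import NormalDist

# Made-up sample for illustration.
sample = [2, 4, 4, 5, 7, 9, 11]

n = len(sample)
mean = statistics.mean(sample)
sd = statistics.stdev(sample)  # sample standard deviation s
sem = sd / math.sqrt(n)        # standard error of the mean: s / sqrt(n)

# Normal approximation: the 97.5% quantile of the standard normal (about 1.96).
z = NormalDist().inv_cdf(0.975)
half_width = z * sem           # the length of one arm of the error bar
ci_low, ci_high = mean - half_width, mean + half_width

print(f"SD = {sd:.2f}, SEM = {sem:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

Even in this tiny example the error bar shrinks from the standard deviation to the CI half-width to the \(SEM\), which is why a figure legend has to state which of the three is shown.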
26.4.1 Summary of Types of Plots
The table below lists some of the most common types of relationships you will have to visualize, along with the most appropriate geometries. We will begin our discussion of plotting quantitative data by looking at uni-variate distributions. Comparing distributions will be the topic of interest in bi- and multivariate data visualisations, where the relationship between different variables, in addition to their individual distributions, is of interest.
A summary of the different types of graphs described in this workshop:
Relationship | Points | Lines | Bars | Boxes | Area | Matrix |
---|---|---|---|---|---|---|
Distribution | Strip charts | Density plots | Histograms | Box plots | Density plots | |
Categorical Comparisons | Dot plots | Parallel plots | Bar charts | Box plots | Violin plots | |
Rank Comparison | | | Bar charts | Box plots | | |
Time Series | | Time line, Regression lines | | | | |
Parts-of-a-whole | | | | | Pie charts, Stacked bar, Mosaic plots | |
Correlation | Scatter plots, Q-Q plots | Regression lines | | | 2D Density plots | |
Overlap between variables | | | | | Venn diagrams | |
Subsets of a variable | | | | | Euler diagrams | |
Multivariate summaries | | | | | | Heat maps, Correlation matrix, Scatter plot matrix |
26.4.2 Minimum Components of a Plot
Regardless of the plotting method used, there is a minimum amount of information your audience needs to understand and critique your research. If this information is not apparent from looking at your plot, you must provide it in the figure legend or in your oral description.
Metric | Description |
---|---|
Sample Size | How many observations are in each group? |
Replicates | Biological replicates are most appropriate for explanatory plots. Technical replicates may be of interest during method development. Do not confuse or misrepresent the two! See XX. |
Error Bars | If you have error bars, specify precisely what they represent. See XX. |
Units and Axis labels | Make sure all axes are properly labelled, specifying the units and any transformations to the data, when necessary. |
Statistics | If you include markings for statistical significance, specify what test was conducted and if you performed any multiple-testing correction. |
Experimental Details | Details such as strain/line/organism should be clearly stated. |
26.4.3 Regression Lines
There are two strategies for fitting a smoothed curve to a data set: parametric or non-parametric fitting. The choice of fitting algorithm affects not only how you analyse your results, but also the reader’s interpretation of them.
Parametric fitting relies on a predefined model, or equation. The fitting algorithm attempts to find the model’s best fit by adjusting the equation’s coefficients. Linear regression is a frequently used form of parametric fitting.4 In the example below, which shows Bradford assay calibration curves (the OD of samples with known protein concentrations is measured), different equations change the impression of the data.
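As a minimal sketch of parametric fitting, the following fits a straight line by ordinary least squares. The calibration data are made up for illustration (arbitrary units, lying exactly on a line) and are not real Bradford measurements; `fit_line` and `predict_conc` are hypothetical helper names. The choice of a linear model is itself an assumption: a different equation fitted to the same points would give different coefficients and a different impression of the data.

```python
# Hypothetical calibration data (made-up numbers, arbitrary units):
# known protein concentrations and the OD measured for each.
conc = [0.0, 1.0, 2.0, 3.0, 4.0]
od = [0.10, 0.30, 0.50, 0.70, 0.90]


def fit_line(x, y):
    """Ordinary least squares fit of the model y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# The two coefficients are what the fitting step adjusts.
intercept, slope = fit_line(conc, od)


def predict_conc(od_value):
    """Invert the fitted line to estimate a concentration from an OD reading."""
    return (od_value - intercept) / slope


print(f"OD = {intercept:.2f} + {slope:.2f} * concentration")
print(f"Estimated concentration at OD 0.60: {predict_conc(0.60):.2f}")
```

The fitted equation can then be used to predict values, which is the main appeal of parametric fitting, but only within the limits of how well the chosen model describes the data.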
Other dichotomous naming conventions include predictor and response variables in regression. Either may be continuous or categorical. In tidy data notation we have ID (always categorical) and measure variables (typically continuous).↩︎
For details on regression lines see the Statistical Literacy Workshop.↩︎
For normally distributed data, the mean and the standard deviation are the most appropriate measures of location and spread. Non-normally distributed data is not accurately represented by these measures; the median and IQR are more appropriate since they are robust to outliers.↩︎
Parametric fitting is popular because the model provides an equation that describes the data set. This can be used to predict values, but because the real function is normally not known, there is a risk of fitting a line that misrepresents the data.↩︎