26 Data Basics

26.1 Variable Classes

I use the term variable class to refer to two broad types of variables:

	Independent	Dependent
Description	Fixed parameters studied by the experimenter	What you have measured. They change according to the state of the independent variables.
Example	Genotypes, developmental stages, cell lines, growth conditions	Expression levels, blood pressure, intensity, presence/absence of a substance
Typical axis	$x$	$y$

Not every plot will have an independent and a dependent variable. Many plots only contain one variable, or multiple dependent variables. Although the typical orientation is y as a function of x, in which case we plot the dependent variable on the y axis, the orientation may change in specific cases (see XX).

26.2 Data Classes

Broadly speaking, data fall into one of two classes.¹

	Continuous	Categorical
AKA	Quantitative	Qualitative, discrete, factor
Description	Can take any numeric value within a range	Distinct groups that differ in qualities
Example	Weight, continuous time, gene expression	Location, genotype, time point

The class of a variable is malleable: it can change depending on how we understand it. For example, we can break up a continuous variable, like p-values for many tests, into those below and those above a certain threshold.

Sometimes we convert the type when plotting. For example, we transform a continuous variable into a discrete variable with artificial categories when we apply a binning statistic for histograms.
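Both conversions can be sketched in a few lines of Python. This is only an illustration: the p-values are made up, and the threshold `alpha`, the bin `edges` and the helper `bin_counts` are assumptions introduced here, not part of any particular plotting library.

```python
from bisect import bisect_right

# Made-up p-values, purely for illustration.
p_values = [0.001, 0.02, 0.049, 0.05, 0.2, 0.75]

# Thresholding: a single cut point turns a continuous variable into a binary one.
alpha = 0.05  # assumed threshold
significant = ["below" if p < alpha else "above" for p in p_values]

# Binning: several cut points turn it into an ordered categorical variable.
edges = [0.0, 0.25, 0.5, 0.75, 1.0]  # assumed bin edges


def bin_counts(values, edges):
    """Count values per bin.

    Bins are [lo, hi), except the last one, which also includes its right edge.
    Values are assumed to lie between edges[0] and edges[-1].
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        # bisect_right locates the last edge that is <= v.
        i = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += 1
    return counts


counts = bin_counts(p_values, edges)
```

The bin counts are what a histogram draws as bar heights; changing `edges` changes the categories, and with them the impression of the distribution.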

There are three typical scales of categorical variables: binary, nominal and ordinal. Binary is the most basic type of data we can have. The scales are defined according to the properties in the following table, which also includes an interval scale, discussed below:

Table: Types of Categorical Scales.

Scale	Ordered	Quantitative	Number of levels/groups	Example
Binary	-	-	$2$	Status: Present, Absent
Nominal	-	-	$>2$	Location: Berlin, Paris, London
Ordinal	Y	-	$\geq{2}$	Severity: low, medium, high
Interval	Y	Y	$\geq{2}$	Time: 0h, 24h, 48h, 96h

Traditionally, interval does not refer to categorical variables. Rather, it is used to distinguish continuous variables that do not have a natural zero from those that do, termed ratio variables. Although this is an interesting distinction for data analysis, I find it useful, in the context of data visualization, to use the term for categorical variables that are quantitative in addition to ordered. That is, an axis may be ordinal, but does the distance between the categories, or even the size of each category, contain information as well? See example XX, below.
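As a small sketch of the difference, consider the time points from the table above. Read as ordinal, only their order matters and they sit at evenly spaced axis positions. Read as interval, their numeric values set the positions. The variable names below are my own and purely illustrative.

```python
# Time points from the table above, treated as categories.
labels = ["0h", "24h", "48h", "96h"]

# Ordinal reading: only the order matters, so categories are evenly spaced.
ordinal_positions = list(range(len(labels)))

# Interval reading: the numeric value of each label sets its axis position.
interval_positions = [float(label.rstrip("h")) for label in labels]

# Gaps between neighbouring categories under each reading.
ordinal_gaps = [b - a for a, b in zip(ordinal_positions, ordinal_positions[1:])]
interval_gaps = [b - a for a, b in zip(interval_positions, interval_positions[1:])]
```

Under the ordinal reading every gap is the same. Under the interval reading the last gap is twice as wide as the others, which is exactly the information an evenly spaced axis would hide.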

26.3 Descriptive Statistics for Continuous Variables

For continuous variables, the choice of descriptive statistics depends on the distribution of the data.²

26.3.1 Measurements for the location of a data set:

Metric	Equation	Description
Mode		The most common value of a data set
Median		The second quartile, or central value, of a data set
Mean	$\overline{x}=\frac{\sum_{i=1}^{n}x_{i}}{n}$	The arithmetic mean
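The three measures of location can be computed with Python's standard `statistics` module. The data below are made up and include one deliberately extreme value to show how differently the measures respond to it.

```python
import statistics

# Made-up sample with one extreme value (18).
data = [2, 3, 3, 4, 5, 7, 18]

mode = statistics.mode(data)      # most common value
median = statistics.median(data)  # central value of the sorted data
mean = statistics.mean(data)      # sum of the values divided by n
```

Here the mode is 3, the median is 4 and the mean is 6: the single large value pulls the mean upwards, while the median stays near the bulk of the data.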

26.3.2 Measurements for spread:

Metric	Equation	Description
Sample Variance ($s^{2}$)	$\frac{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}{n-1}$
Sample Standard Deviation ($s$)	$\sqrt{s^{2}}$
Range	$Q_4-Q_0$	The maximum value minus the minimum value (Q4-Q0).
Interquartile Range (IQR)	$Q_3-Q_1$	The width of the interval containing the middle 50% of the sample.
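A minimal sketch of these measures of spread, again using the standard `statistics` module on made-up data. Note that quartiles can be estimated in several ways, so the IQR reported by different tools may differ slightly; the call below uses the module's default method.

```python
import statistics

# Made-up sample, purely for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

variance = statistics.variance(data)  # sample variance, n - 1 in the denominator
sd = statistics.stdev(data)           # sample standard deviation, sqrt(variance)
data_range = max(data) - min(data)    # maximum minus minimum

# Quartiles Q1, Q2 (the median) and Q3, using the module's default method.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
```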

26.4 Inferential statistics for the spread of a data set:

Metric	Equation	Description	Example
Standard Error of the Mean (SEM)	$SEM_{\overline{x}}=\frac{s}{\sqrt{n}}$
95% Confidence Interval (95% CI)	$1.96(\frac{s}{\sqrt{n}})=1.96(SEM_{\overline{x}})$ or ideally, $t_{n-1}(SEM_{\overline{x}})$

Inferential statistics ($SEM$ and $95\%\ CI$) are popular for error bars, since they can be much smaller than the standard deviation. The $SEM$ is, by definition, always smaller than the half-width of the $95\%\ CI$, which is roughly $1.96$ times the $SEM$. The $95\%\ CI$ is a range constructed so that, over repeated sampling, it covers the true population parameter in $95\%$ of cases. The $SEM$ is the standard deviation of the random variable $\bar{X}$, i.e. of many sample means, and is primarily used because it is so small, not because it is a useful measure of the spread of a single sample.
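The relationship between the standard deviation, the $SEM$ and the $95\%\ CI$ can be sketched as follows. The data are made up, and the interval uses the normal approximation (the factor of about 1.96) rather than the $t$ distribution, because the standard library has no $t$ quantile function.

```python
import math
import statistics

# Made-up measurements, purely for illustration.
data = [4.1, 4.8, 5.0, 5.3, 5.9, 6.2, 6.4, 7.1]

n = len(data)
mean = statistics.mean(data)
s = statistics.stdev(data)   # sample standard deviation
sem = s / math.sqrt(n)       # standard error of the mean

# Normal approximation: the 97.5% quantile of the standard normal, about 1.96.
# With SciPy installed, scipy.stats.t.ppf(0.975, n - 1) would give the t-based
# factor instead, which is larger for small samples.
z = statistics.NormalDist().inv_cdf(0.975)
half_width = z * sem
ci = (mean - half_width, mean + half_width)
```

Whatever the data, `sem` is smaller than `s` as soon as there is more than one observation, and `half_width` is always larger than `sem`.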

26.4.1 Summary of Types of Plots

The table below lists some of the most common types of relationships you will have to visualize, along with the most appropriate geometries. We will begin our discussion of plotting quantitative data by looking at univariate distributions. Comparing distributions will be the topic of interest in bi- and multivariate data visualisations, where the relationship between different variables, in addition to their individual distributions, is of interest.

A summary of the different types of graphs described in this workshop:

Relationship	Points	Lines	Bars	Boxes	Area	Matrix
Distribution	Strip charts	Density plots	Histograms	Box plots	Density plots
Categorical Comparisons	Dot plots	Parallel plots	Bar charts	Box plots	Violin plots
Rank Comparison			Bar charts	Box plots
Time Series		Time line, Regression lines
Parts-of-a-whole					Pie charts, Stacked bar, Mosaic Plots
Correlation	Scatter plots, Q-Q plots	Regression lines			2D Density plots
Overlap between variables					Venn diagrams
Subsets of a variable					Euler diagrams
Multivariate summaries						Heat maps, Correlation matrix, Scatter plot matrix

26.4.2 Minimum Components of a Plot

Regardless of the plotting method used, there is a minimum amount of information your audience needs to understand and critique your research. If this information is not apparent from looking at your plot, you must provide it in the figure legend or in your oral description.

Metric	Description
Sample Size	How many observations are in each group?
Replicates	Biological replicates are most appropriate for explanatory plots. Technical replicates may be of interest during method development. Do not confuse or misrepresent the two! See XX.
Error Bars	If you have error bars, specify precisely what they represent (see XX).
Units and Axis labels	Make sure all axes are properly labelled, specifying the units and any transformations to the data, when necessary.
Statistics	If you include markings for statistical significance, specify what test was conducted and if you performed any multiple-testing correction.
Experimental Details	Details such as strain/line/organism should be clearly stated.

26.4.3 Regression Lines

There are two strategies for fitting a smoothed curve to a data set: parametric and non-parametric fitting. The choice of fitting algorithm affects not only how you analyse your results, but also how the reader interprets them.

Parametric fitting relies on a predefined model, or equation. The fitting algorithm attempts to find the model’s best fit by adjusting the equation’s coefficients. Linear regression is a frequently used form of parametric fitting.⁴ In the example below, showing Bradford assay calibration curves, where the OD of samples with known protein concentrations is measured, different equations change the impression of the data.
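To make the idea of adjusting coefficients concrete, here is a minimal sketch that fits two different equations to the same calibration-style data by least squares. The numbers are made up and are not the values behind the Bradford example. The helper names `fit_line`, `fit_through_origin` and `rss` are hypothetical, and the choice of these two models is my own illustration.

```python
# Made-up calibration data (NOT the values behind the example in the text).
conc = [0.0, 0.25, 0.5, 1.0, 1.5, 2.0]     # protein concentration, mg/mL
od = [0.05, 0.18, 0.30, 0.52, 0.70, 0.85]  # measured optical density


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def fit_through_origin(x, y):
    """Least-squares fit of y = slope * x (intercept forced to zero)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def rss(observed, fitted):
    """Residual sum of squares: how far the fitted values are from the data."""
    return sum((a - b) ** 2 for a, b in zip(observed, fitted))


# Model 1: straight line with a free intercept.
intercept, slope = fit_line(conc, od)
fitted_line = [intercept + slope * c for c in conc]

# Model 2: straight line forced through the origin.
slope_origin = fit_through_origin(conc, od)
fitted_origin = [slope_origin * c for c in conc]

rss_line = rss(od, fitted_line)
rss_origin = rss(od, fitted_origin)
```

Both curves are best fits for their own equation, yet they give different slopes and therefore different predicted concentrations for the same OD. The equation is chosen by the analyst, not by the data.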

⁵

Other dichotomous naming conventions include predictor and response variables in regression. Either may be continuous or categorical. In tidy data notation we have ID (always categorical) and measure variables (typically continuous).↩︎
For details on regression lines see the Statistical Literacy Workshop.↩︎
For normally distributed data, the mean and the standard deviation are the most appropriate measures of location and spread. Non-normally distributed data is not accurately represented by these measures; the median and IQR are more appropriate since they are robust to outliers.↩︎
Parametric fitting is popular because the model provides an equation that describes the data set. This can be used to predict values, but because the real function is normally not known, there is a risk of fitting a line that misrepresents the data.↩︎
For details on regression lines see the Statistical Literacy Workshop.↩︎