26  Data Basics

26.1 Variable Classes

I use the term variable class to refer to two broad types of variables:

| | Independent | Dependent |
|---|---|---|
| Description | Fixed parameters studied by the experimenter | What you have measured. They change according to the state of the independent variables. |
| Example | Genotypes, developmental stages, cell lines, growth conditions | Expression levels, blood pressure, intensity, presence/absence of a substance |
| Typical axis | \(x\) | \(y\) |

Not every plot will have an independent and a dependent variable. Many plots only contain one variable, or multiple dependent variables. Although the typical orientation is y as a function of x, in which case we plot the dependent variable on the y axis, the orientation may change in specific cases (see XX).

26.2 Data Classes

Broadly speaking, there are two ways in which we can describe the class of a data set.1

| | Continuous | Categorical |
|---|---|---|
| AKA | Quantitative | Qualitative, discrete, factor |
| Description | Can take any numeric value within a range | Distinct groups that differ in qualities |
| Example | Weight, continuous time, gene expression | Location, genotype, time point |

A data class is malleable: it can change depending on how we understand the variable. For example, we can break up a continuous variable, like the p-values from many tests, into those below and those above a certain threshold.

Sometimes we convert the class when plotting. For example, when we apply a binning statistic for a histogram, we transform a continuous variable into a discrete variable with artificial categories.
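Both conversions can be sketched in a few lines of Python. This is only an illustration: the p-values, the 0.05 threshold, the four equal-width bins and the function names `dichotomize` and `bin_counts` are assumptions made for this example, not part of the workshop material.

```python
# Hypothetical p-values, invented for illustration only.
p_values = [0.001, 0.02, 0.049, 0.05, 0.2, 0.73]


def dichotomize(values, threshold=0.05):
    """Turn a continuous variable into two categories around a threshold."""
    return ["below" if v < threshold else "at/above" for v in values]


def bin_counts(values, lower, upper, n_bins):
    """Count values per equal-width bin on [lower, upper], as a histogram does."""
    width = (upper - lower) / n_bins
    counts = [0] * n_bins
    for v in values:
        if not lower <= v <= upper:
            continue  # values outside the binning range are ignored
        # The last bin is closed on the right so that `upper` itself is counted.
        index = min(int((v - lower) / width), n_bins - 1)
        counts[index] += 1
    return counts


labels = dichotomize(p_values)
counts = bin_counts(p_values, lower=0.0, upper=1.0, n_bins=4)
print(labels)
print(counts)
```

In both cases the categories are a product of our choices (threshold, bin width), which is why a different choice can change the impression the plot gives.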

There are three typical scales of categorical variables: binary, nominal and ordinal. Binary is the most basic type of data we can have. The scales are defined according to the properties in the following table, which also lists an interval scale that is discussed below:

Table: Types of Categorical Scales.

| Scale | Ordered | Quantitative | Number of levels/groups | Example |
|---|---|---|---|---|
| Binary | - | - | \(2\) | Status: Present, Absent |
| Nominal | - | - | \(>2\) | Location: Berlin, Paris, London |
| Ordinal | Y | - | \(\geq{2}\) | Severity: low, medium, high |
| Interval | Y | Y | \(\geq{2}\) | Time: 0h, 24h, 48h, 96h |

Traditionally, interval does not refer to categorical variables. Rather, it is used to distinguish continuous variables that do not have a natural zero from those that do, which are termed ratio variables. Although this is an interesting distinction for data analysis, I find it useful, in the context of data visualization, to use interval for categorical variables that are quantitative in addition to ordered. That is, an axis may be ordinal, but does the distance between the categories, or even the size of each category, contain information as well? See example XX, below.
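As a separate, minimal illustration (not the example referred to above), the following Python sketch uses the time points from the table to show what changes when an axis is treated as interval rather than ordinal. The variable names and the helper `gaps` are hypothetical.

```python
# Time points from the table of categorical scales, in hours.
time_points_h = [0, 24, 48, 96]

# Ordinal treatment: only the order matters, so the levels are evenly spaced.
ordinal_positions = list(range(len(time_points_h)))

# Interval treatment: the axis positions are the values themselves.
interval_positions = list(time_points_h)


def gaps(positions):
    """Distances between neighbouring axis positions."""
    return [b - a for a, b in zip(positions, positions[1:])]


print(gaps(ordinal_positions))   # [1, 1, 1]
print(gaps(interval_positions))  # [24, 24, 48]
```

On the ordinal axis the step from 48h to 96h looks the same as the step from 0h to 24h. On the interval axis it is twice as wide, so the spacing itself carries information.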

26.3 Descriptive Statistics for Continuous Variables

For continuous variables, the choice of descriptive statistics depends on the distribution of the data.2

26.3.1 Measurements for the location of a data set:

| Metric | Equation | Description | Example |
|---|---|---|---|
| Mode | | The most common value of a data set | |
| Median | | The second quartile, or central value, of a data set | |
| Mean | \(\overline{x}=\frac{\sum_{i=1}^{n}x_{i}}{n}\) | The arithmetic mean | |

3
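The three measures of location can be computed with Python's standard `statistics` module. This is a minimal sketch; the values are invented for illustration and include one large observation to show how the measures react to it.

```python
import statistics

# Hypothetical measurements with one large value, for illustration only.
values = [2, 3, 3, 4, 5, 6, 40]

mode = statistics.mode(values)      # 3, the most frequent value
median = statistics.median(values)  # 4, the central value of the sorted data
mean = statistics.mean(values)      # 63 / 7 = 9, pulled upward by the value 40

print(f"mode={mode}, median={median}, mean={mean}")
```

The single large value moves the mean well away from the bulk of the data, while the mode and median stay put.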

26.3.2 Measurements for spread:

| Metric | Equation | Description | Example |
|---|---|---|---|
| Sample Variance (\(s^{2}\)) | \(\frac{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}{n-1}\) | | |
| Sample Standard Deviation (\(s\)) | \(\sqrt{s^{2}}\) | | |
| Range | \(Q_4-Q_0\) | The maximum value minus the minimum value. | |
| Interquartile Range (IQR) | \(Q_3-Q_1\) | The width of the interval that contains the middle 50% of the sample. | |
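The measures of spread in the table can be sketched with the standard `statistics` module. The data are hypothetical. Several conventions exist for computing quartiles; the `"inclusive"` method used here is only one of them, so other software may report slightly different values for \(Q_1\) and \(Q_3\).

```python
import statistics

# Hypothetical sample, for illustration only.
values = [2, 4, 4, 5, 7, 9, 11]

variance = statistics.variance(values)   # sample variance, n - 1 denominator
std_dev = statistics.stdev(values)       # square root of the sample variance
value_range = max(values) - min(values)  # Q4 - Q0

# Quartile cut points Q1, Q2 (the median) and Q3.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

print(f"variance={variance}, sd={std_dev:.3f}, range={value_range}, IQR={iqr}")
```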

26.4 Inferential statistics for the spread of a data set:

| Metric | Equation | Description | Example |
|---|---|---|---|
| Standard Error of the Mean (SEM) | \(SEM_{\overline{x}}=\frac{s}{\sqrt{n}}\) | | |
| 95% Confidence Interval (95% CI) | \(1.96(\frac{s}{\sqrt{n}})=1.96(SEM_{\overline{x}})\) or ideally, \(t_{n-1}(SEM_{\overline{x}})\) | | |

Inferential statistics (\(SEM\) and \(95\%\ CI\)) are popular for error bars, since they can be much smaller than the standard deviation. The \(SEM\) is, by definition, always smaller than the half-width of the \(95\%\ CI\), because that half-width is the \(SEM\) multiplied by \(1.96\) or more. The \(95\%\ CI\) is a range constructed so that, over many repeated samples, \(95\%\) of such intervals cover the true population parameter. The \(SEM\) is the standard deviation of the random variable \(\bar{X}\), i.e. of many sample means, and is primarily used because it is so small, not because it is a useful measure of the spread of a single sample.
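To make the size differences concrete, here is a minimal Python sketch that computes the standard deviation, the SEM and the normal-approximation 95% CI for one hypothetical sample. The function names `sem` and `ci95_half_width` are assumptions for this example. The \(t_{n-1}\) multiplier, which is larger than \(1.96\) for small samples, is not computed because the standard library has no quantile function for the t distribution.

```python
import math
import statistics


def sem(values):
    """Standard error of the mean: s / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


def ci95_half_width(values):
    """Half-width of the 95% CI using the normal multiplier (about 1.96)."""
    z = statistics.NormalDist().inv_cdf(0.975)
    return z * sem(values)


# Hypothetical sample, for illustration only.
values = [2, 4, 4, 5, 7, 9, 11]

mean = statistics.mean(values)
s = statistics.stdev(values)
se = sem(values)
half = ci95_half_width(values)

print(f"s = {s:.2f}, SEM = {se:.2f}")
print(f"95% CI = {mean - half:.2f} to {mean + half:.2f}")
```

For this sample the ordering is SEM < CI half-width < standard deviation, which is why the choice of error bar must always be stated.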

26.4.1 Summary of Types of Plots

The table below lists some of the most common types of relationships you will have to visualize, along with the most appropriate geometries. We will begin our discussion of plotting quantitative data by looking at univariate distributions. Comparing distributions will be the topic of interest in bi- and multivariate data visualisations, where the relationship between different variables, in addition to their individual distributions, is of interest.

A summary of the different types of graphs described in this workshop:

Relationship Points Lines Bars Boxes Area Matrix
Distribution Strip charts Density plots Histograms Box plots Density plots
Categorical Comparisons Dot plots Parallel plots Bar charts Box plots Violin plots
Rank Comparison Bar charts Box plots
Time Series Time line, Regression lines
Parts-of-a-whole Pie charts, Stacked bar, Mosaic Plots
Correlation Scatter plots, Q-Q plots Regression lines 2D Density plots
Overlap between variables Venn diagrams
Subsets of a variable Euler diagrams
Multivariate summaries Heat maps, Correlation matrix, Scatter plot matrix

26.4.2 Minimum Components of a Plot

Regardless of the plotting method used, there is a minimum amount of information your audience needs to understand and critique your research. If this information is not apparent from looking at your plot, you must provide it in the figure legend or in your oral description.

| Component | Description |
|---|---|
| Sample Size | How many observations are in each group? |
| Replicates | Biological replicates are most appropriate for explanatory plots. Technical replicates may be of interest during method development. Do not confuse or misrepresent the two! See XX. |
| Error Bars | If you have error bars, specify precisely what they represent; see XX. |
| Units and Axis labels | Make sure all axes are properly labelled, specifying the units and any transformations to the data, when necessary. |
| Statistics | If you include markings for statistical significance, specify what test was conducted and if you performed any multiple-testing correction. |
| Experimental Details | Details such as strain/line/organism should be clearly stated. |

26.4.3 Regression Lines

There are two strategies for fitting a smoothed curve to a data set: parametric and non-parametric fitting. The choice of fitting algorithm affects not only how you analyse your results, but also the reader’s interpretation of them.

Parametric fitting relies on a predefined model, or equation. The fitting algorithm attempts to find the model’s best fit by adjusting the equation’s coefficients. Linear regression is a frequently used form of parametric fitting.4 In the example below, which shows Bradford assay calibration curves (the OD of samples with known protein concentrations is measured), different equations change the impression of the data.

5
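The effect of the chosen equation can be sketched in plain Python. This is not the workshop's original figure: the readings below are hypothetical, and the two models (a straight line with an intercept, and a straight line forced through the origin) are simply two plausible choices for a calibration curve. The function names `fit_line` and `fit_line_through_origin` are assumptions for this example.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def fit_line_through_origin(x, y):
    """Least squares for y = slope * x, with the intercept fixed at zero."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)


# Hypothetical calibration readings that flatten at high concentration.
concentration = [0.0, 0.25, 0.5, 1.0, 1.5, 2.0]
absorbance = [0.00, 0.17, 0.32, 0.58, 0.76, 0.88]

intercept, slope = fit_line(concentration, absorbance)
slope_origin = fit_line_through_origin(concentration, absorbance)

# The same unknown reading maps to a different concentration under each model.
unknown_od = 0.45
print((unknown_od - intercept) / slope)
print(unknown_od / slope_origin)
```

Both fits use the same data and the same least-squares criterion, yet they return different coefficients and therefore different predicted concentrations. Neither captures the flattening at high concentrations, which is the risk of imposing a model whose true form is unknown.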


  1. Other dichotomous naming conventions include predictor and response variables in regression. Either may be continuous or categorical. In tidy data notation we have ID (always categorical) and measure variables (typically continuous).↩︎

  2. For details on regression lines see the Statistical Literacy Workshop.↩︎

  3. For normally distributed data, the mean and the standard deviation are the most appropriate measures of location and spread. Non-normally distributed data is not accurately represented by these measures; the median and IQR are more appropriate since they are robust to outliers.↩︎

  4. Parametric fitting is popular because the model provides an equation that describes the data set. This can be used to predict values, but because the real function is normally not known, there is a risk of fitting a line that misrepresents the data.↩︎

  5. For details on regression lines see the Statistical Literacy Workshop.↩︎