27 Using Design to Communicate

William S. Cleveland describes constructing graphs as an exercise in visual encoding. The reader must use visual decoding to interpret the graph, which he refers to as the process of graphical perception. In this section I’ll introduce you to some design aspects of graphical perception that aid in producing statistical plots.

27.1 Small Multiples

Edward Tufte introduced the idea of small multiples as a way of presenting sub-sets of a data set in comparable formats. The concept relies on keeping the design of related graphs identical so that the user can focus on changes in the data.

Figure 27.1: Six overlapping trend lines on a single plot are overwhelming for the viewer.

Using small multiples (fig. @ref(fig:datingFacets)), each sub-set is plotted in its own identical plotting space. We refer to each sub-plot as a facet.¹ We can more easily distinguish the six trend lines, but making direct comparisons is a bit more difficult. In this case color is redundant, so all points have the same color.

Small multiples separate each sub-set into its own plot, making it easier to see individual trends.

27.2 Data-Ink Ratio

Edward Tufte also strongly advocates maximizing the data-ink ratio. The data:ink ratio is the ratio of data ink to total ink and considers how much ink is actually necessary, as opposed to being decorative. In short, the amount of non-data ink should be minimized. This is in line with classic design principles of less is more and simplicity, however there doesn’t appear to be any emperical support that removing non-data ink aids in visual communication. In my experience, plots with a high data:ink ratio do tend to look more professional and are more apealing. My approach is not as extreme as Tufte, who is quite ruthless, removing even axis lines. I advocate uncluttered and simple charts as opposed to minimal charts.

Every element on your plot (data and non-data) has to justify its presence

The advantage of reducing the information content on each facet is that we now have room to add more interesting variables. In this case, we can add information on the regression line and correlation coefficient.

27.3 Aspect Ratio

A plot’s proportions can have a large effect on how data is perceived. Therefore, finding the optimal aspect ratio (height divided by weight) is of particular interest. Fig. @ref(fig:aspectRatio) is a classic example from William S. Cleveland. Here we have emperical data going back to 1750 $mdash; the number of sun spots which appear on the surface of the sun. There are three clear trends in this data. In the upper plot, we can see two trends.

First, sun spots appear and disappear in a period of approximately 11-years. The beginning and end of these periods has been recorded and we’ll get to the individual trends shortly. Second, there are global rises and dips in the peaks and throughs.

The upper plot has an aspect ratio of 1:1, meaning that one unit on the y axis takes up the same physical space as one unit on the x axis. this is highlighted by the red, perfectly square, box. In this case, there’s no reason for a 1:1 aspect ratio. The axes are on completely different scales $mdash; the number of sun spots and the date.

Reducing the aspect ratio to 1:2 (0.5, fig. @ref(fig:aspectRatio) middle plot) or even further, to a much smaller aspect ratio of 1:20 (0.05, fig. @ref(fig:aspectRatio) lower plot), allows us to see the relationshop between sun spot peak and intensity. The higher the number of sun spots in a cycle are, the my asymmetrical it is. That is, sun spots take a longer time to disappear than they do to appear in high-peack cycles.

265 years of sun spot counts. The very low aspect ratio of the lower plot helps show the recurring trend that sun spots take longer to disappear than they do to appear.

Although the relationship between intensity and symmetry in sun spot oscillations is a well-observed phenomena, it is admittedly still somewhat difficult to see in low aspect ratio plot of fig. @ref(fig:aspectRatio). Now that we have a clear message we want to communicate, let’s see if we can make a plot that emphasizes that point.

We can reduce the complexity even further by plotting only wlhat we’re interested in. Here, for each cycle, we can plot the maximum number of sun spots and distance from the middle of the cycle. As we expected, the higher the maximum number is, the earlier the peak occurs.

Althouch Cleveland tried to implement an analytical solution to choosing the optimal aspect ratio, it was not applicable in all cases and was never widely adopted. This means you’re at your own discretion to choosing an aspect ratio.

Aspect ratios must not distort the data to the point of deception (e.g. emphasizing or eliminating trends).

Except in some rare circumstances, use an aspect ratio of 1 when the axes have equivalent units (i.e. equivalent physical distances reflect equivalent scales).

27.4 Elements for Encoding Continuous Variables

The optimal encoding of continuous variables uses a common scale, such as multiple dots in a scatter plot on common X and Y axes. Less useful encoding elements are area, angle and length without a common scale². The least useful encoding element is the colour spectrum, however there are some exceptions (see discussion on page pageref sub:Smoothed-scatter).

27.5 Elements for Encoding Categorical Variables

The visual encoding of categorical variables is very flexible since these variables are likely to be qualitative (representing a group) rather than quantitative (representing an actual value on a scale). In general the choice will depend on what is being encoded.

Sequential colours are preferred for ordered variables, since there is an inherent ordering to the categories.

Shape outlines are ideal for dense data sets (preferentially open circles

Filled shapes are suitable if there is no over-plotting or if alpha-blending is used (see page Alpha-blending).

Direct labels can be combined with the other elements to highlight specific data points (see direct line labeling in figure irrigation-good.

Hatching and colours can be used to distinguish groups in bar and box plots.

Elements for encoding categorical variables.

27.6 The Case Against Pie Charts and Heat maps

Fig @red{fig:Cont-Encoding} tells us why pie charts and heat maps, are not optimal for visualization of quantitative information. Both rely on encoding elements that are poorly decoded by the viewer. Here I’ll consider the two types of pie charts, when they are used and some great alternatives.

27.6.1 Type 1: Different observations, same variables

A common type of pie chart is what I call the different observations, same variables variant. An example is given in fig. @ref(fig:Elections2017MakebarChartSplit).

In the 2017 federal elections in Germany, 16 bundesländer (states) took part in voting for 5 well-established parties (the incumband CDU and the Bavarian sister party CSU, SPD, Linke, Grüne and FDP), one young party (AfD) and a smattering of other regional parties. Here different observations (i.e. voters) of the same variables (i.e. political parties) are plotted. The pie slices are colored according to the party colors, making interpretation at least somewhat intuitive for the informed viewer. The pie charts are arranged according to CDU/CSU results, instead of alphabetically, allowing us to include another layer of information.

16 pie charts, each consisting of 6 differnt colors is overwhelming for the reader. It’s also difficult to make comparisons between distant sub-plots and with slices that are in different orientations.

The classic solution to a collection of different observations, same variables pie charts is a stack of horizontal, proportional bar charts, fig. @ref(fig:Elections2017barChart). Here, many of the inefficiencies of decoding a pie chart are alleviated. Each chart begins and ends at the same point and the only difficulty is comparing segments in the middle, which begin and end at differnt points on the scale (i.e. Common, but unaligned scale). It’s a deficiency we can live with since it does improve readability considerably.

Instead of ordering the bars according to CDU/CSU results, we’ll use AfD results. The astonishing results of the young, right-wing, AfD party was mostly unexpected and fig. @ref(fig:Elections2017barChart) allows us to see that, for example, they achieved approximately 25% of the vote in Sachsen.

16 stacked horizontal proportional bar charts contains a lot of information in an easy to read format.

Anoter level of information that is easy to convey here is the distinction between the alte, formerly West German, and the neue, formerly East German, Bundesländer, depicted in @ref(fig:electionBarSplit). This allows us to see a clear distinction between the “old” and “new” states. Reunification didn’t occur that long ago and it appears that distinctions between the former German states still persist.

It’s easier to introduce another variable via small multiples when our plots are already well organized.

We’ve already move quit a bit away from pie charts, which means we can start thinking about introducing other interesting variables. For example, pie charts are really terrible for looking at a change over time. Allowing for a few exceptions, time is typically mapped onto the x axis and drawn with a line. Fig. @ref(fig:Elections2017timeSeries) allows us to compare the 2017 election results with those from 2013. Here, I’ve oped to arrange the individual sub-plots in alphabetical order.

27.6.2 Type 2: Same observations, Different variables

The other type of pie chart is the same observations, different variables variant. This one is a bit more tricky to work with, but there is an elegant solution. In @ref(fig:Bayerpiecharts) we have three pie charts which I’ve adapted from a review article. This review article reported efforts from bayer Pharmaceuticals headquarters in Berlin to reproduce a number of studies in various biological research fields. In the original review, Bayer reported that only about a quarter of the 67 studies surveyed could be reproduced in their hands. They rightly highlight reproducibility as an important topic, so it’s a bit dismaying that they didn’t provide the actual data for their figure. Here, I’ve simulated some values that fit the struture. Each pie chart depicts the same observations but different variables.

Same observation, different variables pie charts.

Aside from the aforementioned difficulties in reading pie charts, this collection is disappointing because a very important and interesting question of the study is obscured. We want to know the story of each study assessed here. That is what value does each observation take in each of the three pie charts. It’s just not possible to convey this in three discrete pie charts. The solution would be a Sankey diagram, but that is outside the scope of this course.

Note, this is very distinct from plotting six separate plots!↩︎
Area, angle and length are used with pie charts, which is primarily why they are not recommended.↩︎