3.1 Data Visualization

Goal: Build skills and knowledge for making publication-quality data visualizations. The main things to keep in mind are “who is your audience?” and “what do you want to communicate?”.

Outline:

  • Introduce the matplotlib package

  • Make an ugly default plot

  • Go beyond default

  • Explore other types of plots

  • Look at a few ways to plot data uncertainty

Additional Assigned Reading

Rougier, N. P., Droettboom, M., & Bourne, P. E. (2014). Ten Simple Rules for Better Figures. PLOS Computational Biology.

Visualization

Effective data visualization helps with data analysis and interpretation, and it is also crucial for communicating your findings. matplotlib is the primary Python package for plotting. The matplotlib website has extensive documentation, tutorials, and gallery examples.

Watch this video about making plots.

Basic Default Plot

After importing the matplotlib package and loading or calculating the variables you wish to plot, you call matplotlib functions to create your figure. fig, ax = plt.subplots() creates the figure object with a single set of axes. ax.plot(x,y) adds the data you want to plot to the axes. plt.show() displays the plot; in our Jupyter Notebook environment you don’t need this line, but you would if you were writing Python code outside of a notebook, so I included it here.

import matplotlib
import matplotlib.pyplot as plt
import numpy as np

# Data for plotting, a simple sine wave
t = np.arange(0.0, 2.0, 0.01)
s = 1 + np.sin(2 * np.pi * t)

fig, ax = plt.subplots()
ax.plot(t, s)
plt.show()
../_images/W3_visualize_2_0.png

This quick & simple plot may be useful for checking your data, but it’s nearly useless for communication.


There are many possible figure components you should consider when making your figure.

Figure: Matplotlib figure features

For example, we should add labels to our simple figure. This can be done with ax.set() or with plt.xlabel() and plt.ylabel(). We’ll also add a grid with ax.grid().

import matplotlib
import matplotlib.pyplot as plt
import numpy as np

# Data for plotting, a simple sine wave
t = np.arange(0.0, 2.0, 0.01)
s = 1 + np.sin(2 * np.pi * t)

fig, ax = plt.subplots()
ax.plot(t, s)

ax.set(xlabel='t', ylabel='s', title='Bare Necessities Plot')
ax.grid()

fig.savefig("test.png")
../_images/W3_visualize_5_0.png

There, we have done the bare minimum to have a readable figure. It can be saved with fig.savefig(). But we should continue and learn more tools for making good, publication-quality figures.
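As a minimal sketch of saving a figure for a paper or poster, the cell below sets the figure size when the figure is created and passes a resolution to fig.savefig(). The file name sine_wave.png and the particular figsize and dpi values are assumptions chosen for illustration, not requirements; journals usually state their own.

```python
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np

# Same sine wave as above
t = np.arange(0.0, 2.0, 0.01)
s = 1 + np.sin(2 * np.pi * t)

# figsize is (width, height) in inches
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(t, s)
ax.set(xlabel='t', ylabel='s', title='Bare Necessities Plot')
ax.grid()

# Hypothetical output file name; the extension sets the file format
out_path = Path("sine_wave.png")

# dpi sets the resolution; bbox_inches="tight" trims extra white space
fig.savefig(out_path, dpi=300, bbox_inches="tight")
```

Saving with a .pdf or .svg extension instead would give a vector file, which scales without becoming pixelated.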

Using Additional Plot Features

There are many strategies for making good scientific figures. The primary goal is clarity rather than aesthetics, but bad plots are also often ugly. The first thing to consider is your axes: each should have a descriptive label that includes units. Increasing the font size of the labels also makes them easier to read.

# We can use the same data again, we don't need to import matplotlib or declare t and s again
fig, ax = plt.subplots()
ax.plot(t, s)

ax.set_title('Simple Plot', fontsize=20)
ax.set_xlabel('Time (s)', fontsize=16)
ax.set_ylabel('Wave Height (m)', fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
ax.grid()
../_images/W3_visualize_8_0.png
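The cell above mixes ax methods with plt.xticks() and plt.yticks(), which act on whichever axes is currently active. As a sketch of an alternative that stays entirely on the ax object, ax.tick_params() can set the tick label size for both axes in one call. The data and font sizes are the same as above.

```python
import matplotlib.pyplot as plt
import numpy as np

t = np.arange(0.0, 2.0, 0.01)
s = 1 + np.sin(2 * np.pi * t)

fig, ax = plt.subplots()
ax.plot(t, s)

ax.set_title('Simple Plot', fontsize=20)
ax.set_xlabel('Time (s)', fontsize=16)
ax.set_ylabel('Wave Height (m)', fontsize=16)

# Set the tick label size on both the x and y axes of this axes object
ax.tick_params(axis='both', labelsize=12)
ax.grid()
```

This form is handy once a figure has more than one set of axes, because each call is tied to a specific ax.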

If you are plotting more than one dataset on the same axes, you should make it clear to the reader which is which by adding a legend with plt.legend() or ax.legend(). You can also differentiate the datasets by plotting them with different line or marker types. You may also want to adjust the limits of the x or y axis. This can be done with plt.xlim([xmin, xmax]) and plt.ylim([ymin, ymax]), or with ax.set(xlim=(xmin, xmax), ylim=(ymin, ymax)).

# declare a second variable to plot
s2 = 1 + np.sin( np.pi * t)

# We can use the same data again, we don't need to import matplotlib or declare t and s again
fig, ax = plt.subplots()
ax.plot(t, s, '-', label='Day 1')  # make the first line solid
ax.plot(t, s2, '--', label='Day 2')  # make the second line dashed

ax.set_title('Simple Plot', fontsize=20)
ax.set_xlabel('Time (s)', fontsize=16)
ax.set_ylabel('Wave Height (m)', fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
ax.set_xlim([0.0, 2.0])
ax.grid()
ax.legend()
plt.show()
../_images/W3_visualize_10_0.png
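The ax.set() form mentioned above can set both axis limits and the axis labels in one call. The sketch below repeats the two-line plot that way; the y-limits of -0.5 to 2.5 and the legend location are arbitrary choices made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

t = np.arange(0.0, 2.0, 0.01)
s = 1 + np.sin(2 * np.pi * t)
s2 = 1 + np.sin(np.pi * t)

fig, ax = plt.subplots()
ax.plot(t, s, '-', label='Day 1')    # solid line
ax.plot(t, s2, '--', label='Day 2')  # dashed line

# Set limits and labels together; the y-limits leave some room around the data
ax.set(xlim=(0.0, 2.0), ylim=(-0.5, 2.5),
       xlabel='Time (s)', ylabel='Wave Height (m)')

# loc chooses where the legend box sits inside the axes
ax.legend(loc='upper right', fontsize=12)
ax.grid()
```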

Other Types of Plots

matplotlib is capable of making many types of plots besides simple line plots. Below are examples of several, but not all, of the types of plots you can make. More complicated examples with more features can be found in the matplotlib gallery.


Histogram

We use histograms throughout this course to visualize how our data are distributed. They are easy to produce with the function plt.hist(). You can set the number of bins you plot. Setting density=True will make the histogram a probability density: each bin will display the bin’s raw count divided by the total number of counts and the bin width, so that the area under the histogram integrates to 1. density=False is the default, where the raw count in each bin is displayed.

N_points = 1000
n_bins = 20

# Generate a normal distribution, center at x=0
x = np.random.randn(N_points)

# make a figure object with axis
fig, ax = plt.subplots()

# We can set the number of bins of the histogram with the `bins` kwarg
ax.hist(x, bins=n_bins, density=False)

ax.set_xlabel('x-axis', fontsize=15)
ax.set_ylabel('y-axis', fontsize=15)
ax.set_title('Example Histogram', fontsize=18)

plt.show()
../_images/W3_visualize_12_0.png
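To check the statement that a density=True histogram integrates to 1, the sketch below uses the bar heights and bin edges that ax.hist() returns and sums height times width over all bins. The seeded random generator is an assumption added only to make the example repeatable.

```python
import matplotlib.pyplot as plt
import numpy as np

# Seeded generator so the example is repeatable
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)

fig, ax = plt.subplots()

# ax.hist returns the bar heights, the bin edges, and the bar patches
heights, edges, patches = ax.hist(x, bins=20, density=True)

ax.set_xlabel('x-axis', fontsize=15)
ax.set_ylabel('Probability density', fontsize=15)

# Area under the histogram: sum of (bar height * bin width)
area = np.sum(heights * np.diff(edges))
print(area)
```

The printed area should equal 1 to within floating-point rounding, whatever number of bins is used.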

Bar Plot

Bar plots are similar to histograms, but each bar shows the count for a category (like fruit in the example below) rather than for a range of values. plt.bar() takes a list of category names and a list of count values as input.

data = {'apple': 10, 'orange': 15, 'lemon': 5, 'lime': 20}
names = list(data.keys())
values = list(data.values())

fig, ax = plt.subplots()
ax.bar(names, values)

ax.set_xlabel('Fruit', fontsize=15)
ax.set_ylabel('Count', fontsize=15)
ax.set_title('Example Bar Plot', fontsize=18)

plt.show()
../_images/W3_visualize_14_0.png
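In practice the counts often have to be tallied from raw observations first. As a sketch, the cell below counts a made-up list of fruit observations with collections.Counter from the Python standard library and passes the result to ax.bar(). The observations list is hypothetical data invented for this example.

```python
from collections import Counter

import matplotlib.pyplot as plt

# Hypothetical raw observations, one entry per fruit seen
observations = ['lime', 'apple', 'lime', 'orange',
                'lemon', 'lime', 'orange', 'apple']

# Tally how many times each category appears
counts = Counter(observations)
names = list(counts.keys())
values = list(counts.values())

fig, ax = plt.subplots()
bars = ax.bar(names, values)

ax.set_xlabel('Fruit', fontsize=15)
ax.set_ylabel('Count', fontsize=15)
ax.set_title('Bar Plot from Raw Observations', fontsize=18)
```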

Scatter Plot

For some datasets, plotting marker points instead of a line will be preferred. The plt.scatter() function lets you set the size and color of the markers with the s= and c= arguments. Color and size can be chosen for aesthetic reasons, or set by other variables (besides the x and y values) to communicate more information. The markers can be made transparent with the alpha= argument.

N_points = 100

# Generate random normal distributions for the x and y variables, plus values for marker color and size
x = np.random.randn(N_points)
y = np.random.randn(N_points)
scatter_color=np.random.randn(N_points)
# marker sizes cannot be negative, so take the absolute value of the random draws
scatter_size=np.abs(np.random.randn(N_points))*500

fig, ax = plt.subplots()
ax.scatter(x, y, c=scatter_color, s=scatter_size, alpha=0.5)

ax.set_xlabel('x-axis', fontsize=15)
ax.set_ylabel('y-axis', fontsize=15)
ax.set_title('Example Scatter Plot', fontsize=18)

ax.grid(True)
plt.show()
../_images/W3_visualize_16_1.png
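When marker color encodes a variable, the reader needs a colorbar to decode it. The sketch below passes the object returned by ax.scatter() to fig.colorbar(). The variables color_var and size_var are hypothetical stand-ins for real measurements, and the viridis colormap is just one reasonable choice.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)  # seeded so the example is repeatable
n_points = 100

x = rng.standard_normal(n_points)
y = rng.standard_normal(n_points)
color_var = rng.standard_normal(n_points)  # hypothetical third variable
size_var = rng.uniform(10, 300, n_points)  # hypothetical fourth variable, always positive

fig, ax = plt.subplots()
sc = ax.scatter(x, y, c=color_var, s=size_var, alpha=0.5, cmap='viridis')

# The colorbar maps marker colors back to values of color_var
cbar = fig.colorbar(sc)
cbar.set_label('Third variable', fontsize=15)

ax.set_xlabel('x-axis', fontsize=15)
ax.set_ylabel('y-axis', fontsize=15)
ax.grid(True)
```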

Contour and Contourf

Contour plots (line or filled) are a method of showing a 3D surface in 2D by plotting constant z values (contours) on an (x, y) grid. Lines are drawn connecting (x, y) coordinates with the same z value. We’ll use contour and contourf for mapping.

contour

plt.contour plots contour lines. It takes three variables: an x-grid and y-grid, and the z value (think of it as the height of the surface).

delta = 0.025
x = np.arange(-3.0, 3.0, delta)
y = np.arange(-2.0, 2.0, delta)
X, Y = np.meshgrid(x, y)
Z1 = np.exp(-X**2 - Y**2)
Z2 = np.exp(-(X - 1)**2 - (Y - 1)**2)
Z = (Z1 - Z2) * 2

fig, ax = plt.subplots()
CS = ax.contour(X, Y, Z)
ax.clabel(CS, inline=True, fontsize=10)
ax.set_title('Simplest default with labels')
plt.show()
../_images/W3_visualize_18_0.png
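By default matplotlib picks the contour levels for you. As a sketch of taking control, the cell below passes an explicit list to the levels argument and draws every line in black with colors='k'. The specific level values are an assumption chosen to span the range of this Z.

```python
import matplotlib.pyplot as plt
import numpy as np

delta = 0.025
x = np.arange(-3.0, 3.0, delta)
y = np.arange(-2.0, 2.0, delta)
X, Y = np.meshgrid(x, y)
Z1 = np.exp(-X**2 - Y**2)
Z2 = np.exp(-(X - 1)**2 - (Y - 1)**2)
Z = (Z1 - Z2) * 2

# Contour values chosen by hand, evenly spaced around zero
levels = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]

fig, ax = plt.subplots()
CS = ax.contour(X, Y, Z, levels=levels, colors='k')
ax.clabel(CS, inline=True, fontsize=10)
ax.set_title('Contours at chosen levels')
```

With a single line color, matplotlib's default settings draw negative contours dashed, which helps distinguish the hill from the hollow without relying on color.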

contourf

plt.contourf draws filled contours. Like plt.contour, it takes three gridded inputs: x, y, and z. You can set the number of contour levels with the levels argument, e.g. levels=40.

delta = 0.025
x = np.arange(-3.0, 3.0, delta)
y = np.arange(-2.0, 2.0, delta)
X, Y = np.meshgrid(x, y)
Z1 = np.exp(-X**2 - Y**2)
Z2 = np.exp(-(X - 1)**2 - (Y - 1)**2)
Z = (Z1 - Z2) * 2

fig, ax = plt.subplots()
CS = ax.contourf(X, Y, Z, levels=12)

# Make a colorbar for the ContourSet returned by the contourf call.
cbar = fig.colorbar(CS)
cbar.ax.set_ylabel('Colorbar label')
ax.set_title('Filled contours with colorbar')
plt.show()
../_images/W3_visualize_20_0.png
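The levels argument also accepts an array of level boundaries instead of a count. For data with both positive and negative values, one option is to use levels symmetric about zero together with a diverging colormap, so that zero falls in the middle of the color scale. The sketch below assumes the RdBu_r colormap and 21 boundaries between -2 and 2; both are illustrative choices.

```python
import matplotlib.pyplot as plt
import numpy as np

delta = 0.025
x = np.arange(-3.0, 3.0, delta)
y = np.arange(-2.0, 2.0, delta)
X, Y = np.meshgrid(x, y)
Z1 = np.exp(-X**2 - Y**2)
Z2 = np.exp(-(X - 1)**2 - (Y - 1)**2)
Z = (Z1 - Z2) * 2

# 21 boundaries give 20 filled bands, symmetric about zero
levels = np.linspace(-2, 2, 21)

fig, ax = plt.subplots()
CF = ax.contourf(X, Y, Z, levels=levels, cmap='RdBu_r')

cbar = fig.colorbar(CF)
cbar.ax.set_ylabel('z value')
ax.set_title('Filled contours, symmetric levels')
```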

Representing Uncertainty/Error

Observations always have an associated uncertainty. Communicating these uncertainties is an important task for scientists.

Errorbars

Error bars represent the scatter or uncertainty in the data. plt.errorbar can be used to add vertical or horizontal error bars with the arguments yerr or xerr, respectively. Each of these arguments should be set to an array that has the same length as the x and y arrays you are plotting and contains the corresponding error values.

# example data
x = np.arange(0.1, 4, 0.5)
y = np.exp(-x)

# example error bar values that vary with x-position
error = 0.1 + 0.2 * x

fig, ax = plt.subplots()
ax.errorbar(x, y, yerr=error, fmt='-o')
ax.set_title('Example Line Plot with Errorbars', fontsize=18)
ax.set_xlabel('x-axis', fontsize=15)
ax.set_ylabel('y-axis', fontsize=15)
ax.grid(True)
plt.show()
../_images/W3_visualize_22_0.png
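Uncertainties are not always symmetric. As a sketch, yerr can also be given as an array of shape (2, N), where the first row holds the lower errors and the second row the upper errors. The error values below are made up for illustration, and capsize adds small caps to the ends of the bars.

```python
import matplotlib.pyplot as plt
import numpy as np

# example data
x = np.arange(0.1, 4, 0.5)
y = np.exp(-x)

# Made-up asymmetric errors: smaller below each point than above it
lower = 0.05 + 0.1 * x
upper = 0.1 + 0.2 * x
yerr = np.array([lower, upper])  # shape (2, N): row 0 = lower, row 1 = upper

fig, ax = plt.subplots()
container = ax.errorbar(x, y, yerr=yerr, fmt='o', capsize=4)

ax.set_title('Asymmetric Errorbars', fontsize=18)
ax.set_xlabel('x-axis', fontsize=15)
ax.set_ylabel('y-axis', fontsize=15)
ax.grid(True)
```

Both rows hold positive distances from the data point, not the absolute positions of the bar ends.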

Box and whisker

Box plots also communicate the distribution of the data, but they give more information than a single marker with error bars (median, maximum, minimum, quartiles). The box spans the lower and upper quartiles (25%-75%) of the data. The whiskers extend to the minimum and maximum values, excluding outliers (plotted as points). By default, matplotlib treats any point more than 1.5 times the interquartile range beyond the box as an outlier. The line through the box is the sample median.

# Random test data
all_data = [np.random.normal(0, std, size=100) for std in range(1, 4)]
labels = ['x1', 'x2', 'x3']

fig, ax = plt.subplots()

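# rectangular box plot
bplot1 = ax.boxplot(all_data,
                    vert=True,  # vertical box alignment
                    patch_artist=True)  # fill with color

# label the x-ticks; this avoids the labels= keyword of boxplot, which newer matplotlib versions rename to tick_labels=
ax.set_xticklabels(labels)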


ax.grid(True)
ax.set_title('Rectangular box plot', fontsize=18)
ax.set_xlabel('Three separate samples', fontsize=15)
ax.set_ylabel('Observed values', fontsize=15)

plt.show()
../_images/W3_visualize_24_0.png
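To connect the drawn box to the numbers behind it, the sketch below computes the quartiles and whisker ends with numpy and compares them to the lines that ax.boxplot() returns. It assumes matplotlib's default whisker setting (1.5 times the interquartile range) and uses a seeded random sample invented for the example.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)  # seeded so the example is repeatable
data = rng.normal(0, 1, size=100)

# Summary statistics computed by hand
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme points within 1.5 * IQR of the box
inside = data[(data >= q1 - 1.5 * iqr) & (data <= q3 + 1.5 * iqr)]
whisker_low, whisker_high = inside.min(), inside.max()

fig, ax = plt.subplots()
bplot = ax.boxplot(data)

# boxplot returns a dict of the line objects it drew
drawn_median = bplot['medians'][0].get_ydata()[0]
drawn_whisker_low = bplot['whiskers'][0].get_ydata()[1]
drawn_whisker_high = bplot['whiskers'][1].get_ydata()[1]

print(median, drawn_median)
print(whisker_low, drawn_whisker_low)
print(whisker_high, drawn_whisker_high)
```

Each printed pair should agree, which confirms what the box, median line, and whiskers represent.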

Violin

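Violin plots are similar to box plots, but display even more information. Rather than only summary statistics (e.g. median, quartiles), they show a smoothed estimate of the full distribution of the data: the width of the violin at a given value indicates how much of the data lies near that value.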

# generate some random test data zero mean
all_data = [np.random.normal(0, std, 100) for std in range(6, 10)]

fig, ax = plt.subplots()


# plot violin plot
ax.violinplot(all_data,
              showmeans=False,
              showmedians=True)
ax.set_title('Example Violin plot', fontsize=18)
ax.set_xlabel('x-axis', fontsize=15)
ax.set_ylabel('y-axis', fontsize=15)
ax.grid(True)
plt.show()
../_images/W3_visualize_26_0.png
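One case where the extra information matters is a distribution with two peaks. The sketch below draws a box plot and a violin plot of the same two samples side by side; the bimodal sample is synthetic data constructed for this example by combining two normal distributions.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)  # seeded so the example is repeatable

# One sample with a single peak and one with two well-separated peaks
unimodal = rng.normal(0, 2, 200)
bimodal = np.concatenate([rng.normal(-2, 0.5, 100),
                          rng.normal(2, 0.5, 100)])

fig, (ax_box, ax_violin) = plt.subplots(1, 2, sharey=True, figsize=(8, 4))

ax_box.boxplot([unimodal, bimodal])
ax_box.set_title('Box plot', fontsize=18)
ax_box.set_ylabel('Observed values', fontsize=15)

parts = ax_violin.violinplot([unimodal, bimodal], showmedians=True)
ax_violin.set_title('Violin plot', fontsize=18)

for ax in (ax_box, ax_violin):
    ax.set_xlabel('Sample', fontsize=15)
    ax.grid(True)
```

The box for the second sample gives no hint that the data cluster in two groups, while the violin shows the two bulges directly.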