Data Visualization with ggplot2

2021-01-28 · 6 min read

In this post I share some ggplot2 commands that I use frequently when
making data visualizations.

To make the examples easy to replicate, I use the built-in iris and Titanic datasets for illustration. iris contains numeric variables plus one categorical variable (Species), and Titanic is a table of passenger counts cross-classified by four categorical variables.

Load the data and package

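library(tidyverse)  # attaches ggplot2, dplyr, tibble and other packages
data(iris)
data("Titanic")
df = iris
# Titanic is a table of counts; as_tibble() gives one row per combination
# (as_data_frame() is deprecated in favor of as_tibble())
df_titanic = as_tibble(Titanic)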

# take a look at iris
head(df)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

head(df_titanic)

## # A tibble: 6 x 5
##   Class Sex    Age   Survived     n
##   <chr> <chr>  <chr> <chr>    <dbl>
## 1 1st   Male   Child No           0
## 2 2nd   Male   Child No           0
## 3 3rd   Male   Child No          35
## 4 Crew  Male   Child No           0
## 5 1st   Female Child No           0
## 6 2nd   Female Child No           0

Scatter plots

We can use ggplot() to create the coordinate system and
geom_point() to add a layer of points to the graph.

ggplot(data = df) +
  geom_point(mapping = aes(x = Sepal.Length, y = Sepal.Width))
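
The same plot can be written more compactly. As a minimal sketch (it uses iris directly so that it runs on its own, and the names p_long and p_short are only for illustration), the aesthetic mapping can go into ggplot() itself, where every later layer inherits it, and the argument names data, mapping, x and y can be dropped:

```r
library(ggplot2)

# long form, as above
p_long <- ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Length, y = Sepal.Width))

# short form: positional arguments, mapping set once in ggplot()
p_short <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point()

p_short
```

Both objects draw the same points. Putting aes() inside ggplot() is convenient once a plot has several layers that share the same x and y.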

We can use size to change the size of the points, and map color and shape
to a variable to make the scatter plot more informative.

# change the size
ggplot(data = df) +
  geom_point(mapping = aes(x = Sepal.Length, y = Sepal.Width), size = 1)

# add color and shape
ggplot(data = df) +
  geom_point(mapping = aes(x = Sepal.Length, y = Sepal.Width, shape = Species, color = Species))
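
Note where the arguments sit in the two plots above: size = 1 is outside aes(), while shape and color are inside it. The sketch below illustrates the difference under that reading (the names p_set and p_map are only for illustration). A value outside aes() is a constant applied to every point; an argument inside aes() is mapped to a column of the data and gets a legend.

```r
library(ggplot2)

# setting: a constant outside aes(), so every point is blue and no legend is drawn
p_set <- ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Length, y = Sepal.Width), color = "blue")

# mapping: inside aes(), so the color follows the Species column
p_map <- ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Length, y = Sepal.Width, color = Species))

p_set
p_map
```

A common slip is to write aes(color = "blue"), which treats "blue" as a data value with a single level rather than as a color name.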

Sometimes we are only interested in a certain range of the data.
This is easy to do with the filter() function from dplyr, which is
loaded as part of the tidyverse.

# visualize only part of the data
ggplot(data = filter(df, Sepal.Length > 5)) +
  geom_point(mapping = aes(x = Sepal.Length, y = Sepal.Width), size = 1)
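
The same filtering step can also be written as a pipeline, which reads top to bottom once there are several data preparation steps. This is only a sketch of an alternative style; it uses iris directly, and the name p_filtered is for illustration:

```r
library(ggplot2)
library(dplyr)

p_filtered <- iris %>%
  filter(Sepal.Length > 5) %>%   # keep only flowers with sepals longer than 5
  ggplot() +                     # the filtered data arrives as the first argument
  geom_point(aes(x = Sepal.Length, y = Sepal.Width), size = 1)

p_filtered
```

Note that the data steps are joined with %>% while the plot layers are still joined with +.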

If we want one panel per category, we simply add the facet_wrap()
function.

# one panel per category
ggplot(data = df) +
  geom_point(mapping = aes(x = Sepal.Length, y = Sepal.Width, color = Species), size = 1) +
  facet_wrap(~Species)
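
facet_wrap() also takes arguments that control the panel layout. As a small sketch (the choice of one column and free y-axes is only an example, and p_facets is an illustrative name), ncol sets the number of columns and scales decides whether the panels share their axes:

```r
library(ggplot2)

p_facets <- ggplot(data = iris) +
  geom_point(mapping = aes(x = Sepal.Length, y = Sepal.Width, color = Species), size = 1) +
  # stack the three panels vertically and let each panel pick its own y range
  facet_wrap(~Species, ncol = 1, scales = "free_y")

p_facets
```

Shared axes (the default) make panels directly comparable, while free axes make the pattern within each panel easier to see.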

Bar charts

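Here we start to use the Titanic dataset. Note that df_titanic holds one
row per combination of Class, Sex, Age and Survived, with the number of
passengers in column n. geom_bar() counts rows by default, so we pass
weight = n to make the bars show passenger counts. A basic bar chart can be made
like this:

ggplot(df_titanic) + geom_bar(aes(x = Age, weight = n))

Adding fill creates a stacked bar chart:

ggplot(df_titanic) + geom_bar(aes(x = Age, fill = Survived, weight = n))

If we do not want the bars to be stacked:

ggplot(df_titanic) + geom_bar(aes(x = Age, fill = Survived, weight = n), position = "dodge")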
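
A third option is position = "fill", which scales every bar to the same height so that the segments show proportions instead of counts. The sketch below rebuilds the table so that it runs on its own, and it weights each row by its count column n so that the bars reflect passengers rather than table rows (p_fill is an illustrative name):

```r
library(ggplot2)

# one row per Class/Sex/Age/Survived combination, with the count in column n
titanic_counts <- tibble::as_tibble(Titanic)

p_fill <- ggplot(titanic_counts) +
  # weight = n sums the passenger counts; position = "fill" rescales each bar to 1
  geom_bar(aes(x = Age, fill = Survived, weight = n), position = "fill") +
  labs(y = "Proportion of passengers")

p_fill
```

This view is useful when the groups differ a lot in size, because the survival share of a small group stays readable.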

Customization: Titles and labels

A formal plot needs a proper title, axis labels, a legend title,
and so on. These can be set with labs():

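# weight = n makes the bars count passengers, since each row of
# df_titanic is a combination of categories with its count in n
bar_titanic = ggplot(df_titanic) +
                geom_bar(aes(x = Age, fill = Survived, weight = n))

bar_titanic +
  labs(title = "Survival by age group",
       subtitle = "Passenger counts for adults and children",
       caption = "Source: R built-in dataset",
       x = "Passenger age",
       y = "Number of passengers",
       fill = "Survived or not")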

Customization: Scales

Scales map values in the data space to values in the “aesthetic space”
(colors, sizes, positions and so on). They let us control how each
aesthetic is displayed, including its legend or axis labels.

bar_titanic + scale_fill_discrete(labels = c("Did not survive", "Survived"))

Note that in this example fill = Survived is a discrete variable,
which is why we use scale_fill_discrete(). For continuous data we would use
scale_fill_continuous(), and for dates
scale_fill_date().
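
The same naming pattern applies to other aesthetics. As a hedged sketch of the continuous case, the scatter plot below maps color (rather than fill, since default points have no fill) to the numeric column Petal.Length and uses scale_color_continuous() to rename the legend. The legend title and the name p_cont are choices made for this example:

```r
library(ggplot2)

p_cont <- ggplot(data = iris) +
  # Petal.Length is numeric, so ggplot2 draws a color gradient instead of distinct colors
  geom_point(mapping = aes(x = Sepal.Length, y = Sepal.Width, color = Petal.Length)) +
  scale_color_continuous(name = "Petal length")

p_cont
```

Using a discrete scale on a continuous variable, or the other way round, raises an error, so the scale has to match the type of the mapped column.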

The colors can be changed manually:

bar_titanic + scale_fill_manual(labels = c("Did not survive", "Survived"),
                                values = c("grey", "blue"))

Keep in mind that plots often need a color-blind-friendly color palette.
So instead of passing arbitrary color names to values, we can use a
pre-set palette:

# The palette with grey:
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

bar_titanic + scale_fill_manual(labels = c("Did not survive", "Survived"),
                                values = c(cbPalette[1], cbPalette[2]))
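
ggplot2 also ships with the viridis color scales, which are designed to stay distinguishable under common forms of color blindness, so no hand-written palette is needed. The sketch below is one possible alternative to the manual palette; it rebuilds the bar chart so that it runs on its own, and p_viridis is an illustrative name:

```r
library(ggplot2)

# one row per Class/Sex/Age/Survived combination, with the count in column n
titanic_counts <- tibble::as_tibble(Titanic)

p_viridis <- ggplot(titanic_counts) +
  geom_bar(aes(x = Age, fill = Survived, weight = n)) +
  # the _d suffix is the discrete variant; _c is the continuous one
  scale_fill_viridis_d(labels = c("Did not survive", "Survived"))

p_viridis
```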

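Scales also control the axes. For example, the breaks argument of
scale_y_continuous() sets where the tick marks are drawn:

# only a few tick marks on the y-axis
ggplot(data = df) +
  geom_point(mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
  scale_y_continuous(breaks = c(2, 3, 4), labels = c("2", "3", "4")) +
  labs(title = "Sparser y-axis breaks")

# a tick mark every 0.1 units
ggplot(data = df) +
  geom_point(mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
  scale_y_continuous(breaks = seq(0, 5, 0.1)) +
  labs(title = "Denser y-axis breaks")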

Other resources for reference

ggplot2 is a powerful package and it is constantly evolving. Here are
some useful resources online:

http://www.cookbook-r.com/Graphs/
shows lots of graphing basics.

https://exts.ggplot2.tidyverse.org/gallery/
gives some fancy extensions.

Finally, in the real world it usually takes more time to clean the data
than to visualize it, so it is important to learn how to clean and
impute a dataset as well.
And the best way to master data visualization is to learn what we need
as we run into problems.

Thank you for reading!