I am a big fan of the tidyverse set of libraries, especially ggplot2.

While there is a raging debate about whether to teach R to beginners with base R or the tidyverse, I choose the tidyverse for its connected ecosystem of libraries, which makes working with R far more convenient.

In the post below I show how one can use ggplot2 to visualise the distribution of various track features from the Spotify dataset. I use RStudio for this exercise and recommend you do the same.


#   _    _      _                          
#  | |  | |    | |                         
#  | |  | | ___| | ___ ___  _ __ ___   ___ 
#  | |/\| |/ _ \ |/ __/ _ \| '_ ` _ \ / _ \
#  \  /\  /  __/ | (_| (_) | | | | | |  __/
#   \/  \/ \___|_|\___\___/|_| |_| |_|\___|
#                                          
#                                          

# Load Libraries
library(readr)
library(ggplot2)
library(dplyr)
library(gridExtra)
library(stringr)



# Dataset is the Spotify tracks data available at https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks?select=data_o.csv

df <- read_csv("data_o.csv", col_names=TRUE)


#   _   _ _                 _ _          _   _             
#  | | | (_)               | (_)        | | (_)            
#  | | | |_ ___ _   _  __ _| |_ ______ _| |_ _  ___  _ __  
#  | | | | / __| | | |/ _` | | |_  / _` | __| |/ _ \| '_ \ 
#  \ \_/ / \__ \ |_| | (_| | | |/ / (_| | |_| | (_) | | | |
#   \___/|_|___/\__,_|\__,_|_|_/___\__,_|\__|_|\___/|_| |_|
#                                                          
#                                                          


# Distribution of danceability
ggplot(df, aes(danceability)) +
  geom_histogram() +
  ggtitle("Distribution of danceability")


# Combining plots

p1 <- ggplot(df, aes(danceability)) +
  geom_histogram() +
  ggtitle("Distribution of danceability")

p2 <- ggplot(df, aes(acousticness)) +
  geom_histogram() +
  ggtitle("Distribution of acousticness")

p3 <- ggplot(df, aes(liveness)) +
  geom_histogram() +
  ggtitle("Distribution of liveness")

p4 <- ggplot(df, aes(speechiness)) +
  geom_histogram() +
  ggtitle("Distribution of speechiness")

gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2)

I chose four variables from the dataset to plot.

The above approach to tiling the plots works, but it is cumbersome because each plot has to be created individually. What if we could loop through the variables and create the plots? There are a couple of ways to do this.

# Assign the columns of interest to a variable
target_variables <- c("danceability", "acousticness", "liveness", "speechiness")

# Loop through each column name
for (each_variable in target_variables) {
  # Create a variable name to which the plot will be assigned
  plot_var_name <- str_c(c("ggplot", each_variable), collapse = "_")
  print(plot_var_name)
  # Compute the mean value of the variable 
  mean_val <- round(mean(df[, each_variable][[1]]), 3)
  print(mean_val)
  # Construct the plot
  temp_plot <- ggplot(df, aes_string(each_variable)) + # NOTE - aes_string rather than aes
    geom_histogram(binwidth = 0.05) +
    ggtitle(str_c("Distribution of ",each_variable)) +  # Title of the plot
    geom_vline(xintercept = mean_val, color = "blue", lty = "dashed")
  # Assign the plot to plot name
  assign(plot_var_name, temp_plot)
}

gridExtra::grid.arrange(ggplot_danceability, ggplot_acousticness, ggplot_liveness, ggplot_speechiness, ncol = 2)

There is a lot going on in the above snippet.

We create a variable name for each target column. Once the plot is created, it is bound to that name using assign. We do this to capture each plot in its own variable so we can reference it later.

Typically, to assign a value to a variable in R we write foo <- 42, which assigns the value 42 to the variable foo. But in the snippet above we are looping through the column names and constructing a plot each time, so the name of the variable is only available as a string while the loop runs. To save each plot as we construct it, we use assign to bind the plot object to the name held in that string.
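To make the difference between <- and assign concrete, here is a minimal base R sketch. The names demo_name, demo_value and demo_other are hypothetical and only serve the illustration; they are not part of the Spotify script.

```r
# Hypothetical names, for illustration only
demo_name <- paste("demo", "value", sep = "_")  # builds the string "demo_value"

# Bind the value 42 to the variable whose name is stored in demo_name.
# This has the same effect as writing: demo_value <- 42
assign(demo_name, 42)
print(demo_value)

# get() looks an object up again by its name given as a string
print(get(demo_name))

# mget() fetches several objects by name and returns them as a named list
assign("demo_other", "hello")
demo_objects <- mget(c("demo_value", "demo_other"))
print(names(demo_objects))
```

The loop above relies on the same mechanism, with the plot object in place of 42 and a name such as ggplot_danceability in place of demo_value.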

Once the loop completes, we use grid.arrange to plot.

This still feels like more work than necessary, because you have to explicitly list the variable name of each plot. Can we do better?

You can use lapply and do away with creating variables the way we did in the for loop.

# Use lapply
my_plots_list <- lapply(target_variables, function(each_variable) {
  # Compute the mean inside the function so each plot gets its own value
  mean_val <- round(mean(df[[each_variable]]), 3)
  ggplot(df, aes_string(each_variable)) + # NOTE - aes_string rather than aes
    geom_histogram(binwidth = 0.05) +
    ggtitle(str_c("Distribution of ", each_variable)) +  # Title of the plot
    geom_vline(xintercept = mean_val, color = "blue", lty = "dashed")
})

gridExtra::grid.arrange(grobs = my_plots_list, ncol = 2)

In the snippet above, you apply a function to each column name that constructs the plot. Each plot object is returned once constructed, so my_plots_list ends up as a list containing the four plot objects.

You then pass the list of plot objects, my_plots_list, to grid.arrange through its grobs argument, and your grid of variable distributions is rendered.
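If you want to try the lapply pattern without downloading the CSV, the sketch below runs on simulated data. The data frame toy_df and its columns feature_a and feature_b are made up for this example and are not part of the Spotify dataset. It also maps the column with aes(.data[[each_variable]]), which, as far as I know, is the approach recent ggplot2 releases suggest in place of the deprecated aes_string; treat that as an assumption to check against your installed version.

```r
library(ggplot2)
library(gridExtra)

# Simulated data, for illustration only (not the Spotify dataset)
set.seed(1)
toy_df <- data.frame(
  feature_a = runif(200),       # uniform values between 0 and 1
  feature_b = rbeta(200, 2, 5)  # skewed values between 0 and 1
)
toy_variables <- c("feature_a", "feature_b")

# Build one histogram per column and collect the plots in a list
toy_plots <- lapply(toy_variables, function(each_variable) {
  # The mean is computed per column, inside the function
  mean_val <- round(mean(toy_df[[each_variable]]), 3)
  ggplot(toy_df, aes(.data[[each_variable]])) +
    geom_histogram(binwidth = 0.05) +
    ggtitle(paste("Distribution of", each_variable)) +
    geom_vline(xintercept = mean_val, color = "blue", lty = "dashed")
})

# Render the list of plots as a grid with two columns
grid.arrange(grobs = toy_plots, ncol = 2)
```

Because lapply calls the function once per column name, each plot keeps its own each_variable and mean_val, and nothing leaks into the global environment.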