In the past two posts within our Pandas series we analyzed data of Chipotle restaurant, and Flipkart online store. Today, we're going to look at Spotify dataset from a perspective of a recording studio start-up.

Imagine wanting to get started in the music industry. You believe you have a skill for spotting great talent and promoting them to become stars. Yet, apart from your ingenious ability, you need to be familiar with facts.

You want to know:

  1. What tracks are most popular amongst Spotify users.
  2. How many tracks gained popularity over 90 out of 100.
  3. Which tracks released in March 2020 gained popularity over 80.
  4. Is there a correlation between popularity and a track's traits?
  5. How long should an average track last according to today's standards?
  6. What is the correlation between tracks' individual features?
  7. Who is currently most popular and what genres do they represent?

Data to download: https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks

For the detailed explanation of the dataset: https://developer.spotify.com/documentation/web-api/reference/#category-tracks

Pre-processing the data

Today we'll use just a few libraries. Let's import them first.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Spotify dataset  is quite huge and there are several files containing slightly different data. Today we'll use tracks and artists dataset. We'll start with the first one.

# Loading the datset
df_tracks = pd.read_csv('/content/drive/MyDrive/tracks.csv')
df_tracks

You'll see that this dataset consists of 586672 rows and 20 columns. To be sure if we can trust this dataset, it's important to check if any values are missing.

pd.isnull(df_tracks).sum().sum()
71

Pandas pd.isnull() returns a dataset with booleans True and False saying if the value is missing. Calling sum() twice on this gives us a total number of all the missing values in the dataset. If we only used it once, we'd get a sum of missing values for each column.

There are just 71 missing values in a dataset consisting of over 500k rows. It means it not bad, and our conclusions will be valid. Such a dataset is a Pythonista's delight!

To make things easier, we should change the release_date to a date type and then put months and years into separate columns. We did a similar operation last time, so you might already be familiar with this method. Let's practice.

year = df_tracks['release_date'].apply(lambda x : x.year)
month = df_tracks['release_date'].apply(lambda x : x.month)
df_tracks.insert(loc=8, column='year', value=year) 
df_tracks.insert(loc=9, column='month', value=month)

This time instead of assigning a new column by defining df_tracks['year'] we used the function insert(). This method allows us to choose the exact position of the new column ( loc ). If we had done it the old-fashioned way, the new columns would have ended up at the last possible index.

Exploring the dataset

The dataset we're using right now consists of 20 columns. We'll find here information such as name, popularity, duration, explicit, artist, release_date, and tracks' traits like danceability, speechiness, loudness etc.

Popularity is measured on the scale between 0 and 100, where 100 is the best. Given our knowledge of the music industry, let's check if what we feel is actually true. What are the most popular songs right now.

To check this, let's use a great pandas function query(). This is a filtering function that enables us passing a specific condition a value must meet to be found. The usage presented below is rather simple, but we'll go to more complicated examples, since it a very powerful function for analyzing data.

most_popular = df_tracks.query('popularity>90', inplace=False).sort_values('popularity', ascending=False)
most_popular[:10]

Notice that the whole function's expression is passed in quotation marks. It's very important!

id name popularity duration_ms explicit artists id_artists release_date danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo time_signature year month
93802 4iJyoBOLtHqaGxP12qzhQI Peaches (feat. Daniel Caesar & Giveon) 100 198082 1 ['Justin Bieber', 'Daniel Caesar', 'Giveon'] ['1uNFoZAHBGtllmzznpCI3s', '20wkVLutqVOYrc0kxF... 2021-03-19 0.677 0.696 0 -6.181 1 0.1190 0.32100 0.000000 0.4200 0.464 90.030 4 2021 3
93803 7lPN2DXiMsVn7XUKtOW1CS drivers license 99 242014 1 ['Olivia Rodrigo'] ['1McMsnEElThX1knmY4oliG'] 2021-01-08 0.585 0.436 10 -8.761 1 0.0601 0.72100 0.000013 0.1050 0.132 143.874 4 2021 1
93804 3Ofmpyhv5UAQ70mENzB277 Astronaut In The Ocean 98 132780 0 ['Masked Wolf'] ['1uU7g3DNSbsu0QjSEqZtEd'] 2021-01-06 0.778 0.695 4 -6.865 0 0.0913 0.17500 0.000000 0.1500 0.472 149.996 4 2021 1
92810 5QO79kh1waicV47BqGRL3g Save Your Tears 97 215627 1 ['The Weeknd'] ['1Xyo4u8uXC1ZmMpatF05PJ'] 2020-03-20 0.680 0.826 0 -5.487 1 0.0309 0.02120 0.000012 0.5430 0.644 118.051 4 2020 3
92811 6tDDoYIxWvMLTdKpjFkc1B telepatía 97 160191 0 ['Kali Uchis'] ['1U1el3k54VvEUzo3ybLPlM'] 2020-12-04 0.653 0.524 11 -9.016 0 0.0502 0.11200 0.000000 0.2030 0.553 83.970 4 2020 12
92813 0VjIjW4GlUZAMYd2vXMi3b Blinding Lights 96 200040 0 ['The Weeknd'] ['1Xyo4u8uXC1ZmMpatF05PJ'] 2020-03-20 0.514 0.730 1 -5.934 1 0.0598 0.00146 0.000095 0.0897 0.334 171.005 4 2020 3
93805 7MAibcTli4IisCtbHKrGMh Leave The Door Open 96 242096 0 ['Bruno Mars', 'Anderson .Paak', 'Silk Sonic'] ['0du5cEVh5yTK9QJze8zA0C', '3jK9MiCrA42lLAdMGU... 2021-03-05 0.586 0.616 5 -7.964 1 0.0324 0.18200 0.000000 0.0927 0.719 148.088 4 2021 3
92814 6f3Slt0GbA2bPZlz0aIFXN The Business 95 164000 0 ['Tiësto'] ['2o5jDhtHVPhrJdv3cEQ99Z'] 2020-09-16 0.798 0.620 8 -7.079 0 0.2320 0.41400 0.019200 0.1120 0.235 120.031 4 2020 9
91866 60ynsPSSKe6O3sfwRnIBRf Streets 94 226987 1 ['Doja Cat'] ['5cj0lLjcoR7YOSnhnX0Po5'] 2019-11-07 0.749 0.463 11 -8.433 1 0.0828 0.20800 0.037100 0.3370 0.190 90.028 4 2019 11
92816 3FAJ6O0NOHQV8Mc5Ri6ENp Heartbreak Anniversary 94 198371 0 ['Giveon'] ['4fxd5Ee7UefO4CUXgwJ7IP'] 2020-03-27 0.449 0.465 0 -8.964 1 0.0791 0.52400 0.000001 0.3030 0.543 89.087 3 2020 3

At the first sight we can see that the first 10 most popular songs were released either in 2020 or 2021 and that almost half of them contains some explicit content indicated by the binary 1 in the explicit column.

To see if our conclusions are right, let's sort the filtered values and show the columns of interest.

pop_date = most_popular.sort_values('release_date', ascending=False)
pop_date[['name', 'popularity', 'explicit','release_date']][:20]
name popularity explicit release_date
93802 Peaches (feat. Daniel Caesar & Giveon) 100 1 2021-03-19
93805 Leave The Door Open 96 0 2021-03-05
93815 What’s Next 91 1 2021-03-05
93811 Hold On 92 0 2021-03-05
93816 We're Good 91 0 2021-02-11
93813 911 91 1 2021-02-05
93809 Up 92 1 2021-02-05
93806 Fiel 94 0 2021-02-04
93808 Ella No Es Tuya - Remix 92 0 2021-02-03
93812 Wellerman - Sea Shanty / 220 KID x Billen Ted ... 92 0 2021-01-21
93810 Goosebumps - Remix 92 1 2021-01-15
93814 Your Love (9PM) 91 0 2021-01-15
93807 Friday (feat. Mufasa & Hypeman) - Dopamine Re-... 94 0 2021-01-15
93803 drivers license 99 1 2021-01-08
93804 Astronaut In The Ocean 98 0 2021-01-06
92823 Good Days 93 1 2020-12-25
92819 Bandido 94 0 2020-12-10
92811 telepatía 97 0 2020-12-04
92821 LA NOCHE DE ANOCHE 93 0 2020-11-27
92830 Dynamite 91 0 2020-11-20

We know which songs are the most popular in general, but as a good producer you need to understand human emotions and how they shape the market. In times of crisis both artists and the audience have different tastes. Last year in March, the world was terrorized by fear of a mysterious virus. I wonder what were the songs released back then, sorted by the popularity.

We're going to use query() again. This one is more complex than the previous since we are defining more conditions. First, the song's popularity must reach 80 or more, then it needs to be released in March 2020. Remember the quotation marks!

most_popular_march_20 = df_tracks.query('(popularity > 80) and (year in ["2020"]) and (month in ["3"])')
most_popular_march_20
id name popularity duration_ms explicit artists id_artists release_date year month danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo time_signature
92810 5QO79kh1waicV47BqGRL3g Save Your Tears 97 215627 1 ['The Weeknd'] ['1Xyo4u8uXC1ZmMpatF05PJ'] 2020-03-20 2020 3 0.680 0.826 0 -5.487 1 0.0309 0.02120 0.000012 0.5430 0.644 118.051 4
92813 0VjIjW4GlUZAMYd2vXMi3b Blinding Lights 96 200040 0 ['The Weeknd'] ['1Xyo4u8uXC1ZmMpatF05PJ'] 2020-03-20 2020 3 0.514 0.730 1 -5.934 1 0.0598 0.00146 0.000095 0.0897 0.334 171.005 4
92816 3FAJ6O0NOHQV8Mc5Ri6ENp Heartbreak Anniversary 94 198371 0 ['Giveon'] ['4fxd5Ee7UefO4CUXgwJ7IP'] 2020-03-27 2020 3 0.449 0.465 0 -8.964 1 0.0791 0.52400 0.000001 0.3030 0.543 89.087 3
92853 4xqrdfXkTW4T0RauPLv3WA Heather 89 198040 0 ['Conan Gray'] ['4Uc8Dsxct0oMqx0P6i60ea'] 2020-03-20 2020 3 0.357 0.425 5 -7.301 1 0.0333 0.58400 0.000000 0.3220 0.270 102.078 3
92867 5nujrmhLynf4yMoMtj8AQF Levitating (feat. DaBaby) 89 203064 0 ['Dua Lipa', 'DaBaby'] ['6M2wZ9GZgrQXHCFfjv46we', '4r63FhuTkUYltbVAg5... 2020-03-27 2020 3 0.702 0.825 6 -3.787 0 0.0601 0.00883 0.000000 0.0674 0.915 102.977 4
92927 7szuecWAPwGoV1e5vGu8tl In Your Eyes 86 237520 1 ['The Weeknd'] ['1Xyo4u8uXC1ZmMpatF05PJ'] 2020-03-20 2020 3 0.667 0.719 7 -5.371 0 0.0346 0.00285 0.000081 0.0736 0.717 100.021 4
92951 6KfoDhO4XUWSbnyKjNp9c4 Maniac 86 185773 0 ['Conan Gray'] ['4Uc8Dsxct0oMqx0P6i60ea'] 2020-03-20 2020 3 0.628 0.639 8 -5.460 1 0.0435 0.00162 0.000000 0.3540 0.493 108.045 4
92961 3PfIrDoz19wz7qK7tYeu62 Don't Start Now 85 183290 0 ['Dua Lipa'] ['6M2wZ9GZgrQXHCFfjv46we'] 2020-03-27 2020 3 0.793 0.793 11 -4.521 0 0.0830 0.01230 0.000000 0.0951 0.679 123.950 4
92995 5m5aY6S9ttfIG157xli2Rs Alô Ambev (Segue Sua Vida) - Ao Vivo 84 169593 0 ['Zé Neto & Cristiano'] ['487N2T9nIPEHrlTZLL3SQs'] 2020-03-26 2020 3 0.695 0.872 9 -3.650 1 0.0868 0.33400 0.000000 0.9540 0.646 121.843 4
93021 527k23H0A4Q0UJN3vGs0Da After Party 84 167916 1 ['Don Toliver'] ['4Gso3d4CscCijv0lmajZWs'] 2020-03-13 2020 3 0.629 0.692 5 -8.045 1 0.0376 0.00981 0.331000 0.6030 0.453 162.948 4
93025 017PF4Q3l4DBUiWoXk4OWT Break My Heart 84 221820 0 ['Dua Lipa'] ['6M2wZ9GZgrQXHCFfjv46we'] 2020-03-27 2020 3 0.730 0.729 4 -3.434 0 0.0883 0.16700 0.000001 0.3490 0.467 113.013 4
93071 1jaTQ3nqY3oAAYyCTbIvnM WHATS POPPIN 83 139741 1 ['Jack Harlow'] ['2LIk90788K0zvyj2JJVwkJ'] 2020-03-13 2020 3 0.923 0.604 11 -6.671 0 0.2450 0.01700 0.000000 0.2720 0.826 145.062 4
93079 7AzlLxHn24DxjgQX73F9fU No Idea 83 154424 0 ['Don Toliver'] ['4Gso3d4CscCijv0lmajZWs'] 2020-03-13 2020 3 0.652 0.631 6 -5.718 0 0.0893 0.52400 0.000579 0.1650 0.350 127.998 4
93139 39LLxExYz6ewLAcYrzQQyP Levitating 82 203808 0 ['Dua Lipa'] ['6M2wZ9GZgrQXHCFfjv46we'] 2020-03-27 2020 3 0.695 0.884 6 -2.278 0 0.0753 0.05610 0.000000 0.2130 0.914 103.014 4
93149 4lsHZ92XCFOQfzJFBTluk8 You Got It 82 203145 1 ['Vedo'] ['3wVXTWabe3viT0jF7DfjOL'] 2020-03-27 2020 3 0.762 0.433 5 -8.937 1 0.1870 0.14300 0.000000 0.1180 0.394 122.074 4
93187 2lCkncy6bIB0LTMT7kvrD1 Azul 81 205933 0 ['J Balvin'] ['1vyhD5VmyZ7KMfW5gqLgo5'] 2020-03-19 2020 3 0.843 0.836 11 -2.474 0 0.0695 0.08160 0.001380 0.0532 0.650 94.018 4
93191 6qBFSepqLCuh5tehehc1bd Like I Want You 81 260776 0 ['Giveon'] ['4fxd5Ee7UefO4CUXgwJ7IP'] 2020-03-27 2020 3 0.678 0.355 10 -7.757 0 0.0627 0.75900 0.000071 0.1140 0.438 119.772 3
93221 6bnF93Rx87YqUBLSgjiMU8 Heartless 81 198267 1 ['The Weeknd'] ['1Xyo4u8uXC1ZmMpatF05PJ'] 2020-03-20 2020 3 0.537 0.746 10 -5.507 0 0.1500 0.02360 0.000001 0.1560 0.252 170.062 4

The titles of the songs seem to be mirroring the world's mood back then: 'Save Your Tears', 'Heartbreak Anniversary', 'Maniac', 'Break My Heart'...

Features and Popularity

Probably one of the most important piece of information we need is what sort of features of a song impact its popularity.

You have a hunch that large audiences likes songs that are good for dancing. Let's see if your hunch can be validated with data?

p = df_tracks.groupby('popularity')['danceability'].mean()
plt.plot(p)
plt.xlabel('Popularity')
plt.ylabel('Danceability')
plt.title('Impact of danceability on popularity')

Bingo! How about "instrumentalness"? "Instrumentalness" of value 1 indicates that there are no words at all, the lower the value the more words the song contains.

Apparently, the more words the better. It's a pity we can't check the quality of the lyrics by any metrics, but maybe they're not that important.

'Call me maybe' (Carly Rae Jepsen) or 'I really, really, really wanna zig-a-zig-ah!' (Spice Girls) - but they're still great!

We can experiment with all the features separately or we could put them all on a matrix and show the correlations between each individual feature and popularity.

We can see that there is no significant positive correlation between popularity and a song's feature. The most positive correlation occurs between popularity "danceabilty", loudness and energy.

Let's just quickly check as well, how long a song lasts on average today. Was it always like that?

To check the most popular artists we'll use the artists dataset.

df_artists = pd.read_csv('/content/drive/MyDrive/artists.csv')

artists_popular = df_artists.sort_values(by=['popularity'], ascending=False)
artists_popular[:10]
id followers genres name popularity
144481 1uNFoZAHBGtllmzznpCI3s 44606973.0 ['canadian pop', 'pop', 'post-teen pop'] Justin Bieber 100
115489 4q3ewBCX7sLwd24euuV69X 32244734.0 ['latin', 'reggaeton', 'trap latino'] Bad Bunny 98
126338 06HL4z0CvFAxyc27GXpf02 38869193.0 ['pop', 'post-teen pop'] Taylor Swift 98
313676 3TVXtAsR1Inumwj472S9r4 54416812.0 ['canadian hip hop', 'canadian pop', 'hip hop'... Drake 98
144484 3Nrfpe0tUJi4K4DXYWgMUX 31623813.0 ['k-pop', 'k-pop boy group'] BTS 96
115490 4MCBfE4596Uoi2O4DtmEMz 16996777.0 ['chicago rap', 'melodic rap'] Juice WRLD 96
144483 1Xyo4u8uXC1ZmMpatF05PJ 31308207.0 ['canadian contemporary r&b', 'canadian pop', ... The Weeknd 96
144485 66CXWjxzNUsdJxJ2JdwvnR 61301006.0 ['pop', 'post-teen pop'] Ariana Grande 95
144486 1vyhD5VmyZ7KMfW5gqLgo5 27286822.0 ['latin', 'reggaeton', 'reggaeton colombiano',... J Balvin 95
115491 7iK8PXO48WeuP03g8YR51W 5001808.0 ['trap latino'] Myke Towers 95

You might have noticed that top ten songs and artists differ. Justin Bieber is an unquestionable king of pop, but though their songs are not most popular right now The Weeknd, Taylor Swift or Drake are the listeners' favourites too.

One last look at the genres itself and we're done.

pop_genres = df_artists.sort_values(by ='genres').sort_values('popularity', ascending=False)
pop_genres[['genres', 'popularity']][:30]
genres popularity
144481 ['canadian pop', 'pop', 'post-teen pop'] 100
313676 ['canadian hip hop', 'canadian pop', 'hip hop'... 98
126338 ['pop', 'post-teen pop'] 98
115489 ['latin', 'reggaeton', 'trap latino'] 98
144484 ['k-pop', 'k-pop boy group'] 96
115490 ['chicago rap', 'melodic rap'] 96
144483 ['canadian contemporary r&b', 'canadian pop', ... 96
115491 ['trap latino'] 95
144487 ['dance pop', 'pop', 'uk pop'] 95
144485 ['pop', 'post-teen pop'] 95
144486 ['latin', 'reggaeton', 'reggaeton colombiano',... 95
144488 ['detroit hip hop', 'hip hop', 'rap'] 94
313813 ['rap', 'slap house'] 94
117022 ['north carolina hip hop', 'rap'] 93
144489 ['puerto rican pop', 'trap latino'] 93
144490 ['dance pop', 'pop', 'post-teen pop'] 93
313701 ['atl hip hop', 'atl trap', 'rap', 'trap'] 93
126409 ['latin', 'puerto rican pop', 'reggaeton', 'tr... 93
145802 ['dfw rap', 'melodic rap', 'rap'] 93
144491 ['alternative r&b', 'pop'] 92
144139 ['chicago rap', 'rap'] 92
112978 ['brooklyn drill'] 92
144138 ['pop', 'uk pop'] 92
144140 ['emo rap', 'miami hip hop'] 92
144141 ['latin', 'reggaeton', 'reggaeton flow', 'trap... 92
126658 ['electropop', 'pop'] 92
313508 ['barbadian pop', 'dance pop', 'pop', 'post-te... 92
115492 ['dance pop', 'pop', 'pop rap'] 91
222696 ['melodic rap', 'philly rap', 'rap', 'trap'] 91
125171 ['brooklyn drill', 'melodic rap', 'nyc rap'] 91

Apart from all kinds of pop, there's latino, rap, hip hop and r&b.

Now you know what to do to be a successful producer and use data and analytics to your advantage to pick the next big hit.

Good luck!