Analysis of the correlation between algorithmic complexity (approximated by LZMA compression) and musical popularity using the Lakh MIDI dataset and Spotify metrics.

TristanDonze/music-complexity-analysis

Algorithmic Complexity & Perceived Musical Interest

Do listeners prefer the comfort of familiar patterns or the novelty of the unexpected?

This project investigates the relationship between the structural complexity of music and its popularity. Using LZMA compression of MIDI files as a computable proxy for Kolmogorov complexity, we analyzed thousands of tracks across various genres to determine whether "simple" music is inherently more successful commercially.
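
As an illustration of the compression-as-complexity idea, here is a minimal sketch using Python's standard `lzma` module. It assumes complexity is scored as compressed size divided by original size of the raw bytes; the repository's notebooks may define the ratio differently or compress a parsed note sequence instead. The function name `compression_ratio` and both byte strings are made up for the example and are not real MIDI data.

```python
import lzma
import random


def compression_ratio(data: bytes) -> float:
    """Return compressed size / original size (lower means more compressible)."""
    if not data:
        raise ValueError("cannot score empty input")
    compressed = lzma.compress(data, preset=9)
    return len(compressed) / len(data)


# Synthetic stand-ins, not real MIDI: a short looped pattern vs. pseudo-random bytes.
repetitive = bytes([60, 64, 67, 72]) * 2000
rng = random.Random(0)  # fixed seed so the example is reproducible
noisy = bytes(rng.randrange(256) for _ in range(8000))

ratio_repetitive = compression_ratio(repetitive)
ratio_noisy = compression_ratio(noisy)
print(f"repetitive: {ratio_repetitive:.3f}, noisy: {ratio_noisy:.3f}")
```

To score an actual file, the bytes could be read with `pathlib.Path(path).read_bytes()` and passed to the same function. The looped input should score far lower than the random one, which is the property the proxy relies on.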

Project Goals

  • Quantify Complexity: Use compression ratios and "unexpectedness" scores to measure the structural density of musical files (MIDI).
  • Correlate with Popularity: Link these metrics with Spotify popularity scores.
  • Genre Analysis: Map the landscape of musical genres based on their algorithmic entropy.
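
The correlation step can be sketched as follows. This assumes a rank-based (Spearman) coefficient, which the project description does not confirm; the helper names and the five data points are invented purely for illustration and are not results from the study.

```python
from math import sqrt


def ranks(values):
    """Assign 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of equal values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up values for illustration only (compression ratio, popularity score).
complexity = [0.10, 0.15, 0.22, 0.30, 0.41]
popularity = [70, 65, 66, 40, 25]
rho = spearman(complexity, popularity)
print(f"Spearman rho: {rho:.2f}")
```

A rank-based measure is used here because popularity scores and compression ratios need not be linearly related; a library routine such as `scipy.stats.spearmanr` would be the usual choice in a notebook.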

Key Findings

  • Simplicity Bias: We observed a negative correlation between complexity and popularity. Mass audiences generally favor higher predictability and repetition.
  • The "Goldilocks Zone": Popularity peaks at low-to-medium complexity and drops as music becomes too entropic (random or chaotic).
  • Genre Clustering:
    • High Popularity / Low Complexity: Hip-hop, Metal, Punk (characterized by repetitive loops or rhythmic patterns).
    • Low Popularity / High Complexity: Jazz, Classical (characterized by variation and improvisation).
  • MIDI Limitation: Symbolic data (MIDI) omits key sources of information such as vocals and timbre, which likely explains why lyrically complex genres like Hip-hop appear algorithmically "simple" in this analysis.

Setup & Installation

To reproduce the dataset and run the analysis, follow these steps:

1. Kaggle API Setup

You need the Kaggle API to download the Lakh MIDI dataset.

  1. Install the Kaggle client:
    pip install kaggle
  2. Create an API token by visiting: https://www.kaggle.com/settings/
  3. Place the downloaded kaggle.json file in your configuration directory and restrict its permissions:
    mkdir -p ~/.kaggle && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
  4. Verify the installation:
    kaggle datasets list

2. Dataset Generation

Run the build script to download and extract the MIDI files into a data/ folder:

source build_dataset.sh

3. (Optional) Rebuild Metadata

If you wish to rebuild the dataset metadata (popularity scores, genres) from scratch:

  1. Obtain a Client ID and Client Secret from Spotify: https://developer.spotify.com/documentation/web-api
  2. Use these credentials in the preprocessing/build_csv.py script.
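
One way to supply the credentials without hard-coding them is sketched below. This is not the code in preprocessing/build_csv.py: the environment variable names `SPOTIFY_CLIENT_ID` and `SPOTIFY_CLIENT_SECRET` and both function names are hypothetical, and the request assumes Spotify's client-credentials flow against its token endpoint.

```python
import base64
import os
import urllib.parse
import urllib.request

TOKEN_URL = "https://accounts.spotify.com/api/token"


def credentials_from_env():
    """Read the (hypothetically named) environment variables holding the credentials."""
    try:
        return os.environ["SPOTIFY_CLIENT_ID"], os.environ["SPOTIFY_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"missing environment variable {missing}") from None


def build_token_request(client_id: str, client_secret: str) -> urllib.request.Request:
    """Build, but do not send, a client-credentials token request."""
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    body = urllib.parse.urlencode({"grant_type": "client_credentials"}).encode()
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


# Placeholder values; real ones would come from credentials_from_env().
request = build_token_request("example-id", "example-secret")
print(request.get_method(), request.full_url)
```

Sending the request with `urllib.request.urlopen(request)` should return a JSON body containing an access token, which is then passed as a Bearer token on Web API calls. Keeping the secret in the environment avoids committing it to the repository.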

Usage

You can explore the analysis through the provided Jupyter Notebooks.

  • Performance Note: The dataframes used for the visualizations are pre-saved, so running the notebooks to generate plots is fast and does not require re-running the heavy compression step.
