
Chord Identification in Real-Time Using a Neural Network

Daniel Giró Serratosa

Treball de Fi de Grau
Grau en Enginyeria en Sistemes Audiovisuals
Escola Superior Politècnica UPF
Curs 2018-2019

Director: Sergio Ivan Giraldo Mendez


To my mother, my father and my siblings.


Acknowledgements

First of all, I thank Sergio Giraldo for supervising this work and helping me build it. I would also like to thank my colleagues, friends and family for their wise advice and their understanding.


Abstract

The need for musicians to transcribe a musical piece by ear is a well-known problem, given that the music score is not always available. Transcribing by ear is a difficult task, and for this reason automatic music transcription has been researched over the years.

In this work we focus on the task of chord recognition, for which we have implemented a real-time musical chord detection system. First, we build a data set of audio recordings consisting of all the major and minor chords played on a guitar in different ways. We then label the data using 25 classes: 12 for the major chords, 12 for the minor chords, and one for the class "None". From the audio signal we extract a feature vector using the chromagram. The chromagram is obtained by a spectral analysis that groups the peaks corresponding to notes into bins. With these features the system is trained to classify the considered chord classes.

The first models implement major and minor chords; in the future, more chord types such as augmented, diminished and suspended will be implemented.

The system is a real-time implementation: the sound coming through the microphone passes through the system, which returns a chord prediction. The objective is to build a real-time interface that gives visual feedback to the user.


Preface

The Bachelor's Thesis presented below, entitled "Chord Identification in Real-Time Using a Neural Network", arises from the difficulty musicians have in transcribing harmony. The project was carried out under the supervision of Sergio Iván Giraldo, whom I want to thank for his great guidance and help with this work.


Contents

Abstract
Preface
List of Figures
List of Tables

1. INTRODUCTION
   1.1 The Problem
   1.2 Related Work
   1.3 Objectives

2. STATE OF THE ART
   2.1 Background
   2.2 Harmony Theory
      a) Musical Chords
      b) Chord Inversions
      c) Extended Chords

3. MATERIALS AND METHODS
   3.1 External Codes
   3.2 Database
      a) First database trial
      b) Offline database
      c) Real-time database

4. METHODOLOGY
   4.1 Feature Extraction
      a) Offline implementation
      b) Real-time implementation
   4.2 Classifier
      a) Offline implementation
      b) Real-time implementation

5. RESULTS
   5.1 Parameters definition
      a) Peak Threshold
      b) Number of peaks
      c) Peak Weights
      d) Ponderation
      e) Octave Weights
   5.2 Classifier
      a) Setting a baseline
      b) The classifiers

6. DISCUSSION
   6.1 Conclusion
   6.2 Future Work

Bibliography


List of Figures

Fig. 1. Seriousness of Errors
Fig. 2. Harmonic Analysis in Chord Progressions
Fig. 3. Helix of Fifths
Fig. 4. Different Positions of C Major
Fig. 5. D minor Chromagram
Fig. 6. Bb major Chromagram
Fig. 7. G diminished Chromagram
Fig. 8. F-measure equation
Fig. 9. Precision and Recall equation
Fig. 10. SVM Hyperplanes
Fig. 11. MIDI Recordings in DAW
Fig. 12. Labeling in Sonic Visualizer
Fig. 13. Arff File Structure
Fig. 14. STFT Equation
Fig. 15. Irrelevant Peaks in FFT
Fig. 16. Frequency to Note Algorithm
Fig. 17. RMS Equation
Fig. 18. Threshold
Fig. 19. Overlapped Chromagram
Fig. 20. Algorithm Scheme
Fig. 21. Ponderation Function
Fig. 22. Program Scheme
Fig. 23. Chromagram with Thresholds
Fig. 24. Accuracy depending on Thresholds
Fig. 25. Chromagram with different N
Fig. 26. Accuracy depending on N
Fig. 27. Chromagram with Peak Weights
Fig. 28. Accuracy depending on Peak Weights
Fig. 29. Chromagram applying Ponderation
Fig. 30. Chromagram with Octave Weight
Fig. 31. Accuracy depending on Octave Weight
Fig. 32. Baseline Precision, Recall and F-measure
Fig. 33. Precision of Classifiers
Fig. 34. Recall of Classifiers
Fig. 35. F-measure of Classifiers


List of Tables

Table 1. Chord definition
Table 2. Intervalic Relation (IR) in Chords
Table 3. IR in First Inversions
Table 4. IR in Second Inversions
Table 5. IR with Octaves in Chords
Table 6. IR with Octaves in Dispositions
Table 7. C Major Chromagram


1. INTRODUCTION

1.1 The Problem

A major challenge for musicians and music students is recognizing chords and harmonic progressions. Mastering this skill has important implications when performing in "jam sessions" (a jam session is a musical event where musicians, usually instrumentalists, improvise music without extensive preparation) and when transcribing a musical piece by ear. However, mastering this skill is not an easy task. This is the motivation for investigating automatic methods to transcribe musical harmony (i.e. chord progressions) in real time.

An application scenario for this method could be the following: musicians are accustomed to searching the Internet for the chords of a musical piece. These chord transcriptions are usually made by average users with little musical training. Therefore there might be errors, a user may be looking for the chords of a specific version (e.g. with a reharmonization), or the chord progression for a specific piece might not be available. A system of this nature allows musicians to process the audio in real time and obtain a visualization of the ongoing chords.

1.2 Related Work

Automatic chord recognition is a topic that has been researched in the past within the domain of audio description. Fujishima [1] developed a chord recognition system using a chromagram, which analyses pitches categorized as notes. Gómez and Herrera [2] developed a system that extracts tonal information by building a Harmonic Pitch Class Profile (HPCP), a vector of low-level instantaneous features that represents the pitch intensity content of polyphonic music signals, mapped to a single octave.

The main idea of this method is to extract the harmony of a musical piece, so in addition to the tonal descriptors there are other important considerations. For example, Christopher Harte, Mark Sandler and Martin Gasser's work on detecting harmonic change in musical audio [3] builds a Harmonic Change Detection Function (HCDF) based on the chromagram; it is an example of how tonal descriptors can be applied. Masataka Goto's work on a real-time beat tracking system [4] generates a hierarchical beat structure that greatly helps to determine where the chord has changed; I have decided not to focus on the time domain, but it could be a great improvement to this work.

When it comes to harmonic transcription, what musicians mostly do is focus on the bass and compare its relation to the tonality. So Masataka Goto and Satoru Hayamizu's work on detecting melody and bass lines [5] is interesting, together with Emilia Gómez's work on tonality [6]; however, I have finally focused on the relation between notes and how they form chords, especially Kyogu Lee's work [7] and Emilia Gómez's methods.

Emilia Gómez developed some methods to extract tonal descriptors [8], and they have helped a lot through the process of building this method. Together with Joan Serrà, she has also developed a cover identification system based on tonal descriptors [9]; it compares the features between the original song and its cover, so it is not that relevant to this work, but it helps to understand how tonal descriptors work.

Our method has two parts: the feature extraction and the classifier, which uses machine learning algorithms. This method is implemented in real time, so it needs what Pascanu, Gulcehre, Cho and Bengio explain in their work, a Recurrent Neural Network [10]: the input changes over time and so does the output, so the RNN seems to be a good solution. Honglak Lee, Yan Largman, Peter Pham and Andrew Y. Ng used deep belief networks for audio classification [11], and Philippe Hamel and Douglas Eck used them to learn features from music audio [12]. Those articles helped me understand how my system should work.

1.3 Objectives

This work proposes a tool for musicians: a real-time harmonic transcription of music, using the chromagram as a feature and machine learning techniques for the classification.

My main goal is to obtain good accuracy in this classification task, first of all in an offline method: a song is passed through the program and the program returns the right chords, allowing the musician to simply sight-read them, as is done nowadays with the Internet. But the Internet does not have all songs, or it may not have the reharmonized version you want; with this method you can look up the chords of any music audio.

The idea is to implement it in real time, so once I obtain good accuracy, I check the performance in real time, where the objective is also to achieve a good result. The output has to be understandable and logical for the musician. In other words, if the guessed chord is wrong, it is obviously a bad result, but there can be other problems that are not necessarily program mistakes yet still make the result confusing, for example a chord that changes too fast because it is not being detected correctly, or enharmonic problems of the tonality.

Computing an accuracy is easy, but it does not always reflect whether the behaviour is correct. One wrongly classified chord can be better or worse than another, because the tonal function of the chord is also important (harmony theory is explained in the Harmony Theory subsection of the State of the Art section). For example, suppose we have a song in C major: if the program returns an F major but the real chord is D minor, it is a mistake, but a better mistake than if the program had returned a Bb, because the tonal function of the two chords is the same, subdominant. The same goes for Db7 instead of G7 (the tonic is C major), because Db7 is the tritone substitute of G7 and they perform the same function, dominant.

It is difficult to define a formula that measures performance while taking into account the tonal function or other important features of harmonic progressions. In most cases, if the chords that the program has confused share some notes, the mistake is less serious.


Here are some examples of possible program confusions (the tonic is C major):

F major {F, A, C}      D minor {D, F, A}        Less serious mistake
Db7 {Db, F, Ab, B}     G7 {G, B, D, F}          Less serious mistake
C major {C, E, G}      G major {G, B, D}        Serious mistake
F major {F, A, C}      Ab minor {Ab, Cb, Eb}    Very serious mistake

Fig. 1. Seriousness of Errors

A future objective is to improve the system: implementing new chords, improving the algorithm and making it more efficient. The main tools I have used to build it are openFrameworks, to extract the features, and Wekinator, to perform the classification. An improvement would be to replace the Wekinator application with code in openFrameworks: Wekinator is a great tool, open source and very practical, but it has some complications when it comes to training or to choosing the classifier, so a classifier implemented in openFrameworks would be more modifiable and would give better results.

Also, one option to improve the system is to implement, in addition to the chord recognition, a tonal functionality detection system: in other words, not to identify the chord by looking at its notes, but by looking at the tonal function and the tonal elements. However, the real-time implementation of this is almost impossible.

Mixing a chord recognition algorithm with a tonal functionality recognition algorithm could give good results. Here are some examples of harmonic progressions that are commonly used and that are simpler to read as tonal functions:

C major     F major     Ab minor      Eb7               D minor     G7
{C, E, G}   {F, A, C}   {Ab, Cb, Eb}  {Eb, G, Bb, Db}   {D, F, A}   {G, B, D, F}
I           IV          [VI-bs        IIbs]             II          V

C major     A7              D7              G7
{C, E, G}   {A, C#, E, G}   {D, F#, A, C}   {G, B, D, F}
I           [Ve]            [Vs]            V

Fig. 2. Harmonic Analysis in Chord Progressions

2. STATE OF THE ART

2.1 Background

There has been a lot of research on music and audio descriptors. Most of these studies did not influence this work directly, but they helped to shape the idea of this project: articles that describe audio descriptors but focus on other aspects of music rather than harmony, such as rhythm, melody or texture.

There are some articles that help to understand audio descriptors: articles that explain music similarity methods [13], tonal descriptors such as the HPCP and how useful these descriptors can be [9], or some machine and deep learning algorithms [10] [11].

This method uses the chromagram as a feature extractor and some machine learning algorithms: KNN (K-nearest neighbour), NN (Neural Network) and SVM (Support Vector Machine).

This method does not use any pre-existing database; a database has been built for this work, and it is stored on GitHub [I]. It uses an existing application that opens an interface and extracts some features [II]; the feature extractor, the chromagram, has been implemented by modifying this application.

2.2 Harmony Theory

a) Musical Chords

A musical chord is defined by the chord type and the height.

Chord      Height    Chord type
A major    A         Major
E minor    E         Minor

Table 1. Chord definition

The chord height, or chord key, is defined by the root. There are several types of chords depending on the relation between the notes. Triad chords have three notes, the root, the third and the fifth, and the relation of tones and semitones between these notes defines the chord type.

Major        1   2t     3    1'5t   5
Minor        1   1'5t   b3   2t     5
Augmented    1   2t     3    2t     #5
Diminished   1   1'5t   b3   1'5t   b5

(t = tone; 1'5t = one and a half tones, i.e. three semitones.)

Table 2. Intervalic Relation (IR) in Chords

b) Chord Inversions

The root is not always the lowest note: there are chord inversions, where sometimes the lowest note is the third or the fifth. These are the chord structures in the first inversion (the lowest note is the third) and the second inversion (the lowest note is the fifth).

First inversion:

Major        3    1'5t   5    2'5t   1
Minor        b3   2t     5    2'5t   1
Augmented    3    2t     #5   2t     1
Diminished   b3   1'5t   b5   3t     1

Table 3. IR in First Inversions

Second inversion:

Major        5    2'5t   1   2t     3
Minor        5    2'5t   1   1'5t   b3
Augmented    #5   2t     1   2t     3
Diminished   b5   3t     1   1'5t   b3

Table 4. IR in Second Inversions

But the relation between notes and frequency is a helix.

Fig. 3. Helix of Fifths

So sometimes the relation of tones and semitones between the notes of a chord is the same pattern plus 12 semitones (the semitones of the whole chromatic scale), but the colour of the chord does not change.

Major   1   8t (12st + 2t)   3   7'5t (12st + 1'5t)   5

Table 5. IR with Octaves in Chords

These are different dispositions of the same chord, C major (C, E, G):

Fig. 4. Different Positions of C Major

1    1   3'5t   5   4'5t                 3
2    1   2t     3   7'5t (12st + 1'5t)   5
3    3   4t     1   3'5t                 5

Table 6. IR with Octaves in Dispositions


In the chromagram you can see the features of the chords (height and type). For example:

Fig. 5. D minor Chromagram

Fig. 6. Bb major Chromagram

Fig. 7. G diminished Chromagram

c) Extended Chords

There are also four-note (quatriad) chords and extended chords, which contain additional notes such as the seventh or the ninth, but this method does not take these types of chords into account.


3. MATERIALS AND METHODS

3.1 External Codes

This work consists of two parts, the offline implementation and the real-time implementation, and each one has a feature extractor part and a classifier part.

I have used Ableton Live and Sonic Visualizer to build the database (explained in more detail in the Database section).

First of all, in the first trials, I used some VAMP plugins for Sonic Visualizer, such as HPCP - Harmonic Pitch Class Profile, developed by the Music Technology Group at Universitat Pompeu Fabra [III], or Invariant Pitch Chroma, developed by Queen Mary, University of London [IV]. After trying them and seeing bad results, I dismissed the VAMP plugin option and developed my own feature extractor, using the chromagram as a model.

For the offline implementation, I used MATLAB to extract the features: I built a code that reads all the files from the database and extracts features from them. One function in the feature extraction code, which converts frequencies to notes, was taken from the MathWorks repository [V].

All the features from the files are stored in txt files, and these files are read by a C++ text parser built in Microsoft Visual Studio 2017 [VI]. The output of the C++ code is read by the Weka software [VII], which performs the classification.

Weka offers a variety of classifiers and, depending on the project, one may be better than another. To evaluate how well the system performs, the F-measure (F1) is computed:

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

Fig. 8. F-measure equation

where

$$\mathrm{precision} = \frac{tp}{tp + fp} \qquad \mathrm{recall} = \frac{tp}{tp + fn}$$

Fig. 9. Precision and Recall equation

with tp = true positives, fp = false positives, fn = false negatives.
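These scores follow directly from the raw counts; a minimal C++ sketch (the struct and function names are illustrative, not part of the actual program):

#include <iostream>

// Precision, recall and F-measure from raw counts, as defined above.
// Assumes tp + fp > 0 and tp + fn > 0.
struct Scores { double precision, recall, f1; };

Scores fMeasure(int tp, int fp, int fn) {
    double precision = tp / static_cast<double>(tp + fp);
    double recall    = tp / static_cast<double>(tp + fn);
    double f1 = 2.0 * precision * recall / (precision + recall);
    return {precision, recall, f1};
}

// Example: fMeasure(90, 10, 30) gives precision 0.9, recall 0.75, F1 0.818.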

To determine whether the F-measure is good enough, the classifiers are compared with the ZeroR classifier.


The ZeroR classifier classifies all instances as the class with the most instances. In this work there are many classes, one class per chord, so the ZeroR F-measure value is very low.

First of all, I tried a Multilayer Perceptron [14]. A Multilayer Perceptron consists of at least three layers of nodes: the input layer, one or more hidden layers, and the output layer; each node represents a neuron. It is trained using a supervised learning technique called backpropagation.

After multiple trials with different parameters, I tried a Support Vector Machine (SVM) classifier [15]. An SVM classifier is a supervised learning model. The training algorithm places a hyperplane that separates the classes. The dimension of the space can be high, and the best hyperplane is the one that separates the classes the most. In the image, H1 does not separate the classes correctly; H2 separates them; H3 does too, but with a wider margin, so H3 is the correct hyperplane.

Fig. 10. SVM Hyperplanes

Finally, I tried a K-Nearest Neighbour (KNN) classifier [16]. A KNN classifier defines regions in an n-dimensional space (n is the number of inputs). The execution results of these methods are given in the Results section.

For the real-time implementation, MATLAB is no longer a good tool, so I tried to build C++ code in Visual Studio to extract the audio features. I looked for existing functions, such as a Fast Fourier Transform (FFT) [VIII], or functions like findpeaks that MATLAB already implements [IX], but I ended up not using them, because it was faster to implement the functions myself than to understand their inner workings and make them work.

I also tried to call MATLAB from the C++ code using the MATLAB Engine API for C++ [X], but finally I opted for Essentia. Essentia [XI] is an open-source library for audio and music analysis; it is only available for Mac, and I have used Windows. So I found an openFrameworks app [XII] in Visual Studio that extracts audio features, such as MFCCs and the FFT, in real time [XIII]. I have used this code to compute the chromagram, especially the FFT.


The functions used in the offline implementation that MATLAB already provides, I have written myself here. I tried to implement the machine learning classifiers in the app using a Neural Network addon [XIV], but I could not make them work together, so I use Wekinator [XV]. Wekinator is open-source software that allows machine learning algorithms to be used in real time. The openFrameworks app and Wekinator communicate using the OSC protocol [XVI].

3.2 Database

a) First database trial

This system needs a database where the chords are well labelled and well separated, since during the feature extraction of the training part the system labels windows, so it is important that the chords are well separated.

The first database trials consisted in using the chords of a song, but the labelling process was hard, slow and inaccurate. For this reason, the database called DB Guitar Chords was created. It consists of all the major and minor chords played in 8 different positions, simulating different chord positions on a guitar.

b) Offline database

The complexity of this database resides in the labelling, which is slow. For this reason, a group of MIDI tracks was recorded, where each track has all the major and minor chords in 8 guitar positions. All the chords are ordered within the track: first the track has all the C major chords, then all the C# major chords, and so on. In addition, the spaces between chords and the durations of the chords are always the same. There are three types of track: those that play all the notes of the chord at the same time and let them sound, those that play all the notes as an arpeggio, and those that play the chord strumming. Each type of track has 8 different MIDI velocities. Obtaining these MIDI tracks is fast, because it simply consists in recording 3 tracks (chord, arpeggio and strumming); all the other tracks are modified quickly and easily in the corresponding DAW.

Fig. 11. MIDI Recordings in DAW

Once we have the MIDI tracks, we export them as audio. We use Kontakt 5 by Native Instruments to simulate the different guitar sounds. Using virtual instruments is interesting because it ensures that the audio tracks do not have any unwanted noise.

The final result is 60 audio tracks, each containing all the chords in 8 different positions; the chords are ordered, so the labelling process is very easy. Once we have the audio files, we define regions, where each region belongs to a chord; these regions are defined by creating a txt file with Sonic Visualizer. Once we have the regions and the chord of every region, all the windows that temporally belong to a region are labelled with that chord.

Fig. 12. Labeling in Sonic Visualizer


The assignment between chords and windows is performed in a C++ code. It reads all the audio files from the database, splits them into windows and extracts the corresponding features of each window; once it has the feature vector for a window, the corresponding chord is appended to the end of the vector (checking the regions txt file and the chords txt file). All the feature vectors are written to an Arff file, so they can later be used in Weka.

The Arff file has this structure:

@RELATION Chords

@ATTRIBUTE Band_0 NUMERIC
@ATTRIBUTE Band_1 NUMERIC
@ATTRIBUTE Band_2 NUMERIC
@ATTRIBUTE Band_3 NUMERIC
@ATTRIBUTE Band_4 NUMERIC
@ATTRIBUTE Band_5 NUMERIC
@ATTRIBUTE Band_6 NUMERIC
@ATTRIBUTE Band_7 NUMERIC
@ATTRIBUTE Band_8 NUMERIC
@ATTRIBUTE Band_9 NUMERIC
@ATTRIBUTE Band_10 NUMERIC
@ATTRIBUTE Band_11 NUMERIC
@ATTRIBUTE Chord {None,C,C#,D,Eb,E,F,F#,G,Ab,A,Bb,B,Cm,C#m,Dm,Ebm,Em,Fm,F#m,Gm,Abm,Am,Bbm,Bm}

@DATA
0.2573,0.2863,0.2573,0.3470,0.3239,0.2631,0.2805,0.2949,0.2776,0.2516,0.3181,0.2891,E
0.2455,0.2835,0.2630,0.3507,0.3215,0.2747,0.2922,0.2747,0.2805,0.2484,0.3185,0.2922,E
0.2477,0.2831,0.2536,0.3539,0.3214,0.2654,0.3008,0.2713,0.2890,0.2448,0.3273,0.2831,E
0.2457,0.2872,0.2516,0.3434,0.3198,0.2576,0.3109,0.2813,0.2842,0.2457,0.3227,0.2931,E
0.2428,0.2902,0.2369,0.3169,0.3287,0.2665,0.3139,0.2873,0.2843,0.2517,0.3228,0.3021,E
...

Fig. 13. Arff File Structure
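The region-to-window labelling step described above reduces to a lookup of each window's start time in the region list. A minimal sketch of that lookup, assuming the regions have already been parsed from the Sonic Visualizer txt file (the Region struct and the function name are illustrative, not those of the actual program):

#include <string>
#include <vector>

// A labelled region from the regions txt file:
// the interval [startSec, endSec) is played as `chord`.
struct Region {
    double startSec;
    double endSec;
    std::string chord;
};

// Label an analysis window by the region its start time falls into;
// windows outside every region are labelled "None".
std::string labelForWindow(double windowStartSec,
                           const std::vector<Region>& regions) {
    for (const Region& r : regions) {
        if (windowStartSec >= r.startSec && windowStartSec < r.endSec) {
            return r.chord;
        }
    }
    return "None";
}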

c) Real-time database

The previous database was built for the offline implementation of the model, and the trained Weka model cannot be loaded into Wekinator. So we have used the audios from the offline database, passed them through the program to extract the features, and manually performed the real-time chord and region delimitation in Wekinator. This is a more difficult and slower task than building the previous database, so not all the files are used, because there are too many. The real-time database has three audio files (from the offline database): one in which all the notes of each chord sound at the same time, one with arpeggiated chords and one with strummed chords; each audio file contains all the chords.


4. METHODOLOGY

4.1 Feature Extraction

a) Offline implementation

The feature extractor is based on the chromagram feature. The chromagram makes a count of certain frequencies and their multiples, in other words, the musical notes.

We perform an STFT with a Blackman-Harris windowing of the audio signal:

$$X(m, k) = \sum_{n=0}^{N-1} x(n + mH)\, w(n)\, e^{-j 2 \pi k n / N}$$

Fig. 14. STFT Equation

We extract the local maxima (peaks) of the spectrum; we sort the peaks in descending order, so that the highest ones come first, and we normalize them. We then keep part of the peaks and discard the rest: we keep the first length/8 (they are sorted). This step discards all the maxima that are irrelevant; here is a graphical example:

Fig. 15. Irrelevant Peaks in FFT
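A minimal sketch of this peak selection, assuming "length" refers to the number of detected peaks (the names are illustrative, not those of the actual program):

#include <algorithm>
#include <vector>

struct SpectralPeak {
    int bin;            // FFT bin index of the local maximum
    double magnitude;   // magnitude, normalized after sorting
};

// Find local maxima of the magnitude spectrum, sort them in descending
// order, normalize by the largest, and keep only the first length/8.
std::vector<SpectralPeak> selectPeaks(const std::vector<double>& spectrum) {
    std::vector<SpectralPeak> peaks;
    for (std::size_t k = 1; k + 1 < spectrum.size(); ++k) {
        if (spectrum[k] > spectrum[k - 1] && spectrum[k] > spectrum[k + 1]) {
            peaks.push_back({static_cast<int>(k), spectrum[k]});
        }
    }
    std::sort(peaks.begin(), peaks.end(),
              [](const SpectralPeak& a, const SpectralPeak& b) {
                  return a.magnitude > b.magnitude;   // highest first
              });
    if (!peaks.empty()) {
        double maxMag = peaks.front().magnitude;
        for (SpectralPeak& p : peaks) p.magnitude /= maxMag;   // normalize
    }
    peaks.resize(peaks.size() / 8);   // keep the first length/8, discard the rest
    return peaks;
}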

We transform them to notes using the following algorithm:


Fig. 16. Frequency to Note Algorithm
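A minimal sketch of the standard equal-temperament conversion such an algorithm is based on, with A4 = 440 Hz as the reference (as assumed throughout this work; the function names are illustrative):

#include <cmath>

// Map a frequency in Hz to the nearest MIDI note number,
// assuming equal temperament with A4 = 440 Hz (MIDI note 69).
int frequencyToMidiNote(double freqHz) {
    return static_cast<int>(std::lround(69.0 + 12.0 * std::log2(freqHz / 440.0)));
}

// Reduce a MIDI note to its pitch class (0 = C, 1 = C#, ..., 11 = B),
// i.e. the chromagram bin the peak will be counted in.
int midiNoteToPitchClass(int midiNote) {
    return midiNote % 12;
}

// Example: 196.0 Hz -> MIDI note 55 (G3) -> pitch class 7 (G).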

We compute the weights. There are two types of weights: depending on the octave and depending on the peak. The peak weights give more importance to the notes that sound louder, which helps to give less importance to higher harmonics; however, they can confuse the program if some instrument or voice plays or sings a note outside the chord. The octave weights give less importance to higher harmonics and more to the lower notes, the core of the chord; the problem is that the FFT is less accurate at low frequencies, so it is difficult to determine those weights.

Once the weighting is done, the only remaining task is to count every note: we count all the peaks belonging to a C and place the result in the C bin, and so on; there are 12 bins, one per note. So finally we will have something like this:

Table 7. C Major Chromagram

We can now tell which chord has been sounding: we only have to look at the three highest values (in the case of triads) or more (this task is performed by the classifier, so we do not have to inspect anything manually).
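A minimal sketch of this counting step, assuming the peaks have already been converted to pitch classes and given their weights (the names are illustrative):

#include <array>
#include <vector>

struct WeightedNote {
    int pitchClass;   // 0..11, from the frequency-to-note step
    double weight;    // peak weight combined with the octave weight
};

// Accumulate the weighted note counts into the 12 chroma bins;
// this 12-value vector is what the classifier receives as input.
std::array<double, 12> computeChromagram(const std::vector<WeightedNote>& notes) {
    std::array<double, 12> bins{};   // zero-initialized
    for (const WeightedNote& n : notes) {
        bins[n.pitchClass] += n.weight;
    }
    return bins;
}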

But sometimes in music there is no chord: in a solo of a lead instrument, such as a brass instrument without any kind of accompaniment, in a percussive fragment, or simply in a silence. We have developed a very simple trick to handle this.

We calculate the RMS (root mean square) of the FFT:

$$\mathrm{RMS} = \sqrt{\frac{1}{N} \sum_{k=1}^{N} |X(k)|^2}$$

Fig. 17. RMS Equation


We compare the RMS value with a threshold: if it is below, the chord is None; if it is above, the chord is the output of the classifier, the predicted one.

Fig. 18. Threshold
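A minimal sketch of this energy gate (the names are illustrative):

#include <cmath>
#include <string>
#include <vector>

// Root mean square of the FFT magnitude spectrum.
double spectrumRms(const std::vector<double>& magnitudes) {
    double sumSquares = 0.0;
    for (double m : magnitudes) sumSquares += m * m;
    return std::sqrt(sumSquares / magnitudes.size());
}

// Below the energy threshold, report "None" instead of
// whatever the classifier predicted.
std::string gatedChord(const std::vector<double>& magnitudes,
                       double rmsThreshold,
                       const std::string& predictedChord) {
    return spectrumRms(magnitudes) < rmsThreshold ? "None" : predictedChord;
}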

This feature extractor has a trade-off between note accuracy and time accuracy. If the method counts the notes in only one window, it is very exposed to errors, so the solution is to perform the count by summing the notes over a group of windows. Noise is random, so the noise peaks change from window to window; summing increases the difference between the note bins, making the count less exposed to errors.

But in a song the chords are constantly changing, so if the count is performed over, say, 100 windows, the chromagram overlaps all the chords, giving a confusing result that will not be classified correctly.

Fig. 19. Overlapped Chromagram


b) Real-Time Implementation

The real-time implementation works like the offline implementation. It performs the windowing in real time and then all the steps described in the offline implementation section: finding the peaks, selecting the highest ones, transforming them into notes and placing them into bins. The whole structure looks like this:

Fig. 20. Algorithm Scheme


There is a ponderation applied to the FFT; I have defined a function (explained in more detail in the Results section) that looks like this:

Fig. 21. Ponderation Function

4.2 Classifier

a) Offline implementation

The classifier is a KNN (K-nearest neighbour). The KNN algorithm is trained with multidimensional vectors in a multidimensional feature space, each vector carrying a class label. In the classification task, the algorithm places the new vector in that space and calculates the Euclidean distance to the centre of each cluster (each cluster is a chord); the closest one gives the prediction.
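A minimal sketch of the classification step exactly as described above, i.e. distances to per-chord cluster centres (the type and function names are illustrative, not those of the actual program):

#include <array>
#include <limits>
#include <string>
#include <vector>

struct ChordCluster {
    std::string label;               // e.g. "C", "Am", "None"
    std::array<double, 12> centre;   // mean chromagram of its training examples
};

// Return the label of the cluster whose centre is closest
// (in Euclidean distance) to the incoming 12-bin chromagram.
std::string classifyChord(const std::array<double, 12>& chroma,
                          const std::vector<ChordCluster>& clusters) {
    std::string best = "None";
    double bestDist = std::numeric_limits<double>::max();
    for (const ChordCluster& c : clusters) {
        double d = 0.0;
        for (int i = 0; i < 12; ++i) {
            double diff = chroma[i] - c.centre[i];
            d += diff * diff;   // squared distance suffices for the argmin
        }
        if (d < bestDist) { bestDist = d; best = c.label; }
    }
    return best;
}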

The classifier is trained with some examples of musical parts with no chords: percussion parts like a drum groove, strumming the guitar with the strings muted so as to make a percussive sound, or only a melody with no harmony behind it. With this training, the program distinguishes the parts with no harmony, with no chords, from the parts with chords.

Weka is a great tool for machine learning algorithms: it helps you apply various classifiers to your data, and the results were good, as shown in the Results section. But it cannot be used in real time; for that, I have used Wekinator.

b) Real-Time Implementation

Wekinator allows users to apply machine learning in real time, specifically a neural network. It would have been perfect to load a trained classifier into Wekinator and use it to classify the chords in real time, but that is not possible: Wekinator and Weka are free, open-source software, and they are limited in some aspects.


So I have to train the system again. I still have the chord database, so it will be fast, but not as extensive as the training done with Weka.

The classifier works as follows: it receives as input the features of the signal, the chromagram, so it has 12 inputs. It performs the classification and sends the result.

The Wekinator program will initially have 24 outputs, one per chord (12 major chords and 12 minor chords); the idea is to implement more chords in the future. Wekinator returns 24 values, and in the program we keep the highest one and look up which output it corresponds to: that is the chord.

These messages are exchanged using the OSC protocol: Wekinator receives the features in a message sent by the program on port 6448 (the default), and it sends the result back through port 12000. The whole structure looks like this:

Fig. 22. Program Scheme
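A minimal openFrameworks sketch of this OSC exchange, including the argmax over the 24 outputs described above; it assumes Wekinator's default OSC addresses /wek/inputs and /wek/outputs (the class and method names are illustrative, not those of the actual program):

#include "ofxOsc.h"
#include <array>

class ChordOsc {
public:
    void setup() {
        sender.setup("localhost", 6448);   // Wekinator's default input port
        receiver.setup(12000);             // port Wekinator sends outputs to
    }

    // Send the 12-bin chromagram as Wekinator inputs.
    void sendChroma(const std::array<float, 12>& chroma) {
        ofxOscMessage m;
        m.setAddress("/wek/inputs");
        for (float v : chroma) m.addFloatArg(v);
        sender.sendMessage(m, false);
    }

    // Read the 24 class outputs and return the index of the highest one
    // (0..23), i.e. the predicted chord; -1 if no message has arrived.
    int receiveChordIndex() {
        int best = -1;
        while (receiver.hasWaitingMessages()) {
            ofxOscMessage m;
            receiver.getNextMessage(m);
            if (m.getAddress() != "/wek/outputs") continue;
            float bestVal = -1.0f;
            for (int i = 0; i < m.getNumArgs(); ++i) {
                float v = m.getArgAsFloat(i);
                if (v > bestVal) { bestVal = v; best = i; }
            }
        }
        return best;
    }

private:
    ofxOscSender sender;
    ofxOscReceiver receiver;
};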


5. RESULTS

5.1 Parameters definition

To improve the precision of the program, there are some parameters that modify the feature extractor and improve its performance.

The program performs a count of peaks (translated to notes), so it is important to define which peaks are important (chord notes) and which are not (noise, or higher harmonics corresponding to notes that do not belong to the chord).

So the first parameter is the peak threshold: if a peak is below the threshold, it is not counted as a note; the threshold also defines the silences between chords. But sometimes there is noise above the threshold, so the solution is not to count all the peaks but only the n highest ones. This way a higher peak is more important and is not counted in the same way as a lower peak.

There is also another ponderation, to counteract the FFT characteristics. The FFT has a uniform resolution at all frequencies, but our perception of frequency is logarithmic, so at low frequencies the FFT does not resolve the notes correctly: a 10 Hz bin in the high frequencies spans less than a semitone, while in the low notes it may span two tones.

So the ponderation is a function that gives more importance to the high frequencies. But as the frequency increases there are more harmonics, and these harmonics may be notes that are outside the chord: for a C chord (C, E, G), the harmonic series also contains Bb, G#, B, F and D.

There are also other parameters that are not tunable because of the frequency and time trade-off, such as the FFT size, the overlap and the number of previous windows that also count. The parameters I have used are the following:

Window size = 4096 samples
Overlap = 4 windows
Sampling rate = 44100 samples/s
Accumulative = 5 windows

From these values we can calculate that each analysis spans from the current window back to 0.232 seconds behind it.
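A plausible form of this span calculation, assuming the hop size $H$ equals the window size divided by the overlap factor:

$$t \approx \frac{W + N_{\mathrm{acc}} \cdot H}{f_s}$$

where $W$ is the window size, $N_{\mathrm{acc}}$ the number of accumulated windows and $f_s$ the sampling rate; with the parameters above this gives a span on the order of 0.2 seconds.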

The results are reported as precision, recall and F-measure, using a NN (Neural Network) as the classifier, because it is the classifier used in the real-time implementation. All these results are computed in the offline implementation, using cross-validation.


a) Peak Threshold

The threshold cannot be too high, because it is important that no note of the chords gets lost; if it is a low value, the removed peaks will only be noise. Here is the evaluation with different thresholds:

Fig. 23. Chromagram with Thresholds

Fig. 24. Accuracy depending on Thresholds

To compute the evolution of the accuracy depending on the threshold, we have kept the other parameters constant.


b) Number of peaks

The peaks are sorted in descending order, with the highest ones first, so it is interesting to keep only the first n in order to remove noise and keep only the important peaks (notes). The program detects a varying number of peaks; it is constantly changing, so defining the cut-off as a fixed number is dangerous, and it is better to express it as a fraction of the total number of peaks. I have tried different fractions. Here we can see how a chromagram can change depending on the number of peaks:


Fig. 25. Chromagram with different N

Fig. 26. Accuracy depending on N

To compute the evolution of the accuracy depending on the number of peaks, we have kept the other parameters constant.


c) Peak Weights

It is interesting not to count a low peak in the same way as a high one, so when the program computes the chromagram, each note's contribution is multiplied by a weight proportional to the height of its peak. Here is the difference between applying peak weights and not applying them:

Fig. 27. Chromagram with Peak Weights

Fig. 28. Accuracy depending on Peak Weights

To compute the evolution of the accuracy depending on the peak weights, we have kept the other parameters constant.


d) Ponderation

I have tried different ponderations, and the best function has the following shape (the x-axis shows the cut frequencies):

Fig. 21. Ponderation Function

Here we can see the difference between using the ponderation function and not using it; the ponderation function helps to distinguish the notes better.

Fig. 29. Chromagram applying Ponderation


e) Octave Weights

The last parameter is the octave weights: a weight is applied to the notes (peaks) depending on the octave of the note, and the lower the note, the higher the weight. This goes against the idea of the ponderation (section d) of counteracting the FFT characteristics, and that is the reason for its bad performance. The idea came from observing how musicians transcribe harmonically: they especially focus on the bass (the root note). Here is how octave weights affect the chromagram:

Fig. 30. Chromagram with Octave Weight

Fig. 31. Accuracy depending on Octave Weight

We can see that octave weights are not a good ponderation and it is better not to use them, because they give more importance to the low notes while the FFT is less precise at low frequencies.


5.2 Classifier

a) Setting a baseline

In order to compare the accuracy of the classifiers, it is necessary to establish a baseline; this baseline is defined using what is called a ZeroR classifier.

The ZeroR classifier classifies all instances as the most frequent class. In this case there are many classes, one per chord (24: 12 major chords and 12 minor chords), so the baseline is a low value. In fact, there are 25 classes, because there is a class called None that corresponds to the silent parts; the most frequent class is this one, and the baseline is the following:

Fig. 32. Baseline Precision, Recall and F-measure

b) The classifiers

I have tried 3 classifiers, excluding the ZeroR classifier: a NN (Neural Network), a KNN (K-Nearest Neighbour) and a SVM (Support Vector Machine).

Our feature extractor outputs a 12-dimensional vector in which each dimension corresponds to a note. No two chords contain exactly the same notes, so the chords are well distributed in the 12-dimensional space.

To see the performance of the classifiers, I set the parameters to have a good frequency resolution and did not take the chord changes over time into account: in the frequency-time trade-off I focused especially on frequency, so the results are not representative of the real performance on a song. As x increases, the parameters are modified to give more precision.

Fig. 33. Precision of Classifiers

Fig. 34. Recall of Classifiers


Fig. 35. F-measure of Classifiers


6. DISCUSSION

6.1 Conclusion

For this project, determining an accuracy is not as simple as applying a formula; it involves many variables: the correct extraction of the features, the temporal accuracy, the correct classification of the chords, etc. So the best way to determine whether it is useful is to look at it as a musician, not as an engineer; in other words, to try the program and see whether it is a useful tool.

In the first place, there are basic errors, like not considering the tempo in the chord changes. The program changes the chords when it detects them, and this can work in rubato or ad libitum pieces, but usually musical pieces have a tempo, and the harmonic rhythm is important. It would be helpful to implement a beat tracking system, allowing the program to change the chords on time.

Obviously the program will have a small delay, but the internal loops of the musical piece let the musician understand and predict some chords, so the delay is not that important as long as the chords change on time.

Another basic error is not taking enharmonics into account. Ab major is the same as G# major in the tempered system, and the system would always return Ab major, which is musically wrong. In the E major tonality there is a G# minor, and it is not an Ab minor, because A is natural in E major; the program would return Ab minor, which is musically wrong even though it sounds right, since the notes have different names but the same frequency. Likewise, it does not make sense for an Ab major to be followed by a C# minor, because that Ab major is clearly a G# major, the dominant of C# minor.

Another basic error is not taking into account the harmonic relation between chords. The chords of a song are not chosen randomly: they preserve a harmonic relation among themselves and with the tonality. So if the tonality of the song is E major, the probability of a G minor appearing is very low; on the other hand, the probability of a B major is very high, because it is the fifth degree of the scale.

The last basic error is not considering the internal harmonic loops of a song. A song is built from a main structure, with a harmonic progression in each section of that structure, and different sections can share the same harmonic progression. The interesting thing is that the harmonic loop usually has a fixed size: it can be 4 bars, 8, 12 (blues, for example) or more (jazz standards).

Implementing these considerations would certainly make the program better; in fact, they are considerations that musicians take into account.

Despite this, the chromagram works very well as a feature extractor (99% F-measure in K-fold cross-validation on the database). The problem is not determining the chord; it is determining the chord at the right time. The feature extractor has a large time span (the window size plus the accumulated windows), so when it comes to transcribing the harmony of a song, the program has some issues while the chords are changing, but it recognizes them correctly.

The real-time implementation does not work as well as the offline implementation, because there is noise in the analog-to-digital conversion due to a bad audio interface. I used the computer's built-in audio interface; in fact, I used another computer to recreate a microphone using a jack cable, so I performed a digital-to-analog conversion followed by an analog-to-digital conversion, both with a bad audio interface. The noise is therefore considerable, and it affects the detection of chords, because it causes random peaks that the program counts as notes.

With a good audio interface this would not be a problem, but most devices (smartphones, PCs, etc.) do not have a good audio interface, so it would be interesting to implement some kind of noise reduction algorithm.

6.2 Future Work

There are many ideas for improvements and implementations as future work. These ideas are especially focused on musical aspects, like the harmonic progression.

The next step is, in addition to recognizing the chord, to recognize the tonal function of the chord. Some mistakes are worse than others, and confusing two chords that have the same tonal function is not a big problem.

The weights function makes some chords easier to classify than others, depending on the height, so an improvement is to modify this function and make it constant along the octave.

This program does not work at all on detuned songs: it takes A4 = 440 Hz as a reference, but some songs are tuned to 432 Hz, or an instrument may simply be detuned. An improvement would be to make the reference variable (within some range).

The program has a peak threshold: if a peak is below this threshold, it is not counted as a peak. In some cases, depending on the microphone, its position, the volume, etc., the result can be bad, so implementing a variable threshold depending on the RMS would be a solution.

Another improvement would be a real-time implementation of the classifier. In the offline part the KNN was clearly the best solution for classifying the chords, so implementing a real-time KNN instead of the NN in Wekinator would be an improvement.


Bibliography

1. Fujishima, T. (1999). Real-time chord recognition of musical sound: A system

using common lisp music. Proc. ICMC, Oct. 1999, 464-467.

2. Gómez, E., & Herrera, P. (2004). Automatic extraction of tonal metadata from

polyphonic audio recordings. In AES.

3. Harte, C., Sandler, M., & Gasser, M. (2006, October). Detecting harmonic

change in musical audio. In Proceedings of the 1st ACM workshop on Audio and

music computing multimedia (pp. 21-26). ACM

4. Goto, M. (2001). An audio-based real-time beat tracking system for music with

or without drum-sounds. Journal of New Music Research, 30(2), 159-171.

5. Goto, M., & Hayamizu, S. (1999, August). A real-time music scene description

system: Detecting melody and bass lines in audio signals. In Working Notes of

the IJCAI-99 Workshop on Computational Auditory Scene Analysis (pp. 31-40).

6. Gómez, E. (2006). Tonal description of polyphonic audio for music content

processing. INFORMS Journal on Computing, 18(3), 294-304.

7. Lee, K. (2006, November). Automatic Chord Recognition from Audio Using

Enhanced Pitch Class Profile. In ICMC.

8. Gómez, E. (2006). Tonal description of music audio signals. Department of

Information and Communication Technologies.

9. Serra, J., & Gómez, E. (2007). A cover song identification system based on

sequences of tonal descriptors. Music Information Retrieval Evaluation

eXchange (MIREX).

10. Pascanu, R., Gulcehre, C., Cho, K., & Bengio, Y. (2013). How to construct deep

recurrent neural networks. arXiv preprint arXiv:1312.6026.

11. Lee, H., Pham, P., Largman, Y., & Ng, A. Y. (2009). Unsupervised feature

learning for audio classification using convolutional deep belief networks.

In Advances in neural information processing systems (pp. 1096-1104).

12. Hamel, P., & Eck, D. (2010, August). Learning features from music audio with

deep belief networks. In ISMIR (Vol. 10, pp. 339-344).

13. Pohle, T., Schnitzer, D., Schedl, M., Knees, P., & Widmer, G. (2009, October). On rhythm and general music similarity. In ISMIR (pp. 525-530); Aucouturier, J. J., & Pachet, F. (2002, October). Music similarity measures: What's the use? In ISMIR (pp. 13-17).

14. Haykin, S. (1994). Neural networks: a comprehensive foundation. Prentice Hall

PTR.

15. Wang, L. (Ed.). (2005). Support vector machines: theory and applications (Vol.

177). Springer Science & Business Media.

16. Cover, T. M., & Hart, P. E. (1967). Nearest neighbour pattern classification.

IEEE transactions on information theory, 13(1), 21-27.


I. https://github.com/danielgirotfg/Guitar-Chords-DB
II. https://github.com/leozimmerman/ofxAudioAnalyzer
III. https://www.upf.edu/web/mtg/hpcp
IV. https://code.soundsoftware.ac.uk/projects/tipic
V. https://es.mathworks.com/matlabcentral/fileexchange/35330-frequency-to-note
VI. https://visualstudio.microsoft.com/es/
VII. https://www.cs.waikato.ac.nz/ml/weka/downloading.html
VIII. https://www.nayuki.io/page/free-small-fft-in-multiple-languages
IX. https://github.com/claydergc/find-peaks
X. https://www.mathworks.com/help/matlab/calling-matlab-engine-from-cpp-programs.html
XI. https://essentia.upf.edu/documentation
XII. https://openframeworks.cc
XIII. https://github.com/paulreimer/ofxAudioFeatures
XIV. http://www.opennn.net/documentation
XV. http://www.wekinator.org
XVI. http://opensoundcontrol.org/introduction-osc
