I want to create a little program called Nachtigall to make working with external music trackers easier. Its goal is to take a voice sample, detect the notes, and print them as understandable strings. Cf. the previous posts.
This post walks step by step through the application of that plan and whatever problems we encounter along the way.
First, we'll be more comfortable with a separate virtual environment for this little experiment. In a new directory, let's do the usual.
python3 -m venv .venv
source .venv/bin/activate
Then let's install the library, as per the documentation.
pip install librosa
We'll keep everything in one monolithic file for the moment. First, let's see if the example works. Thank you, Brian McFee, for writing it, by the way. Fortunately, the Python file is directly downloadable at the bottom of the documented example. Let's give it a spin.
python plot_audio_playback.py
[...]
ModuleNotFoundError: No module named 'matplotlib'
[...]
Easily fixable, matplotlib is standard stuff for plots.
pip install matplotlib
Rerunning it yields:
ModuleNotFoundError: No module named 'IPython'
IPython is also pretty standard for notebooks and such.
pip install IPython
ModuleNotFoundError: No module named 'mir_eval'
Sigh... Is there a requirements.txt or something of that sort? Nope. Ok, let's continue pip installing bit by bit. I don't know mir_eval though; how legit is it? From its page, it sounds pretty cool actually, even if I only understand half the words. It has only 20 contributors on GitHub, but shows recent signs of life. Ah, and Brian McFee is a contributor apparently, neat. In the example code, that's where the synthesis comes from.
pip install mir_eval
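Side note: since upstream ships no requirements file, here is one collecting every package this post ends up pip-installing (names as they appear on PyPI, versions left unpinned; soundfile only becomes necessary a bit further down).

# requirements.txt for this experiment
librosa
matplotlib
ipython
mir_eval
soundfile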
We execute it again; a file (the sound sample) is downloaded to the cache, but nothing happens during execution. I check my sound settings: nothing. Let's dig into the file. Ah, given the IPython structure and all, I guess it expects to live in a notebook and therefore be executed in a browser. We could do it all in a notebook, but I could also just write the sound file out. There seems to be a write_wav function.
librosa.output.write_wav("librosa_example_out.wav", y+y_clicks, sr)
And it fails, because that function used to exist but has since been deprecated. Apparently, I/O stuff is now mostly delegated to PySoundFile.
pip install soundfile
To be used like this for wav output.
import soundfile as sf
sf.write("librosa_example_out.wav", y+y_clicks, sr, subtype='PCM_24')
And it works! We get the example trumpet sounds with clicks when we switch notes. Encouraging. Now let's boil all that code to the minimum we need, ensure that the example still works, then start tweaking things.
# ISC license applicable, see license statement at bottom of the post.
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load(librosa.ex('trumpet'))

f0, voiced_flag, voiced_probs = librosa.pyin(y, sr=sr,
                                             fmin=librosa.note_to_hz('C2'),
                                             fmax=librosa.note_to_hz('C7'),
                                             fill_na=None)

# Compute the onset strength envelope, using a max filter of 5 frequency bins
# to cut down on false positives
onset_env = librosa.onset.onset_strength(y=y, sr=sr, max_size=5)

# Detect onset times from the strength envelope
onset_times = librosa.onset.onset_detect(onset_envelope=onset_env, sr=sr, units='time')

# Sonify onset times as clicks
y_clicks = librosa.clicks(times=onset_times, length=len(y), sr=sr)

sf.write("librosa_example_out.wav", y+y_clicks, sr, subtype='PCM_24')
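One detail the boiled-down version drops: the index-matching code further down needs the frame timestamps that go with f0. I am assuming they come from librosa.times_like, which is where the original example gets its time axis if I recall correctly:

times = librosa.times_like(f0, sr=sr)  # timestamp (in seconds) of each f0 frame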
I noticed that the original example does not use the mir_eval synthesized sound for the onset detection and uses the original sound instead. That synthesized sound seemed to involve some unwanted frequency suppression. Given that I will use crappy microphone recordings, I thought I should go with the synthesized sound instead of the original for the onset detection, in the hope that it could work as a "good-enough" filter against background noise and the like.
However, after adapting the example and running it again, the onset detection gets completely messed up, which is not what I expected. And that's with a clean sound! So I got rid of it all. Premature optimization really is a bad idea most of the time.
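For the record, the abandoned detour looked roughly like this. This is reconstructed from the description above rather than copied from the code I had, so treat the exact mir_eval call as an assumption:

# Synthesize the pyin contour with mir_eval and run the onset detection on that
# instead of the raw recording (this is the variant that got scrapped).
import mir_eval.sonify

y_f0 = mir_eval.sonify.pitch_contour(times, f0, sr)
onset_env = librosa.onset.onset_strength(y=y_f0, sr=sr, max_size=5)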
I replaced the example trumpet sounds with some voice recordings. The onset detection works quite well, especially if the recording is cleaned of background noise with Audacity. Pitch slides, however, count as a single onset, which might make the original idea for the classification a bit naïve. Let's implement it anyway, since it would still be better than nothing, see if it is useful as is, then iterate.
Now we want to slice f0 into buckets corresponding to the held "notes", whose starts and ends are contained in onset_times.
Printing onset_times to the console yields values in seconds, and printing the lengths of the times and f0 variables shows they are the same. So the plan is clear: for each onset start and end, find the closest matching indices in times, then average the frequency over that range. Now, there are many things that can go wrong here (an onset start equal to its onset end, for instance), but we'll start naïvely and continue iteratively.
We have one fewer note than we have onset times. So let's loop over onset_times, taking the i-th and (i+1)-th values as each note's start and end. And let's be super duper naïve: we can iterate once over the times variable and look for the entries that bracket each onset, i.e. the first index j+1 such that times[j] < onset <= times[j+1].
Probably easier to read as code:
# Store the start and end indices of the notes
f0_indices_note_starts = -1*np.ones_like(onset_times[1:], int)
f0_indices_note_ends   = -1*np.ones_like(onset_times[1:], int)
for i in range(len(onset_times)-1):
    onset_start = onset_times[i]
    onset_end   = onset_times[i+1]
    for j in range(len(times)-1):
        is_start_found = f0_indices_note_starts[i] != -1
        is_end_found   = f0_indices_note_ends[i] != -1
        if is_start_found and is_end_found:
            break
        if onset_start <= times[j+1] and times[j] < onset_start:
            f0_indices_note_starts[i] = j+1
        if onset_end <= times[j+1] and times[j] < onset_end:
            f0_indices_note_ends[i] = j+1
assert not -1 in f0_indices_note_starts, f"Start index detection issue, {f0_indices_note_starts}"
assert not -1 in f0_indices_note_ends, f"End index detection issue, {f0_indices_note_ends}"
assert all(0 < (f0_indices_note_ends - f0_indices_note_starts)), f"Start indices larger than end indices: start indices {f0_indices_note_starts} end indices {f0_indices_note_ends}"
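As an aside (not something I used at the time): numpy can do this bracketing in one call. np.searchsorted returns, for each onset time, the first index whose timestamp is greater than or equal to it, which is exactly the j+1 the loop above hunts for. A sketch, assuming the same times and onset_times arrays:

# Same start/end bracketing as the nested loop, done vectorized
idx = np.searchsorted(times, onset_times, side='left')
f0_indices_note_starts = idx[:-1]
f0_indices_note_ends   = idx[1:]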
Now all that's left is to extract the frequency values for the held notes, convert to notation and print.
# Extract the frequency ranges and convert to legible notes
notes_as_str = []
for s, e in zip(f0_indices_note_starts, f0_indices_note_ends):
    note_in_hz = np.mean(f0[s:e+1])
    notes_as_str.append(librosa.hz_to_note(note_in_hz))
print(f"{len(notes_as_str)} notes detected:")
print(",".join(notes_as_str))
Seems to work well enough. That's where tests would come in handy, but given how quickly we get feedback, let's wait until we hit a larger hurdle. To validate the thing, I record an exact sequence (C2,C3,C1,C#1,F1,E1), create a .wav, and feed it to the program. And bummer, I get D2,G#2,C2,C2,C2. Something is amiss.
Ok, let's put the clicks back in our sound sample to control whether the onset detection is to blame.
# Sonify onset times as clicks
y_clicks = librosa.clicks(times=onset_times, length=len(y), sr=sr)
sf.write("librosa_example_out.wav", y+y_clicks, sr, subtype='PCM_24')
No, the onset detection is fine. Could it be that the pyin function should have its fmin and fmax changed? Maybe the fmin (by default at C2) is a bit high? Well, replacing it with C0 does change things, but yields F2,G2,C1,D♯1,F1,G♯0, which is still pretty wonky.
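For context, here is why the lower bound matters for this particular test sequence: pyin only considers candidate frequencies between fmin and fmax, and most of the notes I recorded sit below C2.

print(librosa.note_to_hz('C2'))  # 65.41 Hz: the fmin used so far
print(librosa.note_to_hz('C1'))  # 32.70 Hz: C1, C#1, E1 and F1 all lie below that bound

So lowering fmin is necessary here, even if it is clearly not sufficient.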
One thing I noticed when visualising the sound with the clicks in Audacity is that the next onset really comes at the start of the next note (as opposed to at the end of the current note). This means that there is quite a bit of silence inside each slice, so when it gets averaged, the result probably isn't very representative of the note's actual frequency.
Let's print out a couple of the frequency slices before averaging to see this in practice.
[ 65.40639133 65.40639133 65.40639133 65.40639133 65.40639133 65.40639133 65.40639133 65.40639133 66.93519325 73.84148765 81.93226047 90.909535 100.87044475 111.92276613 124.18608453 137.79308825 130.81278265]
[130.81278265 130.81278265 130.81278265 130.81278265 130.81278265 130.81278265 130.81278265 130.81278265 130.81278265 105.64102396 85.80717486 69.69708341 56.61162302 45.98292646 37.34974221 30.33741761 32.70319566]
[32.70319566 32.70319566 32.70319566 32.70319566 32.70319566 32.70319566 32.70319566 32.70319566 32.89264335 33.08318849 33.27483745 33.46759663 33.66147244 33.85647137 34.05259991 34.44827206 34.64782887]
...
If I convert 65.40639133, 130.81278265 and 32.70319566 to notes, I do get the expected C2, C3, C1. Which, combined with the values above, suggests that an average is not a great way to aggregate this data: the tail of each note, where there is supposed to be silence, gets a frequency too, and that shifts everything around.
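To make that concrete, here is a quick check on the first slice printed above (values retyped from the printout, so treat them as approximate):

import numpy as np
import librosa

# First slice from the printout: eight frames of C2 followed by the upward glide
first_slice = np.array([65.40639133]*8 + [66.93519325, 73.84148765, 81.93226047,
                                          90.909535, 100.87044475, 111.92276613,
                                          124.18608453, 137.79308825, 130.81278265])
print(librosa.hz_to_note(np.mean(first_slice)))  # the mean lands a few semitones above C2
print(librosa.hz_to_note(first_slice))           # yet frame by frame, C2 clearly dominates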
But maybe our good old friends, the voiced/unvoiced estimates that messed up the onset detection earlier, can be put to good use here for the note classification.
notes_as_str = []
for s, e in zip(f0_indices_note_starts, f0_indices_note_ends):
    valid_frequencies = f0[s:e+1][voiced_flag[s:e+1]]
    print(valid_frequencies)
    note_in_hz = np.mean(valid_frequencies)
    notes_as_str.append(librosa.hz_to_note(note_in_hz))
This yields:
[ 65.40639133 65.40639133 65.40639133 65.40639133 65.40639133 65.40639133 65.40639133 65.40639133 66.93519325 137.79308825 130.81278265]
[130.81278265 130.81278265 130.81278265 130.81278265 130.81278265 130.81278265 130.81278265 130.81278265 130.81278265 30.33741761 32.70319566]
[32.70319566 32.70319566 32.70319566 32.70319566 32.70319566 32.70319566 32.70319566 32.70319566 34.64782887]
[...]
D♯2,A2,C1
Which is still wrong... but slightly better. Hmm. librosa.hz_to_note yields the closest note, according to the docs. So there must be a slight tolerance. How about converting the full array to notes, then getting the most frequent one? Rough but maybe good enough.
# Extract the frequency ranges and convert to legible notes
notes_as_str = []
for s, e in zip(f0_indices_note_starts, f0_indices_note_ends):
    valid_frequencies = f0[s:e+1][voiced_flag[s:e+1]]
    sequence_as_str = librosa.hz_to_note(valid_frequencies)
    values, counts = np.unique(sequence_as_str, return_counts=True)
    most_frequent = np.argmax(counts)
    notes_as_str.append(values[most_frequent])
Yields:
C2,C3,C1,C♯1,F1,E1
Which is the sequence the device produced originally. Fantastic! Does it work with voiced samples? It throws an error: valid_frequencies sometimes ends up empty, because the voiced_flag detection is a bit too eager at times. It does not look super reliable. What happens if we get rid of that voiced_flag filtering with our original, machine-produced example?
C2,C3,C1,C♯1,F1,C0
Not as good with trailing notes...
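A possible middle ground, which I did not explore further: keep the voiced_flag filtering but fall back to the unfiltered slice whenever a note ends up with no voiced frames at all, which is exactly what triggered the error above. A quick sketch:

# Fall back to the raw slice when every frame of a note is flagged as unvoiced
notes_as_str = []
for s, e in zip(f0_indices_note_starts, f0_indices_note_ends):
    valid_frequencies = f0[s:e+1][voiced_flag[s:e+1]]
    if valid_frequencies.size == 0:
        valid_frequencies = f0[s:e+1]
    sequence_as_str = librosa.hz_to_note(valid_frequencies)
    values, counts = np.unique(sequence_as_str, return_counts=True)
    notes_as_str.append(values[np.argmax(counts)])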
Alright, let's leave it in then, and try a voiced sample with clear separation between notes.
Also not good. The voiced samples are apparently too quiet for the onset detection, from the look of it. Can I clean that signal easily? Nope; even amplified and so on, it does not appear to work reliably.
Alright, I'll catch a break and think on it. We're close to something workable.
librosa installation instructions
librosa audio playback example
Given that I have technically made a derivative of the existing software by remixing the example file, here is the applicable license. Many thanks to Brian McFee as well as all the other contributors for making my life easier, I appreciate it.
Copyright (c) 2013--2023, librosa development team.

Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∼⋅∽⋅∽⋅∼⋅∽⋅∽⋅