For many years, digital signal processing in the audio field has been producing tools for a wide audience, ranging from the curious newcomer making first acquaintances, to the power user, to the studio sound engineer, to the musician, and so on.
More recently, however, computing power (we can thank the video game industry for this) has come within reach of just about everyone, and Artificial Intelligence has been able to make great strides as a result.
I have been involved in software development professionally since the beginning of the new millennium, and in digital signal processing (DSP) in particular for almost ten years. Music, from… a long time ago. In recent years I have fused these passions (which are also part of my work) with the potential of AI, especially deep learning.
This article is not intended to go into too much technical detail; it is more of an introduction to how deep learning plays a “catalyst” role in achieving good results in audio feature extraction, and to how it is used in the tool I developed and am presenting to you.
This is the “Isolator” tool, which I developed with my software house.
The tool is in an invitation-only public beta (to request access, you can send an email to dev[AT]defined[DOT]tech).
Let’s first break the infrastructure down into the canonical layers of any software product today:
- Frontend — ReactJS
- Backend — Python
The frontend accepts an input file, which for now must be in mp3 format (more compatible formats will be added over time).
You can then choose whether to receive the isolated music or the isolated vocal content.
By clicking on one of the respective buttons, the audio file is sent to the backend.
The operations that are performed by the backend, in order, are as follows:
- Cleaning the audio stream: the waveform is cleaned and equalized in order to discard “unnecessary” frequencies that could give rise to false positives during inference
- Normalization: there are various encoding methods for an audio stream (and, in general, for any stream of digital information). A waveform can be discretized (that is, digitally represented) through multiple reference systems and conventions. As a rule, signal processing algorithms assume that the amplitude of a waveform is represented in the range [-1, 1]. Because of how they work internally, the algorithms developed for Isolator prefer the signal’s amplitude over time to be represented in the range [0, 1]. The amplitude-over-time information is used later, in the reconstruction of the signal
- Breaking the audio signal into tiny “micro signals” and applying the FFT to obtain the “cepstrum” (and other useful information) from each of them
- With the cepstra of the “N” audio segments, we can extract the “features” to be compared against the inference model, trained on a database of more than 1M samples whose cepstra were used to label them as “containing vocal frequencies” or “not containing vocal frequencies”. In addition, a “similarity matrix” is built between sample Xn (a piece of our input audio file) and sample Yn (a sample from the inference model) to also determine which frequencies, if any, are part of the vocal content
- After this process is repeated, an equalization, compression, and phase-control pipeline identifies the vocal content of each of the “small bits” of the input audio file
- The newly identified vocal signal is “subtracted” from the original waveform, and another “N” micro samples (as many as were cut from the input audio file) are created containing only the vocal frequencies in question
- In a dedicated pipeline, the signal is “restored”: some sound components lost during the extraction process are recovered, the original phase is rebalanced, the magnitude is normalized, and various other checks and corrections are applied
- At this point the signal must be reconstructed in the time domain, so the Inverse Fourier Transform (IFFT) is applied
- An outbound cleanup pipeline removes any artifacts (aliasing) introduced by the IFFT
- The samples (back in the time domain) are then rejoined to form the two files “music.wav” and “vocals.wav”, which are sent back to the frontend as the response
- An automation system takes care of “feeding” the AI model with the new inferences it makes, so that it keeps learning on its own
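The core signal steps above (normalization to [0, 1], framing into “micro signals”, cepstrum extraction, a similarity matrix, and time-domain reconstruction) can be sketched roughly as follows. This is a minimal NumPy illustration, not the actual Isolator code: the frame length, hop size, and the choice of cosine similarity for the matrix are all assumptions of mine.

```python
import numpy as np

def normalize_01(x):
    """Rescale a waveform from the usual [-1, 1] range into [0, 1]."""
    return (x + 1.0) / 2.0

def frame_signal(x, frame_len=1024, hop=512):
    """Split a signal into overlapping frames (the "micro signals")."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def cepstrum(frame):
    """Real cepstrum: inverse FFT of the log magnitude spectrum."""
    spectrum = np.fft.fft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-10)  # epsilon avoids log(0)
    return np.fft.ifft(log_mag).real

def similarity_matrix(A, B):
    """Cosine similarity between each row of A (input cepstra)
    and each row of B (reference cepstra)."""
    A_n = A / (np.linalg.norm(A, axis=1, keepdims=True) + 1e-10)
    B_n = B / (np.linalg.norm(B, axis=1, keepdims=True) + 1e-10)
    return A_n @ B_n.T

def overlap_add(frames, hop=512):
    """Rebuild a time-domain signal from overlapping frames,
    as done after the per-frame IFFT (windowing omitted for brevity)."""
    frame_len = frames.shape[1]
    out = np.zeros((len(frames) - 1) * hop + frame_len)
    for i, f in enumerate(frames):
        out[i * hop : i * hop + frame_len] += f
    return out

# Toy input: one second of a 440 Hz tone at 8 kHz.
sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)

frames = frame_signal(normalize_01(x))           # (n_frames, 1024)
ceps = np.array([cepstrum(f) for f in frames])   # one cepstrum per frame
sim = similarity_matrix(ceps, ceps[:4])          # input cepstra vs. 4 "references"
recon = overlap_add(frames)                      # back to the time domain
```

In the real pipeline the reference cepstra would come from the trained model’s labelled database rather than from the input itself, and the reconstruction would be applied to the modified (vocal-subtracted) frames.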
It is worth spending a few words on the infrastructure used to develop the AI model.
I used Keras, with TensorFlow as the backend. For the initial dataset, I used approximately 1 million micro samples, with 80% of the dataset for training and 20% for testing.
The convolutional neural network I created consists of 215 blocks of 8 layers each, plus one final layer, with the ReLU activation function. With other activation functions, the “detachment” heard at the entry and exit of the vocal parts is much more abrupt and clear-cut. With ReLU it is less noticeable; on the other hand, in songs with a lot of harmonic content you can hear artifacts in the mid-high frequencies (aliasing).
The maximum accuracy achieved in the initial test phase was 98.8%: not bad, but definitely improvable.
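For illustration, a much smaller stand-in for such a network can be put together in Keras. The layer counts, kernel sizes, and input shape below are hypothetical placeholders; only the ReLU activations, the binary vocal/non-vocal output, and the 80/20 train/test split come from the description above.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

FRAME_LEN = 256  # hypothetical cepstrum length per micro sample

# Tiny stand-in for the (much larger) convolutional network described above.
model = keras.Sequential([
    keras.Input(shape=(FRAME_LEN, 1)),
    layers.Conv1D(16, 9, padding="same", activation="relu"),
    layers.MaxPooling1D(4),
    layers.Conv1D(32, 9, padding="same", activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation="sigmoid"),  # vocal vs. non-vocal
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Synthetic stand-in data; the real model was trained on ~1M labelled cepstra.
X = np.random.rand(128, FRAME_LEN, 1).astype("float32")
y = np.random.randint(0, 2, size=(128, 1))

split = int(0.8 * len(X))  # 80% train / 20% test, as in the article
model.fit(X[:split], y[:split], epochs=1, batch_size=32, verbose=0)
loss, acc = model.evaluate(X[split:], y[split:], verbose=0)
```

On random labels this toy model hovers around chance accuracy; the point is only the shape of the training setup, not the result.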
At the moment I’m working on extending inference to contexts beyond the vocal one, such as isolating drums, bass, and other layers of the music.
Isolator is in beta and will certainly undergo design and implementation changes, as the underlying research is constantly evolving and growing.
For any information, I invite you to write to dev[AT]defined[DOT]tech
Greetings to all, and happy “Isolation” (in these times…)