How to combine MFCC vectors with labels from an annotation file to pass to a neural network


Using librosa, I created MFCCs for my audio file as follows:

import librosa

y, sr = librosa.load('myfile.wav')
print(y)
print(sr)
mfcc = librosa.feature.mfcc(y=y, sr=sr)

I also have a text file that contains manual annotations [start, stop, tag] corresponding to the audio, as follows:

0.0 2.0 sound1
2.0 4.0 sound2
4.0 6.0 silence
6.0 8.0 sound1
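
(As an aside, a minimal parsing sketch for a file in this format, assuming whitespace-separated fields and a hypothetical file name 'annotations.txt':)

# Sketch: parse whitespace-separated "start stop tag" lines into tuples
# ('annotations.txt' is a hypothetical name for the annotation file above)
def load_annotations(path):
    annotations = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue  # skip blank or malformed lines
            annotations.append((float(parts[0]), float(parts[1]), parts[2]))
    return annotations

# load_annotations('annotations.txt') ->
# [(0.0, 2.0, 'sound1'), (2.0, 4.0, 'sound2'), (4.0, 6.0, 'silence'), (6.0, 8.0, 'sound1')]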

QUESTION: How do I combine the MFCCs generated by librosa with the annotations from the text file?

The final goal is to combine the MFCCs with their corresponding labels and pass them to a neural network, so the network has the MFCCs and corresponding labels as training data.

If the features were one-dimensional, I could have N columns of values and a final column Y with the class label. But I'm confused about how to proceed, since the MFCC array has a shape like (16, X) or (20, Y), so I don't know how to combine the two.

My sample MFCCs are here: https://gist.github.com/manbharae/0a53f8dfef6055feef1d8912044e1418

Please help, thank you.

Update: The objective is to train a neural network so that it can identify a new sound when it encounters one in the future.

I googled and found that MFCCs are very good for speech. However, my audio contains speech but I want to identify non-speech. Are there any other recommended audio features for a general-purpose audio classification/recognition task?

Accepted answer


Try the following. The explanation is included in the code.

import numpy
import librosa

# The following function returns a label index for a point in time (tp).
# This is pseudocode for you to complete.
def getLabelIndexForTime(tp):
    # Search the loaded annotations for the label that corresponds to the given time,
    # then convert that label to an index that represents its unique value in the set,
    # i.e. 'sound1' = 0, 'sound2' = 1, ...
    # print(tp)  # for debug
    label_index = 0  # replace with the logic described above
    return label_index

if __name__ == '__main__':
    # Load the waveform samples and convert them to mfcc
    raw_samples, sample_rate = librosa.load('Front_Right.wav')
    mfcc = librosa.feature.mfcc(y=raw_samples, sr=sample_rate)
    print('Wave duration is %4.2f seconds' % (len(raw_samples) / float(sample_rate)))

    # Create the network's input training data, X.
    # mfcc is organized (feature, sample) but the net needs (sample, feature),
    # so X is mfcc reorganized to (sample, feature).
    X = numpy.moveaxis(mfcc, 1, 0)
    print('mfcc.shape:', mfcc.shape)
    print('X.shape:   ', X.shape)

    # Note that 512 samples is the default 'hop_length' used in calculating
    # the mfcc, so each mfcc frame spans 512/sample_rate seconds.
    mfcc_samples = mfcc.shape[1]
    mfcc_span = 512 / float(sample_rate)
    print('MFCC calculated duration is %4.2f seconds' % (mfcc_span * mfcc_samples))

    # For the 'n' network input samples, calculate the time point where each occurs
    # and get the appropriate label index for it.
    # Use +0.5 to get the middle of the mfcc frame's point in time.
    Y = []
    for sample_num in range(mfcc_samples):
        time_point = (sample_num + 0.5) * mfcc_span
        label_index = getLabelIndexForTime(time_point)
        Y.append(label_index)
    Y = numpy.array(Y)

    # Y now contains the network's output training values.
    # Note: for some nets you may need to convert this to one-hot format.
    print('Y.shape:   ', Y.shape)
    assert Y.shape[0] == X.shape[0]  # X and Y have the same number of samples

    # Train the net with something like...
    # model.fit(X, Y, ...)  # i.e. for a Keras NN model
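
One way to complete the getLabelIndexForTime pseudocode is sketched below (not part of the original answer; it assumes the annotations have already been loaded as (start, stop, label) tuples matching the question's text file, and that labels are mapped to integer indices in order of first appearance):

# Minimal completion sketch for getLabelIndexForTime().
# Assumption: 'annotations' holds the (start, stop, label) rows from the annotation file.
annotations = [(0.0, 2.0, 'sound1'), (2.0, 4.0, 'sound2'),
               (4.0, 6.0, 'silence'), (6.0, 8.0, 'sound1')]

# Map each distinct label to an integer index, e.g. {'sound1': 0, 'sound2': 1, 'silence': 2}
label_to_index = {}
for _, _, label in annotations:
    if label not in label_to_index:
        label_to_index[label] = len(label_to_index)

def getLabelIndexForTime(tp):
    # Return the index of the label whose [start, stop) interval contains tp
    for start, stop, label in annotations:
        if start <= tp < stop:
            return label_to_index[label]
    return label_to_index.get('silence', 0)  # assumption: treat unlabelled gaps as silence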

I should mention that the Y data here is intended for a network with a softmax output that can be trained on integer label data. Keras models accept this with the sparse_categorical_crossentropy loss function (I believe the loss function internally converts it to one-hot encoding). Other frameworks require the Y training labels to be delivered already in one-hot encoded format, which is more common. There are lots of examples of how to do the conversion. For your case you could do something like...
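
For illustration, a minimal Keras model along those lines might look like the following sketch (not part of the original answer; the layer sizes, optimizer, and num_label_types value are placeholder assumptions):

# Sketch of a Keras net with a softmax output, trained directly on the integer
# labels in Y via sparse_categorical_crossentropy (sizes are placeholders)
from keras.models import Sequential
from keras.layers import Dense

num_label_types = 3  # e.g. sound1, sound2, silence

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(X.shape[1],)))
model.add(Dense(num_label_types, activation='softmax'))
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X, Y, epochs=20, batch_size=32)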

Yoh = numpy.zeros(shape=(Y.shape[0], num_label_types), dtype='float32')
for i, val in enumerate(Y):
    Yoh[i, val] = 1.0
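
If you are already using Keras, its built-in helper does the same conversion:

# Equivalent one-hot conversion with Keras' helper (same result as the loop above)
from keras.utils import to_categorical
Yoh = to_categorical(Y, num_classes=num_label_types)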

As for MFCCs being acceptable for classifying non-speech, I would expect them to work, but you may want to try modifying their parameters; for example, librosa lets you pass n_mfcc=40 so you get 40 features instead of just 20. For fun, you might try replacing the MFCCs with a simple FFT of the same size (512 samples) and see which works best.
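
A sketch of those two variations (not from the original answer; the parameter values are just the ones mentioned above):

# 40 MFCCs instead of the default 20
mfcc40 = librosa.feature.mfcc(y=raw_samples, sr=sample_rate, n_mfcc=40)

# Alternative: magnitude spectrum from a plain 512-sample FFT, also organized
# (feature, frame), so it can be transposed to (frame, feature) the same way
spec = numpy.abs(librosa.stft(raw_samples, n_fft=512, hop_length=512))
X_fft = numpy.moveaxis(spec, 1, 0)  # 257 frequency-bin features per frame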
