[NLP] Audio Feature Engineering with librosa (1)

Updated: 2024-10-27 06:29:27


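Basics
Installing librosa
Glossary
Sampling an audio file
Frequency-domain information
- Short-time Fourier transform (STFT)
Feature extraction
- Zero crossing rate (ZCR)
- Spectral centroid
- Spectral rolloff
- MFCC (Mel-frequency cepstral coefficients)
API reference
- Resampling
- Reading the duration
- Reading the sample rate
- Writing audio
- Waveform plot
- Inverse short-time Fourier transform (ISTFT)
- Amplitude to dB
- Power to dB
- - Power spectrogram example
- Spectrogram display
- Mel filter bank
- Mel-scaled spectrogram
- Extracting log-Mel spectrogram features
- MFCC coefficients
Data and code download
References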

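Basics

Extracting audio features is central to music classification, prediction, and recommendation.

[Figure: flowchart of the MFCC feature-extraction pipeline]

An audio signal is a three-dimensional signal over time, amplitude, and frequency. A sound wave has three important parameters: frequency $\omega_0$, amplitude $A_n$, and phase $\psi_n$. From the frequency-domain point of view, an audio signal is a superposition of components with different frequencies, phases, and amplitudes.

The Nyquist sampling theorem states that, when converting an analog signal to a digital one, the sampled signal retains all the information in the original as long as the sampling frequency $f$ exceeds twice the highest frequency $g$ present in the signal. Human hearing is most sensitive around 4000 Hz, so a sampling rate of about 2 × 4000 = 8000 Hz keeps the part of the signal that matters most to a typical listener, while content above 4000 Hz is lost.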
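
The effect of the sampling theorem can be checked numerically. The sketch below assumes only NumPy; the helper name dominant_frequency is illustrative and not part of librosa. A 3000 Hz tone sampled at 8000 Hz is recovered at 3000 Hz, while a 5000 Hz tone, which lies above half the sampling rate, shows up at the wrong frequency (aliasing).

```python
import numpy as np

def dominant_frequency(tone_hz, fs, duration=1.0):
    """Sample a sine of tone_hz at rate fs and return the strongest FFT bin in Hz."""
    n = int(fs * duration)
    t = np.arange(n) / fs
    x = np.sin(2 * np.pi * tone_hz * t)
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs[np.argmax(spectrum)]

fs = 8000
below = dominant_frequency(3000, fs)  # below fs / 2, recovered correctly
above = dominant_frequency(5000, fs)  # above fs / 2, aliases to fs - 5000
print(below, above)
```

Both calls report a peak near 3000 Hz, so after sampling the two tones can no longer be told apart.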

For an audio file k seconds long, sampled at 8000 Hz with 16 bits per sample, the size of the sampled file is
$$k \times 8000 \times 16\ \text{(bit)} = k \times 8000 \times 2\ \text{(byte)} = \frac{k \times 8000 \times 2}{2^{20}}\ \text{(MB)}$$
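
As a worked instance of the size formula, here is a minimal sketch in plain Python. The function name pcm_size_mb and the channels parameter are assumptions added for illustration; the formula in the text corresponds to a single channel of uncompressed samples.

```python
def pcm_size_mb(seconds, sample_rate=8000, bit_depth=16, channels=1):
    """Size in MB (1 MB = 2**20 bytes) of uncompressed PCM audio."""
    n_bytes = seconds * sample_rate * (bit_depth // 8) * channels
    return n_bytes / 2**20

size_3min = pcm_size_mb(180)  # three minutes at 8000 Hz, 16 bit
print(f"{size_3min:.2f} MB")
```

Three minutes of audio at these settings comes to a little under 3 MB.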

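Installing librosa

This article uses librosa to process audio files. Compared with Praat, librosa's documentation is more detailed and explains each function, which makes it friendlier to developers without a signal-processing background.

pip install librosa numpy scikit-learn matplotlib tensorflow keras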

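Glossary

Term	Meaning
sr	sampling rate
hop_length	frame shift (hop size)
overlapping	overlap between consecutive frames
n_fft	FFT window size
spectrum	frequency spectrum
spectrogram	spectrogram (spectrum over time)
amplitude	amplitude
mono	single channel
stereo	two channels (stereo)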

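Sampling an audio file

The interface for reading audio is

y, sr = librosa.load(path, sr=22050, mono=True, offset=0.0, duration=None)
Parameters:
path: path to the audio file
sr: target sampling rate, 22050 by default
mono: bool, whether to convert the signal to mono
offset: float, start reading this many seconds into the file
duration: float, load only this many seconds of audio
Returns:
y: audio time series
sr: sampling rate of y

Load the audio file at sr=8000, taking the first 3 minutes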

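import librosa
import matplotlib.pyplot as plt
from librosa import display as dd

path, filename = 'audio', 'aud_1'

def load_file():
    # Load the first 180 s, resampled to 8000 Hz
    info, sr = librosa.load('{}/{}.mp3'.format(path, filename), sr=8000, offset=0.0, duration=180)
    print(info.shape)
    print(sr)
    plt.figure(figsize=(10, 6))
    dd.waveplot(info, sr=sr)  # newer librosa releases replace waveplot with waveshow
    plt.show()
    return info, sr

# Keep the samples and the rate at module level; the later snippets reuse them
info, sr = load_file()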

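Fig-1: Time-amplitude plot of the audio file

Frequency-domain information

Much of the information in audio is easier to obtain in the frequency domain, which is reached through the Fourier transform.

Short-time Fourier transform (STFT)

For periodic signals, the Fourier transform converts time-domain information into the frequency domain. For an explanation of the STFT, see the video The Short Time Fourier Transform | Digital Signal Processing.

The basic idea of the STFT is to slide a window along the signal and apply a Fourier transform to the samples inside each window position, which yields the time-varying spectrum (spectrogram) of the signal.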
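
The sliding-window idea can be sketched directly in NumPy. This is a simplified illustration, not librosa's implementation: the name naive_stft is hypothetical, and the sketch skips the edge padding that librosa applies when center=True.

```python
import numpy as np

def naive_stft(y, n_fft=2048, hop_length=512):
    """Slide a Hann window over y and FFT each frame (no centering or padding)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop_length
    frames = np.stack([
        y[i * hop_length: i * hop_length + n_fft] * window
        for i in range(n_frames)
    ], axis=1)                           # shape: (n_fft, n_frames)
    return np.fft.rfft(frames, axis=0)   # shape: (1 + n_fft // 2, n_frames)

sr = 8000
t = np.arange(sr) / sr                   # one second of samples
y = np.sin(2 * np.pi * 1000 * t)         # 1000 Hz test tone
D = naive_stft(y)
peak_bin = np.argmax(np.abs(D[:, 0]))    # strongest bin in the first frame
peak_hz = peak_bin * sr / 2048
print(D.shape, peak_hz)
```

Each column of D is the spectrum of one window position, and the strongest bin of the test tone sits at about 1000 Hz.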

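Call librosa.stft to compute the time-varying spectrum

def stft_plot():
    X = librosa.stft(info)
    Xdb = librosa.amplitude_to_db(abs(X))  # amplitude to decibels
    plt.figure(figsize=(10, 6))
    dd.specshow(Xdb, sr=sr, x_axis='time', y_axis='hz')
    plt.colorbar()
    plt.show()

stft_plot()

Fig-2: STFT. The horizontal axis is time, the vertical axis is frequency, and color encodes the level in decibels (sound intensity); the closer to red, the larger the amplitude (volume)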

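Interface of the stft function

M = librosa.stft(y, n_fft=2048, hop_length=None, win_length=None, window='hann', center=True, pad_mode='reflect')
Parameters:
y: audio time series
n_fft: FFT window size; n_fft = hop_length (frame shift) + overlapping (overlap)
hop_length: frame shift; defaults to win_length/4 if not specified
win_length: each frame is windowed with a window of length win_length and then zero-padded to match n_fft; defaults to win_length = n_fft
window: string, tuple, number, function, or array of shape (n_fft,). Either a window specification (string, tuple, or number), a window function such as scipy.signal.windows.hann, or a vector or array of length n_fft
center: bool. If True, y is padded so that frame D[:, t] is centered at y[t*hop_length]; if False, frame D[:, t] starts at y[t*hop_length]
dtype: complex type of D, complex64 (64-bit complex) by default
pad_mode: padding mode used at the signal edges when center=True; reflection padding by default
Returns:
STFT matrix of shape (1 + n_fft/2, t)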

Feature extraction

Zero crossing rate (ZCR)

The zero crossing rate is the rate of sign changes along a signal, i.e., the rate at which the signal changes from positive to negative or back. This feature has been used heavily in both speech recognition and music information retrieval. It usually has higher values for highly percussive sounds like those in metal and rock.

Within each frame, the zero crossing rate measures how often the signal crosses zero, expressed as a fraction of the samples in the frame.
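
The definition can be checked on a synthetic tone. The following is a minimal NumPy sketch; the function name and the normalization by frame length are assumptions made for illustration, and librosa's own routine may treat exact zeros and frame edges differently.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs in frame whose signs differ."""
    signs = np.signbit(frame)
    crossings = np.count_nonzero(signs[1:] != signs[:-1])
    return crossings / len(frame)

sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 100 * t + 0.1)  # 100 Hz tone; the phase offset avoids exact zeros
rate = zero_crossing_rate(tone)
print(rate)
```

A 100 Hz sine crosses zero twice per period, so about 200 of the 8000 samples in one second are crossings and the rate is close to 0.025.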

# Zero crossings
start, end = 1300, 1500  # sample range to inspect

def ZCR_plot():
    plt.figure(figsize=(10, 6))
    plt.plot(info[start: end])
    plt.grid(True)
    plt.show()

ZCR_plot()

Count the zero crossings

n_zcr = librosa.zero_crossings(info[start: end], pad=False)
print('# of ZCR is {}'.format(sum(n_zcr)))

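To compute the zero crossing rate, use librosa.feature.zero_crossing_rate(), whose interface is

librosa.feature.zero_crossing_rate(y, frame_length=2048, hop_length=512, center=True)
Parameters:
y: audio time series
frame_length: frame length
hop_length: frame shift
center: bool; if True, frames are centered by padding the edges of y
Returns:
zcr: zcr[0, i] is the zero crossing rate of the i-th frame

print('ZCR is')
print(librosa.feature.zero_crossing_rate(info))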

Spectral centroid

It indicates where the center of mass for a sound is located and is calculated as the weighted mean of the frequencies present in the sound. If the frequencies in a piece of music are the same throughout, the spectral centroid stays around a center; if there are high frequencies at the end of the sound, the centroid moves towards its end.

The spectral centroid is the center of mass of the sound, i.e., the first moment of the spectrum. Its position indicates the frequency band around which the energy of that stretch of spectrum is concentrated.
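
The weighted-mean definition translates almost directly into NumPy. This is a single-frame sketch under simplifying assumptions (no windowing, one frame only); the function name is chosen for illustration and is not the librosa routine.

```python
import numpy as np

def spectral_centroid(frame, sr):
    """Magnitude-weighted mean frequency (Hz) of one frame."""
    magnitude = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return np.sum(freqs * magnitude) / np.sum(magnitude)

sr = 8000
t = np.arange(2048) / sr
low = np.sin(2 * np.pi * 500 * t)            # single 500 Hz tone
mixed = low + np.sin(2 * np.pi * 3000 * t)   # add an equally strong 3000 Hz tone
c_low = spectral_centroid(low, sr)
c_mixed = spectral_centroid(mixed, sr)
print(c_low, c_mixed)
```

Adding the 3000 Hz component pulls the centroid from about 500 Hz up to about 1750 Hz, the midpoint of the two equally strong tones.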

# Spectral centroid
from sklearn.preprocessing import minmax_scale

def spec_center():
    x = info[:80000]  # 80000 / 8000 = 10 seconds of data
    spec_centroids = librosa.feature.spectral_centroid(y=x, sr=sr)[0]
    frames = range(len(spec_centroids))
    t = librosa.frames_to_time(frames, sr=sr)  # time axis

    # Min-max normalization so the curve overlays the waveform
    def normalize(x, axis=0):
        return minmax_scale(x, axis=axis)

    dd.waveplot(x, sr=sr, alpha=0.4, label='wave')
    plt.plot(t, normalize(spec_centroids), color='r', linewidth=1, linestyle=':', label='spec_center')
    plt.legend()
    plt.show()

spec_center()

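spectral_centroid computes the spectral centroid of every frame
frames_to_time converts frame indices to times, so that time[i] is the time in seconds of frame[i] (frame[i] * hop_length / sr)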
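
The frame-to-time conversion is a single multiplication. The sketch below assumes the hop length of 512 that the feature functions above use by default; frames_to_seconds is an illustrative name, not the librosa function.

```python
import numpy as np

def frames_to_seconds(frames, sr, hop_length=512):
    """Time in seconds of each frame index: frame * hop_length / sr."""
    return np.asarray(frames) * hop_length / sr

times = frames_to_seconds([0, 1, 2, 125], sr=8000)
print(times)
```

At sr=8000 each frame advances by 512 / 8000 = 0.064 seconds, so frame 125 corresponds to 8 seconds.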

Spectral rolloff

Spectral rolloff is the frequency below which a specified percentage of the total spectral energy, e.g. 90%, lies.

Choose the frequency $f$ such that the spectral energy below $f$ reaches a set fraction of the total energy, such as 90% or 85% (empirical values)
$$\operatorname*{arg\,min}_{f_c \in \{1, \dots, N\}} \sum_{i=1}^{f_c} m_i \geq 0.85 \times \sum_{i=1}^{N} m_i$$
where $f_c$ is the rolloff frequency and $m_i$ is the magnitude of the $i$-th frequency component.
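
The rolloff condition can be evaluated with a cumulative sum. A minimal sketch on a toy five-bin spectrum follows; the function name and the example values are made up for illustration.

```python
import numpy as np

def spectral_rolloff(magnitude, freqs, roll_percent=0.85):
    """Lowest frequency at which the cumulative magnitude reaches roll_percent of the total."""
    cumulative = np.cumsum(magnitude)
    threshold = roll_percent * cumulative[-1]
    index = np.searchsorted(cumulative, threshold)  # first index with cumulative >= threshold
    return freqs[index]

freqs = np.array([0.0, 1000.0, 2000.0, 3000.0, 4000.0])
magnitude = np.array([1.0, 4.0, 3.0, 1.5, 0.5])  # toy spectrum, total = 10
rolloff_hz = spectral_rolloff(magnitude, freqs)
print(rolloff_hz)  # 3000.0
```

The cumulative sums are 1, 5, 8, 9.5, and 10, so the 85% threshold of 8.5 is first reached at the 3000 Hz bin.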

# Spectral rolloff
from sklearn.preprocessing import minmax_scale

def spec_rolloff():
    x = info[:160000]  # first 20 seconds
    spec_roll = librosa.feature.spectral_rolloff(y=x, sr=sr)[0]
    frames = range(len(spec_roll))
    t = librosa.frames_to_time(frames, sr=sr)  # time axis

    # Min-max normalization so the curve overlays the waveform
    def normalize(x, axis=0):
        return minmax_scale(x, axis=axis)

    dd.waveplot(x, sr=sr, alpha=0.4, label='wave')
    plt.plot(t, normalize(spec_roll), color='r', linewidth=1, linestyle=':', label='spec_roll')
    plt.legend()
    plt.show()

spec_rolloff()

MFCC (Mel-frequency cepstral coefficients)

This feature is one of the most important methods for extracting features from an audio signal and is used widely when working with audio. The mel frequency cepstral coefficients (MFCCs) of a signal are a small set of features (usually about 10-20) which concisely describe the overall shape of a spectral envelope.

MFCCs are an important audio feature: a compact set of coefficients that describes the envelope of the spectrum.

# MFCC
def mfcc_plot():
    x = info[:160000]  # first 20 seconds
    mfccs = librosa.feature.mfcc(y=x, sr=sr)
    print(mfccs.shape)
    dd.specshow(mfccs, sr=sr, x_axis='time')
    plt.show()

mfcc_plot()

The resulting mfccs.shape is (20, 313): each frame has a 20-dimensional MFCC vector, and there are 313 frames
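
The frame count follows from the signal length and the hop length. The sketch below assumes the default hop length of 512 and centered frames (the signal is padded at both ends); the helper name n_frames is illustrative.

```python
def n_frames(n_samples, hop_length=512):
    """Number of frames for a centered, padded signal: 1 + n_samples // hop_length."""
    return 1 + n_samples // hop_length

frames_20s = n_frames(160000)  # 20 seconds at 8000 Hz
print(frames_20s)  # 313
```

160000 / 512 = 312.5, which rounds down to 312 full hops plus the initial frame.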

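API reference

Resampling

Resample a time series from orig_sr to target_sr

y_hat = librosa.resample(y, orig_sr, target_sr, fix=True, scale=False)
Parameters:
y: audio time series, mono or stereo
orig_sr: original sampling rate of y
target_sr: target sampling rate
fix: bool, adjust the length of the resampled signal so that it is exactly len(y)/orig_sr*target_sr = t*target_sr
scale: bool, scale the resampled signal so that y and y_hat have approximately equal energy
Returns:
y_hat: resampled audio time series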

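Reading the duration

Compute the duration of a time series (in seconds)

librosa.get_duration(y=None, sr=22050, S=None, n_fft=2048, hop_length=512, center=True, filename=None)
Parameters:
y: audio time series
sr: sampling rate of y
S: STFT matrix, or any STFT-derived matrix (chromagram or Mel spectrogram)
n_fft: FFT window size of S
hop_length: number of audio samples between columns of S
center: bool; True if the frames of S are centered, False if they start at the left edge
filename: compute the duration from an audio file instead
Returns:
d: duration (in seconds)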

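Reading the sample rate

librosa.get_samplerate(path)

Writing audio

librosa.output.write_wav(path, y, sr, norm=False)

Writes the time series to a wav file. The librosa.output module was removed in librosa 0.8; with newer versions, soundfile.write(path, y, sr) is the usual replacement.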

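Waveform plot

Plot the amplitude envelope of a waveform

librosa.display.waveplot(y, sr=22050, x_axis='time', offset=0.0, ax=None)
Parameters:
y: audio time series
sr: sampling rate of y
x_axis: str in {'time', 'off', 'none'}, or None
offset: horizontal offset (in seconds) at which the waveform plot starts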

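Inverse short-time Fourier transform (ISTFT)

The inverse short-time Fourier transform (ISTFT) converts a complex-valued spectrogram matrix D(f, t) back into a time series y. The window function, frame shift, and other parameters must match those used for the STFT.

librosa.istft(stft_matrix, hop_length=None, win_length=None, window='hann', center=True, length=None)
Parameters:
stft_matrix: matrix produced by the STFT
hop_length: frame shift, win_length/4 by default
window: string, tuple, number, function, or array of shape (n_fft,). Either a window specification (string, tuple, or number), a window function such as scipy.signal.windows.hann, or a vector or array of length n_fft
center: bool. If True, D is assumed to have centered frames; if False, D is assumed to have left-aligned frames
length: if provided, the output y is zero-padded or clipped to exactly this many samples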

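Amplitude to dB

Convert an amplitude spectrogram to a dB-scaled spectrogram by taking the logarithm of S. The inverse function is librosa.db_to_amplitude(S).

librosa.amplitude_to_db(S, ref=1.0)
Parameters:
S: input amplitude
ref: reference value; the amplitude abs(S) is scaled relative to ref, 20*log10(S/ref)
Returns:
S in dB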

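Power to dB

Convert a power spectrogram (amplitude squared) to decibels (dB). The inverse function is librosa.core.db_to_power(S).

librosa.core.power_to_db(S, ref=1.0)
Parameters:
S: input power
ref: reference value; the power abs(S) is scaled relative to ref, 10*log10(S/ref)
Returns:
S in dB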
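
The two conversions differ only in the factor in front of the logarithm. The sketch below uses plain Python with illustrative function names; it leaves out the clipping of very small values and the dynamic-range limit that the librosa functions also apply.

```python
import math

def simple_amplitude_to_db(amplitude, ref=1.0):
    """20 * log10(amplitude / ref)."""
    return 20.0 * math.log10(amplitude / ref)

def simple_power_to_db(power, ref=1.0):
    """10 * log10(power / ref)."""
    return 10.0 * math.log10(power / ref)

amp = 0.5
db_from_amp = simple_amplitude_to_db(amp)
db_from_power = simple_power_to_db(amp ** 2)  # power is amplitude squared
print(db_from_amp, db_from_power)
```

Both calls give about -6.02 dB, because 10*log10(A**2) equals 20*log10(A).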

Power spectrogram example

import numpy as np

def power_demo():
    # example_audio_file was removed in recent librosa releases; librosa.ex('trumpet') loads a bundled clip instead
    y, sr = librosa.load(librosa.util.example_audio_file())
    S = np.abs(librosa.stft(y))
    # print(librosa.power_to_db(S**2))
    plt.figure()
    plt.subplot(2, 1, 1)
    dd.specshow(S**2, sr=sr, y_axis='log')
    plt.colorbar()
    plt.title('Power spectrogram')
    plt.subplot(2, 1, 2)
    # dB computed relative to the peak power
    dd.specshow(librosa.power_to_db(S**2, ref=np.max), sr=sr, y_axis='log', x_axis='time')
    plt.colorbar(format='%+2.0f dB')
    plt.title('Log Power spectrogram')
    plt.set_cmap('autumn')
    plt.tight_layout()
    plt.show()

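Spectrogram display

librosa.display.specshow(data, x_axis=None, y_axis=None, sr=22050, hop_length=512)
Parameters:
data: matrix to display
sr: sampling rate
hop_length: frame shift
x_axis, y_axis: axis types for the x and y axes
Frequency types
'linear', 'fft', 'hz': frequency range determined by the FFT window and the sampling rate
'log': the spectrum is displayed on a logarithmic scale
'mel': frequencies are determined by the mel scale
Time types
'time': markers are shown as milliseconds, seconds, minutes, or hours; values are plotted in units of seconds
's': markers are shown as seconds
'ms': markers are shown as milliseconds
All frequency types are plotted in units of Hz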

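# Spectrogram display
import numpy as np

def freq_demo():
    # example_audio_file was removed in recent librosa releases; librosa.ex('trumpet') loads a bundled clip instead
    y, sr = librosa.load(librosa.util.example_audio_file())
    plt.figure()
    D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
    plt.subplot(2, 1, 1)
    dd.specshow(D, y_axis='linear')
    plt.colorbar(format='%+2.0f dB')
    plt.title('Linear freq. Power Spec.')  # power spectrogram, linear frequency axis
    plt.subplot(2, 1, 2)
    dd.specshow(D, y_axis='log')
    plt.colorbar(format='%+2.0f dB')
    plt.title('Log freq. Power Spec.')  # power spectrogram, logarithmic frequency axis
    plt.show()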

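Mel filter bank

librosa.filters.mel(sr, n_fft, n_mels=128, fmin=0.0, fmax=None, htk=False, norm=1)
Parameters:
sr: sampling rate of the input signal
n_fft: number of FFT components
n_mels: number of Mel bands to generate
fmin: lowest frequency (in Hz)
fmax: highest frequency (in Hz). If None, fmax = sr / 2.0 is used
norm: {None, 1, np.inf} [scalar]
If 1, the triangular mel weights are divided by the width of the mel band (area normalization). Otherwise, all triangles keep a peak value of 1.0
Returns:
Mel transform matrix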
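
To show how the band edges of such a filter bank are spaced, here is a NumPy sketch that uses the HTK-style mel formula. This is an assumption made for brevity: with htk=False, as in the signature above, librosa uses a different (Slaney-style) mel formula, so the numbers will not match librosa's exactly. The function names are illustrative.

```python
import numpy as np

def hz_to_mel_htk(f_hz):
    """HTK-style mel scale: 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz) / 700.0)

def mel_to_hz_htk(mel):
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (np.asarray(mel) / 2595.0) - 1.0)

sr, n_mels = 22050, 128
# n_mels triangular filters need n_mels + 2 edge points, equally spaced in mel
mel_edges = np.linspace(hz_to_mel_htk(0.0), hz_to_mel_htk(sr / 2.0), n_mels + 2)
hz_edges = mel_to_hz_htk(mel_edges)
print(hz_edges[:4], hz_edges[-4:])
```

The edges are evenly spaced in mel but increasingly far apart in Hz, which is why the filters are narrow at low frequencies and wide at high frequencies.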

# Mel filter bank
def mel_filter_demo():
    melfb = librosa.filters.mel(sr=22050, n_fft=2048)
    plt.figure()
    dd.specshow(melfb, x_axis='linear')
    plt.ylabel('Mel filter')
    plt.title('Mel filter bank')
    plt.colorbar()
    plt.tight_layout()
    plt.show()

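Mel-scaled spectrogram

librosa.feature.melspectrogram(y=None, sr=22050, S=None, n_fft=2048, hop_length=512, win_length=None, window='hann',
center=True, pad_mode='reflect', power=2.0)
If a spectrogram input S is provided, it is mapped directly onto the Mel basis by mel_f.dot(S).
If a time-series input y, sr is provided, its magnitude spectrogram S is computed first and then mapped onto the Mel scale by mel_f.dot(S ** power). By default, power=2 operates on the power spectrum.
Parameters:
y: audio time series
sr: sampling rate
S: spectrogram
n_fft: length of the FFT window
hop_length: frame shift
win_length: length of the window; defaults to win_length = n_fft
window: string, tuple, number, function, or array of shape (n_fft,)
a window specification (string, tuple, or number), see scipy.signal.get_window
a window function, such as scipy.signal.windows.hann
a vector or array of length n_fft
center: bool
If True, the signal y is padded so that frame t is centered at y[t * hop_length].
If False, frame t begins at y[t * hop_length]
power: exponent applied to the magnitude spectrogram, e.g. 1 for energy, 2 for power
n_mels: number of Mel filters, 128 by default
fmax: highest frequency
Returns:
Mel spectrogram of shape (n_mels, t)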

# Mel-scaled spectrogram
import numpy as np

def mel_scaled_demo():
    # example_audio_file was removed in recent librosa releases; librosa.ex('trumpet') loads a bundled clip instead
    y, sr = librosa.load(librosa.util.example_audio_file())
    # Method 1: Mel spectrogram from the time series
    print(librosa.feature.melspectrogram(y=y, sr=sr))
    # array([[  2.891e-07,   2.548e-03, ...,   8.116e-09,   5.633e-09],
    #        [  1.986e-07,   1.162e-02, ...,   9.332e-08,   6.716e-09],
    #        ...,
    #        [  3.668e-09,   2.029e-08, ...,   3.208e-09,   2.864e-09],
    #        [  2.561e-10,   2.096e-09, ...,   7.543e-10,   6.101e-10]])
    # Method 2: Mel spectrogram from the STFT power spectrogram
    D = np.abs(librosa.stft(y)) ** 2  # STFT power spectrogram
    S = librosa.feature.melspectrogram(S=D)
    plt.figure(figsize=(10, 4))
    dd.specshow(librosa.power_to_db(S, ref=np.max), y_axis='mel', fmax=8000, x_axis='time')
    plt.colorbar(format='%+2.0f dB')
    plt.title('Mel spectrogram')
    plt.tight_layout()
    plt.show()

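Extracting log-Mel spectrogram features

Because image-style neural networks such as CNNs perform so well, spectrogram features of audio signals are now used even more than MFCCs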

# Log-Mel spectrogram
def log_mel_spec_demo():
    y, sr = librosa.load('audio/aud_1.mp3', sr=8000, duration=180)
    # Extract the Mel spectrogram feature
    melspec = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=512, n_mels=128)
    # melspectrogram returns a power spectrogram (power=2.0), so convert with power_to_db
    logmelspec = librosa.power_to_db(melspec)
    print(logmelspec.shape)  # (128, 2813)

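The log-Mel spectrogram feature is a two-dimensional array: 128 is the Mel-frequency dimension (frequency domain) and 2813 is the number of time frames (time domain), so the log-Mel spectrogram is a time-frequency feature of the audio. In the interface,
n_fft is the window size
hop_length is the distance between adjacent windows
n_mels is the number of mel bands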

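MFCC coefficients

MFCCs are widely used features in automatic speech recognition and speaker recognition

librosa.feature.mfcc(y=None, sr=22050, S=None, n_mfcc=20, dct_type=2, norm='ortho', **kwargs)
Parameters:
y: audio data
sr: sampling rate
S: np.ndarray, log-power Mel spectrogram
n_mfcc: number of MFCCs to return
dct_type: None, or {1, 2, 3}; type of discrete cosine transform (DCT), type 2 by default
norm: None or 'ortho'; if dct_type is 2 or 3, norm='ortho' uses an orthonormal DCT basis. Normalization is not supported for dct_type=1
Returns:
M: MFCC sequence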
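
The role of dct_type and norm can be illustrated by applying an orthonormal type-2 DCT to a log-Mel spectrogram and keeping the first n_mfcc rows. The NumPy sketch below assumes that this is the final step of the MFCC computation; the input is random stand-in data rather than real audio, and dct2_ortho is an illustrative helper, not a librosa function.

```python
import numpy as np

def dct2_ortho(x):
    """Orthonormal DCT-II along axis 0 of an (n_mels, n_frames) array."""
    n = x.shape[0]
    k = np.arange(n)[:, None]      # output (coefficient) index
    m = np.arange(n)[None, :]      # input (mel band) index
    basis = np.cos(np.pi * (m + 0.5) * k / n) * np.sqrt(2.0 / n)
    basis[0, :] /= np.sqrt(2.0)    # rescale the DC row so the basis is orthonormal
    return basis @ x

rng = np.random.default_rng(0)
log_mel = rng.normal(size=(128, 5))   # stand-in for a log-power Mel spectrogram
n_mfcc = 20
mfcc = dct2_ortho(log_mel)[:n_mfcc]   # keep only the lowest n_mfcc coefficients
print(mfcc.shape)  # (20, 5)
```

Keeping only the low-order coefficients retains the smooth shape of each frame's spectrum (the envelope) and discards fine detail.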

def mfcc_demo():
    y, sr = librosa.load('audio/aud_1.mp3', sr=8000, duration=180)
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
    print(mfccs.shape)  # (40, 2813)