Freeloading: Generating Subtitle Files with the iFlytek Input Method (讯飞输入法)

Updated: 2024-10-11 07:27:54

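Project overview

The free speech-to-text/subtitle websites have recently started charging, and as an individual user I really don't want to pay for computation. I searched everywhere for an open-source speech recognition project without success. Then I came across 孙亖's CSDN blog and followed his approach to build this project.

The general idea is to split the audio into segments, feed each short segment into the iFlytek input method as voice input, then copy the recognized text and use it to generate an SRT subtitle file.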
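For reference, the target format is simple: each SRT cue consists of an index line, a time range line of the form `HH:MM:SS,mmm --> HH:MM:SS,mmm`, the text, and a blank line. The following is a minimal sketch of that layout; the helper names `format_srt_time` and `build_srt` are made up for illustration and are not part of the scripts in this post.

```python
def format_srt_time(ms):
    """Convert milliseconds to an SRT timestamp such as '00:00:39,770'."""
    ms = int(round(ms))
    hours, rem = divmod(ms, 3600000)
    minutes, rem = divmod(rem, 60000)
    seconds, millis = divmod(rem, 1000)
    return '{:02d}:{:02d}:{:02d},{:03d}'.format(hours, minutes, seconds, millis)


def build_srt(cues):
    """Render (start_ms, end_ms, text) tuples as SRT text.

    Each cue becomes: index line, time range line, text line,
    and cues are separated by one blank line.
    """
    blocks = []
    for index, (start_ms, end_ms, text) in enumerate(cues, start=1):
        blocks.append('{}\n{} --> {}\n{}\n'.format(
            index, format_srt_time(start_ms), format_srt_time(end_ms), text.strip()))
    return '\n'.join(blocks)
```

Calling `build_srt([(0, 1500, 'Hello'), (39770, 41000, 'World')])` yields two cues separated by a blank line, with the second one starting at `00:00:39,770`.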

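Tools used

1. MUMU Android emulator

2. iFlytek input method (讯飞输入法), Android version

3. 大象笔记本, a note-taking app (downloaded from 应用宝 inside the Android emulator)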

4.pyautogui

5.pydub

6.python3.6

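Results

The Python script generates an SRT subtitle file from a specified audio file.

Demo: automatically generating subtitles with the homemade script

Implementation steps

1. Splitting the audio

Use pydub to split the audio. For the installation procedure, refer to the installation section of the pydub documentation, which covers installing audio playback support and setting up ffmpeg. After setting the environment variables, restart the computer once.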

from pydub import AudioSegment
from pydub.silence import detect_nonsilent
from pydub.playback import play
import time, datetime
import os


# convert milliseconds to an SRT timestamp such as '00:00:39,770'
def format_time(ms):
    ms = int(round(ms))
    hours, rem = divmod(ms, 3600000)
    minutes, rem = divmod(rem, 60000)
    seconds, millis = divmod(rem, 1000)
    return '{:02d}:{:02d}:{:02d},{:03d}'.format(hours, minutes, seconds, millis)


# audio dir
file_path = "4. 雅思模考卷1S4.mp3"
file_suffix = os.path.splitext(file_path)[-1][1:]
print("file path:",file_path,"suffix",file_suffix)

sound = AudioSegment.from_file(file_path, file_suffix)

time.sleep(0.5)
print("start")
# adapt parameter
idx = 0
min_silence_len = 500
previous_end = 0
timestamp_list = detect_nonsilent(sound, min_silence_len, sound.dBFS * 1.3, 10)
for i in range(len(timestamp_list)):
    d = timestamp_list[i][1] - timestamp_list[i][0]
    a = timestamp_list[i][0]
    b = timestamp_list[i][1]
    # 1-based segment index
    idx += 1
    # timestamp line for this segment
    index_time = '{1} --> {2}'.format(idx, format_time(a), format_time(b))
    print(index_time, "duration is:", d, 'ms')
    # pad the segment with part of the surrounding silence so speech is not clipped
    start = max(0, a - min_silence_len // 2, previous_end)
    if i == len(timestamp_list)-1:
        end = min(len(sound),b+min_silence_len)
    else:
        end = min(timestamp_list[i+1][0],b+min_silence_len)
    play(sound[start: end])
    time.sleep(2)
    previous_end = b
print('dBFS: {0}, max_dBFS: {1}, duration: {2}, split: {3}'.format(round(sound.dBFS,2),round(sound.max_dBFS,2),
                                                                   sound.duration_seconds,len(timestamp_list)))

print('audio time:',str(datetime.timedelta(milliseconds=len(sound)) ) )

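The key function is:

detect_nonsilent(audio_segment, min_silence_len=1000, silence_thresh=-16, seek_step=1)
The audio-splitting function


This function returns a list of [start, end] time ranges in ms, one for each non-silent stretch (the ranges, not the audio clips themselves). audio_segment is the audio to process. min_silence_len is the minimum length of a silent stretch in ms, which is also the length of the window examined at each step. silence_thresh is the level below which a window counts as silence; it is given in dBFS and is a negative number. seek_step is the step in ms between two consecutive windows.

The function computes the root mean square (RMS) of the audio within a window of min_silence_len and compares it with silence_thresh. If the RMS is below the threshold, that window is treated as silence. The window then slides forward by seek_step and the check is repeated. Once all the silent stretches have been found, the non-silent ranges between them give the split points for the whole recording.

The smaller min_silence_len is, the more segments the audio is split into; the larger silence_thresh is (that is, the closer to 0), the more segments there are.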
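The sliding-window idea described above can be sketched in plain Python. This is a simplified illustration, not pydub's actual implementation: it works on a list of raw 16-bit sample values and measures positions in samples rather than milliseconds, and the names `rms_dbfs`, `find_silence` and `find_nonsilent` are made up for this example.

```python
import math


def rms_dbfs(samples, full_scale=32768):
    """RMS level of a window of samples in dBFS (0 dBFS = full scale)."""
    if not samples:
        return float('-inf')
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms / full_scale) if rms > 0 else float('-inf')


def find_silence(samples, window, thresh_db, step):
    """Return merged [start, end] ranges covered by silent windows."""
    ranges = []
    for start in range(0, len(samples) - window + 1, step):
        if rms_dbfs(samples[start:start + window]) <= thresh_db:
            if ranges and start <= ranges[-1][1]:
                # this window touches or overlaps the previous silent range
                ranges[-1][1] = start + window
            else:
                ranges.append([start, start + window])
    return ranges


def find_nonsilent(samples, window, thresh_db, step):
    """Invert the silent ranges to get the non-silent ranges."""
    result, prev_end = [], 0
    for start, end in find_silence(samples, window, thresh_db, step):
        if start > prev_end:
            result.append([prev_end, start])
        prev_end = end
    if prev_end < len(samples):
        result.append([prev_end, len(samples)])
    return result
```

For a toy signal made of 100 loud samples, 100 zero samples and 100 loud samples, `find_nonsilent(signal, window=50, thresh_db=-40, step=10)` returns the two loud ranges. Shrinking `window` lets shorter pauses qualify as silence, which is why a smaller min_silence_len produces more segments.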

2. Automating the keyboard and mouse

Use pyautogui to perform mouse clicks, in particular to press the voice-input key of the iFlytek input method.

import pyautogui
import time
import pyperclip

space_loc = (239, 927)
time.sleep(2)
print("start")
print("current location:",pyautogui.position())

pyautogui.moveTo(space_loc[0],space_loc[1])
time.sleep(1)
pyautogui.click()
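pyautogui.press('a')  # type a single 'a' key press
# pyautogui.click()
pyperclip.copy('要输入的汉字')  # put the Chinese text to be entered on the clipboard first
pyperclip.paste()  # note: paste() only returns the clipboard text, it does not type it into the app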
pyautogui.mouseDown()
time.sleep(4)
pyautogui.mouseUp()

Of course, you need to use this line from the script

print("current location:",pyautogui.position())

to print out the position of the voice-input key.
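A single printout is easy to miss, so one option is to poll the pointer position a few times while hovering over the key. The sketch below is only an illustration: `sample_positions` is a hypothetical helper, and it takes the position function as an argument so it can be tried without a display. Passing `pyautogui.position` reads the real mouse pointer.

```python
import time


def sample_positions(get_position, samples=5, delay=1.0):
    """Poll get_position() several times and return the readings.

    get_position is any zero-argument callable returning (x, y);
    pass pyautogui.position to read the real mouse pointer.
    """
    readings = []
    for _ in range(samples):
        x, y = get_position()
        readings.append((x, y))
        print('current location:', (x, y))
        time.sleep(delay)
    return readings


# Typical use (requires a desktop session and pyautogui installed):
#   import pyautogui
#   sample_positions(pyautogui.position, samples=5, delay=1.0)
```

Hover over the voice-input key while it runs and copy the last reading into `space_loc`.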

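Note that PyCharm must be started with administrator privileges for the mouse clicks to work. It is also best not to run word-capture software such as lingos, to avoid clipboard conflicts.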

3. Putting it together to generate the SRT file

'''
function: convert the audio to text
note:
1. start PyCharm with the administrator role
2. put MUMU at the top left of the screen
3. set the input device to stereo in Windows
4. open the ifly input method in MUMU in advance
requirement:
1. MUMU Android simulator
2. install yinxiang and the ifly input method in MUMU
3. install pydub, refer to https://github.com/jiaaro/pydub
4. pip install pyautogui
limited:
1. audio length should be less than 24 hours
2. the lingos app and other word-capture apps may cause typing errors
'''
from pydub import AudioSegment
from pydub.silence import detect_nonsilent
from pydub.playback import play
import pyautogui
import pyperclip
import time, datetime
import os


# convert milliseconds to an SRT timestamp such as '00:00:39,770'
def format_time(ms):
    ms = int(round(ms))
    hours, rem = divmod(ms, 3600000)
    minutes, rem = divmod(rem, 60000)
    seconds, millis = divmod(rem, 1000)
    return '{:02d}:{:02d}:{:02d},{:03d}'.format(hours, minutes, seconds, millis)


# key coordinates when the emulator window sits at the top left of the screen
space_loc = (239, 927)
enter_loc = (484, 936)
edit_loc = (310, 607)
select_all_loc = (150, 918)
cut_loc = (374, 919)
back_loc = (479, 922)
# audio dir
file_path = "2. 雅思模考卷1S2.mp3"
file_suffix = os.path.splitext(file_path)[-1][1:]
print("file path:",file_path,"suffix",file_suffix)
# write a file
srt_file = os.path.splitext(file_path)[0]+'.srt'
f = open(file=srt_file, mode="w",encoding='utf8')
sound = AudioSegment.from_file(file_path, file_suffix)

start_time = time.localtime()
print("start",time.strftime('%H:%M:%S',start_time))

idx = 0
min_silence_len = 500
previous_end = 0
timestamp_list = detect_nonsilent(sound, 500, sound.dBFS * 1.3, 10)
for i in range(len(timestamp_list)):
    d = timestamp_list[i][1] - timestamp_list[i][0]
    a = timestamp_list[i][0]
    b = timestamp_list[i][1]
    # srt file's index
    idx +=1
    # pad the segment with part of the surrounding silence so speech is not clipped
    start = max(0, a - min_silence_len / 2, previous_end)
    if i == len(timestamp_list) - 1:
        end = min(len(sound), b + min_silence_len)
    else:
        end = min(timestamp_list[i + 1][0], b + min_silence_len)
    previous_end = b
    # input index and timestamp
    index_time = '{0}\n{1} --> {2}\n'.format(idx, format_time(start), format_time(end))
    # press space
    pyautogui.moveTo(space_loc[0], space_loc[1])
    pyautogui.mouseDown()
    time.sleep(0.05)
    play(sound[start: end])
    time.sleep(0.05)
    pyautogui.mouseUp()
    time.sleep(0.5)
    # cut
    delay_time = 1 #second
    pyautogui.click(edit_loc[0], edit_loc[1])
    time.sleep(delay_time)
    pyautogui.click(select_all_loc[0], select_all_loc[1])
    time.sleep(delay_time)
    pyautogui.click(cut_loc[0], cut_loc[1])
    time.sleep(delay_time)
    pyautogui.click(back_loc[0], back_loc[1])
    time.sleep(delay_time)
    text = pyperclip.paste()
    f.write(index_time + text.strip() + '\n\n')  # a blank line separates SRT cues
    print("Section is :", timestamp_list[i], "duration is:", d,'text:',text)
f.close()
# end
end_time = time.localtime()
print('end',time.strftime('%H:%M:%S',end_time),'processing time:',
      format_time(1000*(time.mktime(end_time)-time.mktime(start_time) ) ),
      'audio time:',str(datetime.timedelta(milliseconds=len(sound)) ) )

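Conclusion

All that effort, and the conversion takes 150% of the audio duration: the total time spent is 50% longer than the audio itself. That is the first shortcoming. The audio segmentation also still feels a bit off: two consecutive subtitle lines sometimes overlap, which could be adjusted later.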
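One possible cause of the overlap, judging from the script above, is that `previous_end` stores the raw end `b` of a segment rather than its padded end, so the next segment's padded start can fall before the previous segment's padded end. The following is a sketch of one way to avoid that, under that assumption; `pad_ranges` and its parameter names are made up for illustration, with defaults mirroring the script's `min_silence_len / 2` and `min_silence_len`.

```python
def pad_ranges(ranges, total_len, pad_before=250, pad_after=500):
    """Pad each [start, end] range (in ms) without letting neighbours overlap."""
    padded = []
    previous_end = 0
    for i, (a, b) in enumerate(ranges):
        # never start before the previous padded segment ended
        start = max(0, a - pad_before, previous_end)
        # never run past the start of the next raw segment (or the audio end)
        next_start = ranges[i + 1][0] if i + 1 < len(ranges) else total_len
        end = min(next_start, b + pad_after)
        padded.append((start, end))
        previous_end = end  # track the padded end, not the raw end b
    return padded
```

With `ranges = [[1000, 2000], [2600, 3000]]` and a 5000 ms recording, the first cue ends at 2500 ms and the second starts at 2500 ms instead of 2350 ms, so the two no longer overlap.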

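This may look like abuse of the iFlytek input method, but the throughput is in fact very low. Besides, I am just an individual user without much audio to process, so it should not put much load on the input method.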


Published: 2023-06-13 21:02:00
Link to this article: https://www.elefans.com/category/jswz/34/1405426.html