whisper + speaker-diarization-3.1: transcribing speech by speaker

见贤思齐 · Posted 2024-9-10 12:57:33

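The main purpose of this post is to review the first piece of code I ever deployed locally. It started as an assignment from my advisor: build a speech transcription model for educational settings. The task landed on me, a complete beginner... so I gritted my teeth and got on with it. This post mostly records the problems I ran into and the changes I made to some of the code. Feel free to borrow whatever is useful.

I. Choosing a speech-to-text model

When I first searched for speech-to-text models, the main candidates were whisper, Kaldi, and funASR. I implemented whisper and funASR. whisper comes in several sizes, including base, medium, large-v1, large-v2, and large-v3. I chose large-v2, although medium already transcribes well enough for everyday needs. (large-v3 did not work well for me and also raised errors. When I saw that most people seemed to find v3 worse than v2, I settled on v2.) funASR is also good, and it adds punctuation automatically, so pick whichever suits you.

[Figure: funASR output]
[Figure: whisper large-v2 output]

(For the specific differences between whisper and funASR, see "Python环境下的语音转文本：Whisper与FunASR", 百度开发者中心 (baidu.com).)

Both models are fairly simple to get running: set up the environment as required and install the libraries that need installing. Nothing else was particularly thorny. One funASR problem did cost me a while: funasr had no 'rich_transcription_postprocess'. The fix is simply to update funasr. However, I had earlier downloaded the code directly from GitHub, it was not the latest version, and it was sitting in my PyCharm project, so updating had no effect (presumably the old local copy was still the one being imported). After I deleted the previously downloaded copy and ran again, it worked.

II. Speaker diarization: pyannote/speaker-diarization

Most of the trouble this time came from implementing speaker diarization. It seems that most solutions are built on the pyannote/speaker-diarization model; you can look around for other models that provide the same function. The pyannote/speaker-diarization model is downloaded from Hugging Face. I hit quite a few problems getting it to work, and it took a long time.

1. Download permission: applying for a Hugging Face token

First you must register a Hugging Face account and create an access token. In my case the token had to be of the write type. Otherwise I later got:

    huggingface_hub.utils._errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection

(Pay attention here. I lost I don't know how much time on this, and the fix turned out to be this simple.)

2. Solving the network problem

Even with a token, the download itself can still fail because of the network. For this, see the blog post below. (I tried most of the other methods; this is the one that worked well.)

"解决huggingface下载连接不稳定导致ConnectError问题", 七三七3, 博客园 (cnblogs.com)

3. Test code

You can copy the code straight from Hugging Face or from ModelScope (魔搭); the code is much the same either way. (The ModelScope example code is a bit opaque. I suggest working directly from Hugging Face, since ModelScope calls Hugging Face anyway.)

    import os

    # Set the mirror endpoint before importing pyannote, because
    # huggingface_hub reads HF_ENDPOINT when it is first imported.
    os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

    from pyannote.audio import Pipeline

    # instantiate the pipeline
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.0",
        use_auth_token="YOUR_HF_TOKEN",  # replace with your own token
    )

    # run the pipeline on an audio file
    diarization = pipeline("./yinp.mp3")
    print(type(diarization))
    print(diarization)

    # # dump the diarization output to disk using RTTM format
    # with open("audio.rttm", "w") as rttm:
    #     diarization.write_rttm(rttm)

III. Integrating whisper + pyannote/speaker-diarization

I found the integration code online; see the blog post "【ASR代码】基于pyannote和whisper的语音识别代码_pyannoteai-CSDN博客". Some blogs package this integration code as a separate library called pyannote_whisper. You can simply comment out that import and paste the integration code in directly (the only real issue is the import). The code below is my own modified version; adapt it to your needs.

    import os

    # Set the mirror endpoint before any Hugging Face code is imported.
    os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

    import pickle
    import time

    import torch
    # My project used a local copy (import model.whispers.whisper as whisper);
    # with the pip-installed openai-whisper package this is just:
    import whisper
    from zhconv import convert
    from pyannote.audio import Pipeline
    from pyannote.core import Annotation, Segment

    file = "umz7m-x8yym.mp3"


    def get_text_with_timestamp(transcribe_res):
        """Turn whisper segments into (Segment, text) pairs."""
        print(transcribe_res)
        timestamp_texts = []
        for item in transcribe_res["segments"]:
            start = item["start"]
            end = item["end"]
            # text = convert(item["text"], 'zh-cn').strip()
            text = item["text"]
            timestamp_texts.append((Segment(start, end), text))
        return timestamp_texts


    def add_speaker_info_to_text(timestamp_texts, ann):
        """Label each text segment with the speaker who talks longest inside it."""
        spk_text = []
        for seg, text in timestamp_texts:
            print(ann.crop(seg))
            spk = ann.crop(seg).argmax()  # None if nobody speaks in seg
            spk_text.append((seg, spk, text))
        return spk_text


    def merge_cache(text_cache):
        """Join a run of same-speaker segments into one (Segment, spk, sentence)."""
        sentence = ''.join([item[-1] for item in text_cache])
        spk = text_cache[0][1]
        start = round(text_cache[0][0].start, 1)
        end = round(text_cache[-1][0].end, 1)
        return Segment(start, end), spk, sentence


    PUNC_SENT_END = ['.', '?', '!', "。", "？", "！"]  # no longer used in my version


    def merge_sentence(spk_text):
        """Merge consecutive segments of the same speaker, skipping repeated text."""
        merged_spk_text = []
        pre_spk = None
        text_cache = []
        for seg, spk, text in spk_text:
            if spk != pre_spk and len(text_cache) > 0:
                # Speaker changed: flush what we have and start a new run.
                merged_spk_text.append(merge_cache(text_cache))
                text_cache = [(seg, spk, text)]
                pre_spk = spk
            elif text_cache and spk == pre_spk and text == text_cache[-1][2]:
                # Same speaker repeating the previous text: drop the duplicate.
                # (The text_cache check avoids an IndexError when the very
                # first segment has no speaker, i.e. spk is None.)
                print(text_cache[-1][2])
                continue
            else:
                text_cache.append((seg, spk, text))
                pre_spk = spk
        if len(text_cache) > 0:
            merged_spk_text.append(merge_cache(text_cache))
        return merged_spk_text


    def diarize_text(transcribe_res, diarization_result):
        timestamp_texts = get_text_with_timestamp(transcribe_res)
        spk_text = add_speaker_info_to_text(timestamp_texts, diarization_result)
        res_processed = merge_sentence(spk_text)
        # res_processed = spk_text
        return res_processed


    # def write_to_txt(spk_sent, file):
    #     with open(file, 'w') as fp:
    #         for seg, spk, sentence in spk_sent:
    #             line = f'{seg.start:.2f} {seg.end:.2f} {spk} {sentence}\n'
    #             fp.write(line)

    # def format_time(seconds):
    #     # compute hours, minutes, and seconds
    #     hours = seconds // 3600
    #     minutes = (seconds % 3600) // 60
    #     seconds = seconds % 60
    #     # formatted output
    #     return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


    if __name__ == "__main__":
        sd_config_path = "./speaker-diarization-3.1/config.yaml"
        asr_model = whisper.load_model("large-v2")
        asr_model.to(torch.device("cuda"))
        speaker_diarization = Pipeline.from_pretrained(
            sd_config_path,
            use_auth_token="YOUR_HF_TOKEN",  # replace with your own token
        )
        speaker_diarization.to(torch.device("cuda"))

        # files = os.listdir("/root/autodl-tmp/Fun19/audios")
        # for file in files:
        start_time = time.time()
        print(file)
        dialogue_path = "./audios_txt/" + file.split(".")[0] + ".pkl"
        audio = "./audios_wav/" + file
        asr_result = asr_model.transcribe(audio, initial_prompt="随便")
        asr_time = time.time()
        print("ASR time:" + str(asr_time - start_time))

        diarization_result: Annotation = speaker_diarization(audio)
        final_result = diarize_text(asr_result, diarization_result)

        dialogue = []
        for segment, spk, sent in final_result:
            content = {'speaker': spk, 'start': segment.start,
                       'end': segment.end, 'text': sent}
            dialogue.append(content)
            print("[%.2fs -> %.2fs] %s %s" % (segment.start, segment.end, spk, sent))
        end_time = time.time()
        # print(file + " spend time:" + str(end_time - start_time))
        # with open(dialogue_path, 'wb') as f:
        #     pickle.dump(dialogue, f)

IV. Reflections and closing words

If you don't want to go to all this trouble, you can simply call the iFLYTEK (科大讯飞) API. I think its results are quite good, it offers various options, and it can even transcribe in real time. (But my advisor would not let me call an API and insisted on an open-source implementation.) For an implementation, see "基于讯飞接口的语音识别（python）_python用webapi调用讯飞语音识别方法-CSDN博客".

The best results I saw actually came from Tongyi Tingwu (通义听悟), which is made by Alibaba. It analyzes the speech in a video you upload, transcribes it by speaker with very high accuracy, and is backed by a large language model. If all you need is to analyze the speech in a video and you have no other requirements, I strongly recommend it. 通义 (aliyun.com)

If you want to do speech analysis yourself, the main places to look for models are Hugging Face (Models - Hugging Face) and ModelScope (模型库首页 · 魔搭社区 (modelscope.cn)).

With that, this stage of my task is done. Nobody else in my group works on speech, so the senior students could not help and I had to figure things out on my own. My advisor says to optimize step by step and to understand which features of the speech these models extract. That gets into deep learning and neural networks, which I don't know well yet. I'll learn it gradually!

Finally, a reminder to myself and encouragement for everyone: keep a good attitude, because any problem that exists can be solved. And believe in yourself. There may be things we don't know yet, but there is nothing we can't learn! As the saying goes, the boat will straighten itself when it reaches the bridge (船到桥头自然直)!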
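
Appendix: a library-free sketch of the merging idea from Section III. The integration code labels each whisper segment with the speaker who talks longest inside it and then merges consecutive segments from the same speaker. The sketch below reproduces only that idea in plain Python, so it can be tried without a GPU or any model download. It is an illustration under stated assumptions, not the pyannote implementation: the function names (assign_speaker, merge_by_speaker), the tuple layout, and the sample data are all made up for this example.

```python
from collections import defaultdict


def assign_speaker(seg, turns):
    """Return the speaker whose turns overlap seg = (start, end) longest, or None."""
    start, end = seg
    overlap = defaultdict(float)
    for t_start, t_end, speaker in turns:
        shared = min(end, t_end) - max(start, t_start)
        if shared > 0:
            overlap[speaker] += shared
    return max(overlap, key=overlap.get) if overlap else None


def merge_by_speaker(asr_segments, turns):
    """Label each (start, end, text) segment, then merge same-speaker runs."""
    merged = []
    for start, end, text in asr_segments:
        spk = assign_speaker((start, end), turns)
        if merged and merged[-1]["speaker"] == spk:
            # Same speaker as the previous segment: extend the current run.
            merged[-1]["end"] = end
            merged[-1]["text"] += text
        else:
            merged.append({"speaker": spk, "start": start, "end": end, "text": text})
    return merged


# Hypothetical data: diarization turns and ASR segments, times in seconds.
turns = [(0.0, 4.0, "SPEAKER_00"), (4.0, 9.0, "SPEAKER_01")]
asr_segments = [
    (0.0, 2.0, "Good morning."),
    (2.0, 4.5, " Open your books."),
    (4.5, 8.0, " Which page?"),
]

result = merge_by_speaker(asr_segments, turns)
for row in result:
    print("[%.2fs -> %.2fs] %s %s" % (row["start"], row["end"], row["speaker"], row["text"]))
```

The second segment straddles the speaker change at 4.0 s but overlaps SPEAKER_00 for longer, so it is merged into the first speaker's sentence. This is the same trade-off the real pipeline makes: a whisper segment is never split, so a short interjection by another speaker inside one segment is attributed to the dominant speaker.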
