How to web scrape YouTube transcripts with Beautifulsoup4 and Python 3

Updated: 2024-10-28 10:35:38

Here is my current code. I am not sure what I am doing wrong. Maybe I am not digging deep enough into the HTML and giving Beautifulsoup the right tags? At the moment, my code returns blanks.

from bs4 import BeautifulSoup
from urllib.request import urlopen

html = urlopen("https://www.youtube.com/watch?v=5_zrHZdhaBU")
soup = BeautifulSoup(html, 'html.parser')
nameList = soup.findAll("div", {"id": "cp-2"})
for name in nameList:
    print(name.get_text())

Here is the HTML that I inspected. I'm trying to get Python to return "but it was untucked."

<div id="cp-2" class="caption-line" data-time="7.54"><div class="caption-line-time">0:07</div><div class="caption-line-text">but it was untucked.</div></div>

***Edit

The transcript can be found by clicking "more" next to the share button. Then click on transcripts and you will see all the text there.

Best answer

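Oh yes, the transcript is loaded via Ajax, so it is not in the HTML that urlopen receives. To see the request: open the page, open the Network tab in your browser's developer tools, sort requests by start time (latest requests first), and click the CC button on YouTube.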

You will see an api/timedtext request whose response is XML. Here is the full URL to the transcript:
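
Since the response is XML rather than the HTML the question was searching, the caption text has to be pulled out of XML elements instead of div tags. Below is a minimal sketch of that step. It assumes the response looks like a transcript root holding text elements with start and dur attributes; that shape, the SAMPLE_XML string, and the parse_transcript helper are illustrative assumptions, not something confirmed above. It uses the standard library's xml.etree.ElementTree in place of Beautifulsoup so that it runs without extra dependencies.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical sample; the real api/timedtext response may be shaped differently.
SAMPLE_XML = """<transcript>
  <text start="7.54" dur="2.1">but it was untucked.</text>
  <text start="9.64" dur="1.8">I didn&amp;#39;t notice.</text>
</transcript>"""


def parse_transcript(xml_text):
    """Return a list of (start_seconds, caption_text) tuples."""
    root = ET.fromstring(xml_text)
    lines = []
    for node in root.iter("text"):
        start = float(node.get("start", "0"))
        # Caption text may still carry HTML entities after XML parsing.
        text = html.unescape(node.text or "")
        lines.append((start, text))
    return lines


if __name__ == "__main__":
    for start, text in parse_transcript(SAMPLE_XML):
        print(f"{start:7.2f}  {text}")
```

If the response uses a different layout (for example the fmt=srv3 variant in the URL below), the element and attribute names in parse_transcript would need to be adjusted after inspecting the actual XML.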

https://www.youtube.com/api/timedtext?signature=1A03D323CBD455E9993B7AC447CA64764FA6FE75.59F4BD2D45A32E89FBF54B418EE2F763283A1007&asr_langs=fr%2Cja%2Cnl%2Ces%2Cru%2Cko%2Cit%2Cde%2Cpt%2Cen&key=yttt1&caps=asr&v=5_zrHZdhaBU&hl=en_US&expire=1480702409&sparams=asr_langs%2Ccaps%2Cv%2Cexpire&lang=en&fmt=srv3

I have no idea how this URL is generated, though. Working that out would require investigating YouTube's complex scripts.

EDIT: This answer helped me. You can omit most of these parameters and just use this URL:

https://www.youtube.com/api/timedtext?&v=5_zrHZdhaBU&lang=en

Or this in general:

https://www.youtube.com/api/timedtext?&v={video_id}&lang={language_code}
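
As a rough sketch, the general URL can be built from a normal watch URL in Python 3 as follows. The helper names (video_id_from_watch_url, build_timedtext_url, fetch_transcript_xml) are made up for illustration, and whether the endpoint still answers a request without the signed parameters is an assumption you should check against a live response.

```python
from urllib.parse import parse_qs, urlencode, urlparse
from urllib.request import urlopen

TIMEDTEXT_ENDPOINT = "https://www.youtube.com/api/timedtext"


def video_id_from_watch_url(watch_url):
    """Extract the v= query parameter from a YouTube watch URL."""
    query = parse_qs(urlparse(watch_url).query)
    return query["v"][0]


def build_timedtext_url(video_id, language_code="en"):
    """Build the shortened transcript URL from the answer."""
    params = urlencode({"v": video_id, "lang": language_code})
    return TIMEDTEXT_ENDPOINT + "?" + params


def fetch_transcript_xml(video_id, language_code="en"):
    """Download the transcript XML (assumes the endpoint accepts unsigned requests)."""
    with urlopen(build_timedtext_url(video_id, language_code)) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    vid = video_id_from_watch_url("https://www.youtube.com/watch?v=5_zrHZdhaBU")
    print(build_timedtext_url(vid, "en"))
```

Calling fetch_transcript_xml(vid) would then return the XML as a string, which can be parsed instead of the watch page's HTML.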

Published: 2023-08-02 08:24:00
Article link: https://www.elefans.com/category/jswz/34/1373242.html