How to web scrape YouTube transcripts with Beautifulsoup4 and Python 3

Updated: 2024-10-28 10:35:38

Here is my current code. I am not sure what I am doing wrong. Maybe I am not digging deep enough into the HTML and giving Beautifulsoup the right tags? At the moment, my code returns blanks.

from bs4 import BeautifulSoup
from urllib.request import urlopen

html = urlopen("https://www.youtube.com/watch?v=5_zrHZdhaBU")
soup = BeautifulSoup(html, 'html.parser')
nameList = soup.findAll("div", {"id": "cp-2"})
for name in nameList:
    print(name.get_text())

Here is the markup that I inspected. I'm trying to get Python to return "but it was untucked."

<div id="cp-2" class="caption-line" data-time="7.54"><div class="caption-line-time">0:07</div><div class="caption-line-text">but it was untucked.</div></div>

***Edit

The markup can be found by clicking "More" next to the Share button. Then click Transcript and you will see all the text there.

Best answer

Oh yes, it's loaded via Ajax: open the page, then open the Network tab in your browser's developer tools, sort requests by start time (latest requests first), and click the CC button on YouTube. Because the transcript is fetched after the page loads, the HTML that urlopen receives does not contain the caption-line elements, so Beautifulsoup finds nothing.

You will see an api/timedtext request, and the response is XML. Here is the full URL to the transcript:

https://www.youtube.com/api/timedtext?signature=1A03D323CBD455E9993B7AC447CA64764FA6FE75.59F4BD2D45A32E89FBF54B418EE2F763283A1007&asr_langs=fr%2Cja%2Cnl%2Ces%2Cru%2Cko%2Cit%2Cde%2Cpt%2Cen&key=yttt1&caps=asr&v=5_zrHZdhaBU&hl=en_US&expire=1480702409&sparams=asr_langs%2Ccaps%2Cv%2Cexpire&lang=en&fmt=srv3

I have no idea how this URL is generated, though. Working that out would require investigating YouTube's complex scripts.

EDIT: This answer helped me. You can omit most of these parameters and just use this URL:

https://www.youtube.com/api/timedtext?&v=5_zrHZdhaBU&lang=en

Or, in general:

https://www.youtube.com/api/timedtext?&v={video_id}&lang={language_code}
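
To tie this together, here is a rough sketch of how the short URL could be built and the XML response turned into plain text. It uses the standard library's xml.etree.ElementTree rather than Beautifulsoup so it runs without extra packages. The helper names (build_timedtext_url, parse_transcript, fetch_transcript) are made up for illustration, and the XML shape (one `<text start="...">` element per caption) is an assumption, not something confirmed above. The endpoint may also return an empty body or behave differently today. The built URL drops the stray `&` after the `?`, which should be equivalent.

```python
from html import unescape
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

TIMEDTEXT_ENDPOINT = "https://www.youtube.com/api/timedtext"


def build_timedtext_url(video_id, language_code="en"):
    """Build the short transcript URL (v and lang parameters only)."""
    query = urlencode({"v": video_id, "lang": language_code})
    return TIMEDTEXT_ENDPOINT + "?" + query


def parse_transcript(xml_text):
    """Return (start_seconds, caption_text) pairs from the XML.

    Assumption: every caption is a <text start="..."> element.
    """
    root = ET.fromstring(xml_text)
    captions = []
    for node in root.iter("text"):
        start = float(node.get("start", "0"))
        # Captions may still contain escaped entities after XML parsing.
        captions.append((start, unescape(node.text or "").strip()))
    return captions


def fetch_transcript(video_id, language_code="en"):
    """Download and parse a transcript. Needs network access."""
    with urlopen(build_timedtext_url(video_id, language_code)) as response:
        body = response.read().decode("utf-8")
    # An empty body is treated as "no transcript available".
    return parse_transcript(body) if body.strip() else []


# Offline example using the caption from the question (XML shape assumed).
sample_xml = '<transcript><text start="7.54">but it was untucked.</text></transcript>'
for start, text in parse_transcript(sample_xml):
    print(start, text)
```

Calling fetch_transcript("5_zrHZdhaBU") would attempt the real request. The offline sample only shows the parsing step, so it works even if the endpoint does not respond.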