I am attempting to scrape data from NBA.com using Python, but I do not receive a response after waiting for a reasonable amount of time when I run my code (shown below).
```python
import requests
import json

url_front = 'http://stats.nba.com/stats/leaguedashplayerstats?College=&' + \
            'Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&' + \
            'DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&' + \
            'Location=&MeasureType='

url_back = '&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&' + \
           'PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&' + \
           'PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&' + \
           'SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&' + \
           'VsConference=&VsDivision=&Weight='

#measure_type = ['Base','Advanced','Misc','Scoring','Opponent','Usage','Defense']
measure_type = 'Base'
address = url_front + measure_type + url_back

# Request the URL, then parse the JSON.
response = requests.get(address)
response.raise_for_status()  # Raise exception if invalid response.
data = response.json()  # JSON decoding.
```

So far, I have attempted to reproduce code from blog posts (here) and/or questions posted on this site (Python, R) that are similar in nature, but I end up with the same result each time - the code does not actually succeed in pulling anything from the URL.
Since I am new to web scraping, I was hoping for assistance with troubleshooting the issue - is this common to sites with client-side rendering (NBA.com), or is it indicative of an issue with my code/computer? In either case, are there common workarounds/solutions?
Best answer
If you visit the link in your browser you'll notice it works fine. The reason is that the browser and requests have different user agent headers, and the site specifically blocks HTTP requests that don't look like they come from browsers because they don't want to be scraped. You can bypass this like so:
```python
response = requests.get(address, headers={
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0',
})
```

Keep this in mind and don't overload their servers.
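To make the hang itself easier to diagnose, it may also help to pass a `timeout`: `requests` does not apply one by default, so a request the server never answers can block indefinitely. Below is a minimal sketch, not a definitive implementation, that combines the browser-like `User-Agent` from above with a timeout and a pause between calls. The helper names (`build_url`, `fetch_stats`, `fetch_all`), the 10-second timeout and the 1-second delay are my own assumptions for illustration, and whether the header alone is still enough for this site is not something I can guarantee.

```python
import time
from urllib.parse import urlencode

BASE_URL = "http://stats.nba.com/stats/leaguedashplayerstats"

# Browser-like User-Agent copied from the answer above.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:46.0) "
        "Gecko/20100101 Firefox/46.0"
    ),
}

# Query parameters that the URL in the question leaves empty.
EMPTY_KEYS = [
    "College", "Conference", "Country", "DateFrom", "DateTo", "Division",
    "DraftPick", "DraftYear", "GameScope", "GameSegment", "Height",
    "Location", "Outcome", "PlayerExperience", "PlayerPosition",
    "SeasonSegment", "ShotClockRange", "StarterBench", "VsConference",
    "VsDivision", "Weight",
]


def build_url(measure_type="Base"):
    """Rebuild the question's URL for one MeasureType (hypothetical helper)."""
    params = {key: "" for key in EMPTY_KEYS}
    params.update({
        "LastNGames": "0", "LeagueID": "00", "MeasureType": measure_type,
        "Month": "0", "OpponentTeamID": "0", "PORound": "0",
        "PaceAdjust": "N", "PerMode": "PerGame", "Period": "0",
        "PlusMinus": "N", "Rank": "N", "Season": "2016-17",
        "SeasonType": "Regular Season", "TeamID": "0",
    })
    # urlencode turns the space in "Regular Season" into "+".
    return BASE_URL + "?" + urlencode(params)


def fetch_stats(measure_type="Base", timeout=10):
    """Fetch one stats table; fails after `timeout` seconds instead of hanging."""
    # Third-party package, imported here so build_url works without it.
    import requests

    response = requests.get(build_url(measure_type), headers=HEADERS,
                            timeout=timeout)
    response.raise_for_status()  # Raise exception if invalid response.
    return response.json()


def fetch_all(measure_types, delay=1.0):
    """Fetch several measure types, pausing between requests."""
    results = {}
    for measure_type in measure_types:
        results[measure_type] = fetch_stats(measure_type)
        time.sleep(delay)  # Assumed delay, to avoid overloading the server.
    return results
```

With this, `data = fetch_stats('Base')` should either return the decoded JSON or raise an exception (for example `requests.exceptions.Timeout`) rather than waiting forever, which makes it easier to tell a blocked request from a slow one.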