Scraping a table from a web page

Updated: 2024-10-27 11:22:22

Problem description

I'm trying to extract CSU employee salary data from this webpage (www.sacbee/statepay/#req=employee%2Fsearch%2Fname%3D%2Fyear%3D2013%2Fdepartment%3DCSU%20Sacramento). I've tried both urllib2 and the requests library, but neither returned the actual table from the page. My guess is that the table is generated dynamically by JavaScript. Below is my code using requests.

    from lxml import html
    import requests

    # Fetch the page and parse the returned HTML.
    page = requests.get("www.sacbee/statepay/#req=employee%2Fsearch%2Fname%3D%2Fyear%3D2013%2Fdepartment%3DCSU%20Sacramento")
    tree = html.fromstring(page.text)
    # Text of the second cell in every table row.
    name = tree.xpath('//table/tbody/tr/td[2]/text()')
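One thing that may explain the empty result: everything after the `#` in that URL is a fragment, and fragments stay in the browser and are not sent in the HTTP request. A plain GET therefore only asks for the bare page, and the search itself is presumably carried out later by the page's JavaScript. The sketch below only splits and decodes the URL from the question; the helper name `split_fragment` is made up for illustration.

```python
from urllib.parse import unquote, urldefrag

# Page URL exactly as written in the question (the scheme and full host are
# missing there; that does not affect how the fragment is split off).
PAGE_URL = (
    "www.sacbee/statepay/#req=employee%2Fsearch%2Fname%3D%2F"
    "year%3D2013%2Fdepartment%3DCSU%20Sacramento"
)


def split_fragment(url):
    """Return (URL without its fragment, percent-decoded fragment)."""
    without_fragment, fragment = urldefrag(url)
    return without_fragment, unquote(fragment)


sent, fragment = split_fragment(PAGE_URL)
print("Requested from the server:", sent)
print("Kept on the client side:  ", fragment)
```

The decoded fragment reads like a search path (employee/search/name=/year=2013/department=CSU Sacramento), which suggests the data comes from a separate request made by the page rather than from the HTML itself.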

Any help or comments would be highly appreciated.

Recommended answer

Here's my attempt at it, as per my comment. Note that I only pulled out one row of data; the rest is up to you.

Code:

    import requests as rq

    # JSON endpoint that the page queries for its search results.
    url = "api.sacbeelabs/v1/statepay/employee/search/name=/year=2013/department=CSU%20Sacramento.json"

    # Opaque request body, sent as-is.
    data = "74XoegZ494trsvrus_As4B4handjZ494-Adl4B4olg494dnnk933pppAmWYXaaAYjh3mnWnakWq3-Ela-B-Oahkgjqaa07tw8tJmaWlYd07tw8tJiWha07tw8uH07tw8tJqaWl07tw8uHtrsu07tw8tJZakWlnhain07tw8uHGT-107tw8trTWYlWhainj4B4labalal494dnnk933mnWYfj-8albgjpAYjh3-Boamnejim3tt_v_rt_3YlWpgeic1nWXgam1bljh1paXkWca4B4nenga494TnWnaDVjlfalDTWgWlqDTaWlYdD1DUdaDTWYlWhainjDFaaBDTWYlWhainjBDGWgebjlieW4B4mYlV49sxzrB4mYlL49srwrB4peiV49sxzrB4peiL49_stB4oW4974Wcain494Oj-CeggW3wArD-I-6ss-MD-1Xoino-MDNeio-AD-Azx2xv-MDl-89tzAr-JDKaYfj3trsrrsrsDJelabj-A3tzAr4B4njoYd49bWgmaB4Zjh4954mnjlWca4B4WiehWneji4B4YWi-8WmtZ4B4paXmjYfan4B4pjlfal4B4WoZej4B4-8eZaj4B4m-8c4B4cajgjY46B4Ymm4954WiehWneji4B4nlWimbjlh468B4omal4974Woi494Koamn488"

    # Request headers, including the X-SBAPI-* tokens and the matching cookie.
    headers = {
        'Host': 'api.sacbeelabs',
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate',
        'X-SBAPI-Auth-Token': '0QNWbefXw6fQQcWXqK8vDw',
        'X-SBAPI-SID': '3gbRqglHXAVDy1vwdcVVMf',
        'X-SBAPI-CID': '2HuWho39ZcDUlTswYSWUd9',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Referer': 'www.sacbee/statepay/',
        'Content-Length': '684',
        'Origin': 'www.sacbee',
        'Cookie': 'sbapi-cid=2HuWho39ZcDUlTswYSWUd9; sbapi-sid=3gbRqglHXAVDy1vwdcVVMf',
        'Connection': 'keep-alive',
        'Pragma': 'no-cache',
        'Cache-Control': 'no-cache'
    }

    # POST the query and decode the JSON response.
    r = rq.post(url, data=data, headers=headers)
    json_data = r.json()

    base = json_data["result"]["employees"][0]  # First employee.
    name = base["name"]
    first_name = name["first"]
    last_name = name["last"]
    pay = base["pay"]["total"]
    title = base["title"]
    dept = base["department"]

    print(first_name, last_name, pay, title, dept)

    # Your turn here...

Result:

    Clayton Abajian 9844 Lecturer - Academic Year CSU Sacramento
    [Finished in 0.9s]
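For the part left as "Your turn here...", one way to go from the first employee to the whole table is to loop over the `employees` list. This is only a sketch: the payload shape is inferred from the keys the answer reads (`result`, `employees`, `name`, `pay`, `title`, `department`), the `SAMPLE` data stands in for the real response (its first record repeats the printed result, its second is invented), and the helper names `employee_rows` and `rows_to_csv` are hypothetical.

```python
import csv
import io

# Stand-in for r.json(); the real response may contain more fields.
SAMPLE = {
    "result": {
        "employees": [
            {
                "name": {"first": "Clayton", "last": "Abajian"},
                "pay": {"total": 9844},
                "title": "Lecturer - Academic Year",
                "department": "CSU Sacramento",
            },
            {   # Invented placeholder record, for illustration only.
                "name": {"first": "Jane", "last": "Doe"},
                "pay": {"total": 12345},
                "title": "Example Title",
                "department": "CSU Sacramento",
            },
        ]
    }
}

HEADER = ["first_name", "last_name", "total_pay", "title", "department"]


def employee_rows(json_data):
    """Flatten every employee record into a (first, last, pay, title, dept) tuple."""
    rows = []
    for emp in json_data["result"]["employees"]:
        name = emp["name"]
        rows.append((name["first"], name["last"], emp["pay"]["total"],
                     emp["title"], emp["department"]))
    return rows


def rows_to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buf.getvalue()


rows = employee_rows(SAMPLE)
csv_text = rows_to_csv(rows)
print(csv_text)
```

With a live response, passing `json_data` instead of `SAMPLE` should give one row per employee, provided every record really carries these keys.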


Published: 2023-06-08 11:10:34

Article link: https://www.elefans.com/category/jswz/34/608248.html