Python web scraping using requests

Problem description

I have a Python requests/BeautifulSoup script below that lets me log in to a URL successfully. However, after logging in, to get the data I need I would normally have to do the following manually:

1) Click on 'statement' in the first row:

2) Select dates, then click 'run statement':

3) View the data:

This is the code I have used to log in and reach step 1 above:

import requests
from bs4 import BeautifulSoup

# Note: the original post gave these URLs without a scheme; "https://" is
# added here so that requests can resolve them.
logurl = "https://login.flash.co.za/apex/f?p=pwfone:login"
posturl = "https://login.flash.co.za/apex/wwv_flow.accept"

with requests.Session() as s:
    s.headers = {"User-Agent": "Mozilla/5.0"}
    res = s.get(logurl)
    soup = BeautifulSoup(res.text, "html.parser")

    # Collect every hidden p_arg_names field (there can be several)
    arg_names = []
    for name in soup.select("[name='p_arg_names']"):
        arg_names.append(name['value'])

    # Rebuild the login form payload from the page's hidden fields
    values = {
        'p_flow_id': soup.select_one("[name='p_flow_id']")['value'],
        'p_flow_step_id': soup.select_one("[name='p_flow_step_id']")['value'],
        'p_instance': soup.select_one("[name='p_instance']")['value'],
        'p_page_submission_id': soup.select_one("[name='p_page_submission_id']")['value'],
        'p_request': 'LOGIN',
        'p_t01': 'solar',
        'p_arg_names': arg_names,
        'p_t02': 'password',
        'p_md5_checksum': soup.select_one("[name='p_md5_checksum']")['value'],
        'p_page_checksum': soup.select_one("[name='p_page_checksum']")['value']
    }
    s.headers.update({'Referer': logurl})
    r = s.post(posturl, data=values)
    print(r.content)

My question is (beginner speaking): how could I skip steps 1 and 2 and simply do another headers update and POST to the final URL, using the selected dates as form entries (headers and form info below)? (The Referer header is step 2 above.)


Edit 1: Network request from the CSV file download:

Recommended answer

As others have recommended, Selenium is a good tool for this sort of task. However, I'll suggest a way to do it with requests, since that's what you asked for in the question.

The success of this approach really depends on how the web page is built and how the data files are made available (assuming "Save as CSV" in the data view is what you're targeting).

If the login mechanism is cookie-based, you can use Sessions and Cookies in requests. When you submit the login form, a cookie is returned in the response headers. You then send that cookie in the request headers of any subsequent page requests to keep your login active.
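A minimal sketch of how this works with requests.Session: cookies set by the server are stored in the session's cookie jar automatically, and every later request made through the same session sends them back, so there is no need to copy headers by hand. The cookie name and value below are hypothetical stand-ins for whatever the real login response sets.

```python
import requests

s = requests.Session()

# After a real s.post(posturl, data=values) login, the server's Set-Cookie
# headers would populate this jar automatically. Here we simulate that with
# a hypothetical cookie name and value:
s.cookies.set("ORA_WWV_APP_SESSION", "abc123", domain="login.flash.co.za")

# Any subsequent s.get()/s.post() to that domain now carries the cookie;
# we can inspect the jar directly:
print(s.cookies.get("ORA_WWV_APP_SESSION"))  # abc123
```

This is why the original login code uses `with requests.Session() as s:` — all follow-up requests must go through the same `s` for the login to stick.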

Also, you should inspect the network request behind the "Save as CSV" action in the Developer Tools network pane. If you can see the structure of the request, you may be able to make a direct request within your authenticated session, using a statement identifier and the dates as the payload to get your results.
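Once the exact request behind "Save as CSV" is known from the network pane, it can be replayed inside the authenticated session. The sketch below uses requests' `PreparedRequest` to show what such a POST would send over the wire without actually sending it; the endpoint URL, field names, and values are all hypothetical placeholders to be replaced with the real ones copied from Developer Tools.

```python
import requests

# Hypothetical endpoint and form fields -- copy the real ones from the
# "Save as CSV" entry in the Developer Tools network pane:
csv_url = "https://login.flash.co.za/apex/wwv_flow.show"
payload = {
    "p_request": "CSV",        # hypothetical request name
    "statement_id": "12345",   # hypothetical statement identifier
    "date_from": "2018-01-01",
    "date_to": "2018-01-31",
}

# Prepare the request without sending it, to inspect the encoded body:
req = requests.Request("POST", csv_url, data=payload)
prepared = req.prepare()
print(prepared.body)
```

In practice you would send this through the logged-in session instead, e.g. `r = s.post(csv_url, data=payload)`, so the session's cookies authenticate the download, then write `r.content` to a file.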
