Partial page source from HttpWebResponse

Updated: 2024-10-28 14:32:49

Problem description

I am very new to this so please pardon any ignorance.

I have created my first multi-threaded application. Its purpose is to make numerous web requests, parse each page source, and store the results in tables for further interrogation. Theoretically there could be as many as 30,000-40,000 requests, hence the need to multi-thread. Each request gets a thread. I think everything is working, except that I very often get only a very partial page source. It's almost as if the StreamReader gets interrupted while reading the response. When I make the same request in a browser, I get the entire page. I thought it might have to do with threading, although I think I am still making the calls synchronously. (Ideally, I would like to make the calls asynchronously, but I am not sure how to go about that.) Is there a way of knowing whether the page source is complete, so that I can decide whether to request it again? I am sure there are complexities here that I am missing. Any help on any of the code would be greatly appreciated.

Sorry about the formatting. Below is part of the code for the class that makes the requests:

using System;
using System.Collections.Generic;
using System.Text;
using System.Data.Sql;
using System.Data.SqlClient;
using System.Threading;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

namespace M4EverCrawler
{
    public class DomainRun
    {
        public void Start()
        {
            new Thread(new ThreadStart(this.Run1)).Start();
            new Thread(new ThreadStart(this.Run2)).Start();
            new Thread(new ThreadStart(this.Run3)).Start();
        }

        public DomainRun(DNQueueManager dnq, ProxyQueueManager prxQ)
        {
            dnqManager = dnq;
            ProxyManager = prxQ;
        }

        private DNQueueManager dnqManager;
        private ProxyQueueManager ProxyManager;
        public StagingQueue StagingQueue = new StagingQueue();
        public MetricsQueueManager MQmanager = new MetricsQueueManager();
        public CommitQueueManager CQmanager = new CommitQueueManager();

        protected void Run1()
        {
            dnqManager.LoadDNs();
            ProxyManager.LoadProxies();
            while (true)
            {
                if (dnqManager.IsDNDavailable)
                {
                    DomainData dnd = dnqManager.GetDND();
                    dnd.PageSource = CapturePage(dnd.DomainName);
                    StagingQueue.AddDN2Q(dnd);
                }
                Thread.Sleep(new Random().Next(20));
            }
        }

        protected void Run2()
        {
            while (true)
            {
                if (StagingQueue.IsDNDavailable)
                {
                    DomainData dnd = StagingQueue.GetDND();
                    MaxOutboundLinks = 3;
                    AvoidHttps = true;
                    InsideLinks = false;
                    VerifyBackLinks = true;
                    MQmanager.AddDN2Q(ParsePage(dnd));
                    foreach (string link in dnd.Hlinks)
                    {
                        DomainData dndLink = new DomainData(dnd.MainSeqno, link.ToString());
                        dndLink.ParentDomainName = dnd.DomainName;
                        dnd.PageSource = String.Empty;
                        MQmanager.AddDN2Q(dndLink);
                    }
                }
                Thread.Sleep(new Random().Next(20));
            }
        }

        protected void Run3()
        {
            while (true)
            {
                if (MQmanager.IsDNDavailable)
                {
                    DomainData dnd = MQmanager.GetDND();
                    RunAlexa(dnd);
                    RunCompete(dnd);
                    RunQuantcast(dnd);
                    CQmanager.AddDN2Q(dnd, MQmanager, 1000);
                }
                Thread.Sleep(new Random().Next(20));
            }
        }

        private string CapturePage(string URIstring)
        {
            Uri myUri;
            try
            {
                myUri = new Uri(URIstring);
            }
            catch (Exception URIex)
            {
                return String.Empty;
            }

            string proxyIP = ProxyManager.GetCurrentProxy() == ""
                ? ProxyManager.GetProxy()
                : ProxyManager.GetCurrentProxy();
            int proxCtr = 0;

            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(myUri);
            WebProxy Proxy = new WebProxy(proxyIP);
            request.Proxy = Proxy;
            request.Timeout = 20000;

            try
            {
                using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
                {
                    using (StreamReader strmRdr = new StreamReader(response.GetResponseStream(), Encoding.ASCII))
                    {
                        return strmRdr.ReadToEnd();
                    }
                }
            }
            catch (InvalidOperationException Wex)
            {
                . . .
            }
        }

Solution

You are using a StreamReader with an ASCII encoding. If the data sent by the server is not valid ASCII, the StreamReader will not decode it correctly into the string.
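To see what an ASCII decoder does with non-ASCII bytes, here is a minimal sketch. The class and method names (`AsciiDecodeDemo`, `DecodeAsAscii`) are made up for illustration; the only library call involved is `Encoding.ASCII.GetString`, which is the same decoding the StreamReader in the question is configured to use.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the question's code.
public static class AsciiDecodeDemo
{
    // Decodes raw bytes as ASCII, mirroring the Encoding.ASCII argument
    // passed to the StreamReader in the question.
    public static string DecodeAsAscii(byte[] raw)
    {
        return Encoding.ASCII.GetString(raw);
    }

    // Counts the '?' characters in a decoded string. A page that did not
    // contain them originally but does after decoding was probably not ASCII.
    public static int CountQuestionMarks(string text)
    {
        int count = 0;
        foreach (char c in text)
        {
            if (c == '?') count++;
        }
        return count;
    }
}
```

For example, passing the UTF-8 bytes of "café" (where é is the two bytes 0xC3 0xA9) to `DecodeAsAscii` gives "caf??": every byte above 0x7F comes back as a question mark. The text is damaged rather than cut short, so this alone may not explain a truncated page.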

Note that the server might declare the page encoding either explicitly in the response headers or through a META tag in the page content itself.

The following page shows you how to download data using the correct encodings: blogs.msdn/feroze_daud/archive/2004/03/30/104440.aspx
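The linked post is not reproduced here, but the general idea can be sketched as follows: read the body as raw bytes first, then decode it with the charset the server declared. `PageDecoder`, `ResolveEncoding` and `ReadBody` are hypothetical names, and the fallback to UTF-8 when the charset is missing or unknown is an assumption of this sketch. It only looks at the response header (via `HttpWebResponse.CharacterSet`) and does not inspect META tags.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

// Hypothetical helper, not part of the question's code.
public static class PageDecoder
{
    // Maps a charset name (for example the value of HttpWebResponse.CharacterSet)
    // to an Encoding. Falling back to UTF-8 is an assumption of this sketch.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
        {
            return Encoding.UTF8;
        }
        try
        {
            // Some servers quote the charset value, so strip quotes first.
            return Encoding.GetEncoding(charset.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            // Unknown or unsupported charset name.
            return Encoding.UTF8;
        }
    }

    // Reads the whole response body as bytes, then decodes it with the
    // encoding declared in the response headers.
    public static string ReadBody(HttpWebResponse response)
    {
        using (Stream body = response.GetResponseStream())
        using (MemoryStream buffer = new MemoryStream())
        {
            body.CopyTo(buffer);
            byte[] raw = buffer.ToArray();
            return ResolveEncoding(response.CharacterSet).GetString(raw);
        }
    }
}
```

In the question's `CapturePage`, the inner `using (StreamReader ...)` block could be replaced by `return PageDecoder.ReadBody(response);`. Keeping the raw bytes around also makes it possible to log how many bytes were actually received.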

It is also possible that you are not getting the full entity body from the server; this could be due to a bad proxy or something else.

You might want to add more diagnostics to your app. Log the number of bytes downloaded and the proxy used. Then you can call Encoding.ASCII.GetBytes(string).Length and make sure that it is the same as the number of bytes downloaded. If it is not, then you have a problem with page encodings. If the counts match but the page is still incomplete, then you have a bad proxy on the path.
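A rough sketch of those diagnostics might look like the following. All names here (`DownloadDiagnostics`, `ReadAllBytes`, `AsciiLengthMatches`, `LooksComplete`, `Download`, `proxyLabel`) are hypothetical. The comparison against `HttpWebResponse.ContentLength` is an extra check that the answer does not mention; it is only meaningful when the server sends a Content-Length header (the property is -1 otherwise) and when automatic decompression is not enabled on the request.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

// Hypothetical helper, not part of the question's code.
public static class DownloadDiagnostics
{
    // Reads a stream to the end and returns the raw bytes.
    public static byte[] ReadAllBytes(Stream body)
    {
        using (MemoryStream buffer = new MemoryStream())
        {
            body.CopyTo(buffer);
            return buffer.ToArray();
        }
    }

    // The check suggested in the answer: re-encode the decoded string as
    // ASCII and compare its length with the number of bytes downloaded.
    public static bool AsciiLengthMatches(string decoded, long bytesRead)
    {
        return Encoding.ASCII.GetBytes(decoded).Length == bytesRead;
    }

    // Extra check (an assumption of this sketch): compare the bytes read with
    // the declared Content-Length. A declared length of -1 means the header
    // was absent, so completeness cannot be judged this way.
    public static bool LooksComplete(long declaredLength, long bytesRead)
    {
        return declaredLength < 0 || bytesRead == declaredLength;
    }

    // Downloads the body and logs the URI, the proxy label, and both counts.
    public static byte[] Download(HttpWebRequest request, string proxyLabel)
    {
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream body = response.GetResponseStream())
        {
            byte[] raw = ReadAllBytes(body);
            Console.WriteLine(
                "{0} via {1}: {2} bytes read, Content-Length {3}, looks complete: {4}",
                request.RequestUri, proxyLabel, raw.Length, response.ContentLength,
                LooksComplete(response.ContentLength, raw.Length));
            return raw;
        }
    }
}
```

If `LooksComplete` keeps returning false for one proxy and true for others, that points at the proxy rather than at the encoding.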

Hope this helps.

Published: 2023-11-01 07:24:25
Link: https://www.elefans.com/category/jswz/34/1548654.html