Preventing My PHP Web Crawler from Stalling


Problem Description


I'm using the PHPCrawl class and have added some DOMDocument and DOMXpath code to pull specific data off web pages, but the script stalls out before it gets even close to crawling the whole website.

I have set_time_limit set to 100000000, so that shouldn't be an issue.
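For context, set_time_limit only caps PHP's own execution clock (on Linux, time spent in network I/O or database calls doesn't even count toward it), so other PHP-side limits can still end a long crawl. A minimal sketch of the settings worth checking; the values are illustrative assumptions, not recommendations:

<?php
// set_time_limit() lifts PHP's execution-time cap, but a long crawl
// can still die on these:
set_time_limit(0);                 // 0 removes PHP's execution-time cap entirely
ini_set('memory_limit', '512M');   // an ever-growing DOM/result set can hit the memory cap
ignore_user_abort(true);           // keep running even if the browser drops the connection
?>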

Any ideas?

Thanks, Nick

<?php

// It may take a while to crawl a site ...
set_time_limit(100000000);

// Include the phpcrawl main class
include("classes/phpcrawler.class.php");

//connect to the database
mysql_connect('localhost','#####','#####');
mysql_select_db('ft2');

// Extend the class and override the handlePageData()-method
class MyCrawler extends PHPCrawler 
{
  function handlePageData(&$page_data) 
  {
    // Here comes your code.
    // Do whatever you want with the information given in the
    // array $page_data about a page or file that the crawler actually found.
    // See a complete list of elements the array will contain in the
    // class-reference.
    // This is just a simple example.

    // Print the URL of the actual requested page or file
    echo "Page requested: ".$page_data["url"]."<br>";

    // Print the first line of the header the server sent (HTTP-status)
    //echo "Status: ".strtok($page_data["header"], "\n")."<br>";

    // Print the referer
    //echo "Referer-page: ".$page_data["referer_url"]."<br>";

    // Print whether the content was received or not
    /*if ($page_data["received"]==true)
      echo "Content received: ".$page_data["bytes_received"]." bytes";
    else
      echo "Content not received";
    */
    // ...

    // Now you should do something with the content of the actual
    // received page or file ($page_data[source]), we skip it in this example

    //echo "<br><br>";
    echo str_pad(" ", 5000); // "Force flush", workaround
    flush();

 //this is where we tear the data apart looking for username and timestamps
 $url = $page_data["url"];
 $html = new DOMDocument();
 // Note: loadHTMLFile() downloads the page a second time (PHPCrawl already
 // fetched it), and real-world HTML is rarely valid, so silence libxml
 // warnings rather than letting them flood the output.
 libxml_use_internal_errors(true);
 $html->loadHTMLFile($url);

 $xpath = new DOMXpath($html);

 //children of ol id=posts
 $links = $xpath->query("//li[@class='postbit postbitim postcontainer']");

 $return = array(); // initialize, so the foreach below is safe on pages with no posts
 foreach($links as $results){
  $newDom = new DOMDocument;
  $newDom->appendChild($newDom->importNode($results,true));

  $xpath = new DOMXpath($newDom);
  $time_node = $xpath->query("div/div/span/span")->item(0);
  $user_node = $xpath->query("div/div[2]/div/div/div/a/strong/font")->item(0);

  // Skip posts that don't match the expected markup; dereferencing a
  // missing node here would otherwise stop the crawl mid-run.
  if ($time_node === null || $user_node === null) continue;

  $time_stamp = substr($time_node->nodeValue,0,10);
  $user_name = trim($user_node->nodeValue);

  $return[] = array(
   'time_stamp' => $time_stamp,
   'username' => $user_name,
   );
 }

 foreach ($return as $output) {
  echo "<strong>Time posted: " . $output['time_stamp'] . " by " . $output['username'] . "</strong>";
  //make your database entry
  $time_stamp = $output['time_stamp'];
  // split() is deprecated (removed in PHP 7); preg_split() is the replacement
  list($month, $day, $year) = preg_split('/[\/.-]/', $time_stamp);
  $time_stamp = $year."-".$month."-".$day;
  echo $time_stamp;

  // escape user-supplied text before interpolating it into SQL
  $username = mysql_real_escape_string($output['username']);
  $sql="INSERT INTO lovesystems VALUES ('$username','$url','$time_stamp')";
  if (mysql_query($sql)) echo "Successfully inserted user in database!<br/>";
  else echo mysql_error();
 }
  }
}

// Now, create an instance of the class, set the behaviour
// of the crawler (see class-reference for more methods)
// and start the crawling-process.

$crawler = new MyCrawler(); // "=& new" is deprecated; plain "new" is correct here

// URL to crawl
$crawler->setURL("http://######");

// Only receive content of files with content-type "text/html"
// (regular expression, preg)
$crawler->addReceiveContentType("/text\/html/");

// Ignore links to pictures, don't even request pictures
// (preg_match; note the escaped dot and no space before the /i modifier)
$crawler->addNonFollowMatch("/\.(jpg|gif|png)$/i");

// Store and send cookie-data like a browser does
$crawler->setCookieHandling(true);

// Set the traffic-limit to 1 MB (in bytes,
// for testing we don't want to "suck" the whole site)
//$crawler->setTrafficLimit(1000 * 1024);

// That's enough, now here we go
$crawler->go();




// At the end, after the process is finished, we print a short
// report (see method getReport() for more information)

$report = $crawler->getReport();

echo "Summary:<br>";
if ($report["traffic_limit_reached"]==true)
  echo "Traffic-limit reached <br>";

echo "Links followed: ".$report["links_followed"]."<br>";
echo "Files received: ".$report["files_received"]."<br>";
echo "Bytes received: ".$report["bytes_received"]."<br>";

?>
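One thing worth noting about the code above: PHPCrawl already hands the downloaded markup to handlePageData() in $page_data["source"] (the example's own comment mentions it), so calling loadHTMLFile($url) fetches every page a second time and doubles the crawl's network work. A minimal sketch of parsing the already-received content instead, assuming the same XPath query as above:

<?php
// Inside handlePageData(): reuse the body PHPCrawl already downloaded
// instead of re-requesting the URL with loadHTMLFile().
$html = new DOMDocument();
libxml_use_internal_errors(true);  // tolerate real-world, non-wellformed HTML
if ($page_data["received"] == true && $html->loadHTML($page_data["source"])) {
 $xpath = new DOMXpath($html);
 $links = $xpath->query("//li[@class='postbit postbitim postcontainer']");
 // ... same per-post extraction as in handlePageData() above ...
}
?>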

Recommended Answer

Check your server's configuration. I'm pretty sure Apache has a script timeout in its configuration.
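Concretely, if the crawler is started through a browser, the web server can drop the request no matter what set_time_limit says. The directives below are the usual suspects; the values are illustrative assumptions:

# httpd.conf: Apache aborts a request whose connection stays idle
# longer than this many seconds
Timeout 300

; php.ini: PHP's own execution cap; set_time_limit() overrides it at runtime
max_execution_time = 300

Running the crawler from the command line (php crawler.php) sidesteps the web server's timeouts entirely (CLI PHP defaults to no execution-time limit) and is usually the more robust setup for a long crawl.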
