webmagic+selenium+tesseract

Updated: 2024-10-18 10:36:26


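👨🏻‍🎓 About the author: Hi everyone, I'm 芝士味的椒盐, a university student who loves sharing knowledge. Glad to meet you here 🌟
🌈 Areas of focus: Java, big data, operations, electronics
🙏🏻 If this article helps you, please 🍭 follow + 👍🏻 like + 🗣 comment + 📦 bookmark. I will return the visit when I have time. Let's help each other!
🤝 My own level is limited and I aim to write simple, easy-to-follow articles. If anything here is wrong, please point it out. Thank you!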
 


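Contents

Introduction to WebMagic

Introduction to Selenium

Introduction to Tesseract-OCR

1. Project Requirements

2. Technical Feasibility Analysis

3. Implementation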


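Introduction to WebMagic

        WebMagic is a crawler framework that works without configuration and makes data extraction convenient, with a simple and flexible API. It is modular throughout. The crawler lifecycle is: link extraction -> page download -> content extraction -> data persistence. It also supports multi-threaded crawling, distributed crawling, automatic retries, custom cookies, and customizable modules.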
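To make the four lifecycle stages concrete, here is a minimal toy sketch in plain Java. It is not WebMagic's API: the class `ToyCrawler`, the in-memory "site" map, and the regex-based extraction are all assumptions made for illustration only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy model of a crawl loop. None of these names are WebMagic classes. */
final class ToyCrawler {
    private static final Pattern LINK = Pattern.compile("href=\"([^\"]+)\"");
    private static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>");

    /** Crawls an in-memory "site" (url -> html) and returns page titles in visit order. */
    static List<String> crawl(Map<String, String> site, String startUrl) {
        Deque<String> queue = new ArrayDeque<>();   // URLs waiting to be fetched
        Set<String> seen = new HashSet<>();         // duplicate remover
        List<String> store = new ArrayList<>();     // stands in for a database
        queue.add(startUrl);
        seen.add(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            String html = site.get(url);            // stage 2: page download
            if (html == null) {
                continue;                           // unknown page, nothing to process
            }
            Matcher links = LINK.matcher(html);     // stage 1: link extraction
            while (links.find()) {
                String next = links.group(1);
                if (seen.add(next)) {
                    queue.add(next);
                }
            }
            Matcher title = TITLE.matcher(html);    // stage 3: content extraction
            if (title.find()) {
                store.add(title.group(1));          // stage 4: persistence
            }
        }
        return store;
    }
}
```

In a real WebMagic project these roles are played by the framework's scheduler, downloader, page processor, and pipeline. The toy version only shows how the stages hand data to one another.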

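Introduction to Selenium

        Selenium is an open-source framework released under the Apache License 2.0 and used for automated testing of web applications. Selenium tests run inside a real browser, just as if a real user were operating it. Supported browsers include Firefox, Safari, Chrome, Opera, and others.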

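Introduction to Tesseract-OCR

        Tesseract is an open-source OCR engine originally developed at HP Labs and later maintained by Google. Compared with MODI, its recognition data can be trained continually, so its ability to turn images into text keeps improving.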
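As a rough sketch of how the `tesseract` command-line tool can be driven from Java: the class name `TesseractCli` is hypothetical, and the sketch assumes the `tesseract` binary is on the PATH, that the input is a local image file, and that the installed version accepts `--psm` and `-c tessedit_char_whitelist=...` (both are standard options in Tesseract 4 and later, but whitelist behavior differs between versions).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that shells out to the tesseract CLI. */
final class TesseractCli {

    /** Builds the argument list: one text line, digits only, result written to stdout. */
    static List<String> command(String imagePath) {
        return Arrays.asList(
                "tesseract", imagePath, "stdout",
                "--psm", "7",                               // treat the image as a single text line
                "-c", "tessedit_char_whitelist=0123456789"  // restrict output to digits
        );
    }

    /** Runs tesseract and returns its stdout with line breaks removed. */
    static String run(String imagePath) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(imagePath))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // keep stderr out of the result
                .start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.append(line.trim());
            }
        }
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("tesseract exited with code " + exit);
        }
        return out.toString();
    }
}
```

Passing the arguments as a list avoids the quoting problems of a single concatenated command string.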

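1. Project Requirements

        As everyone knows, one of the most important steps in data mining is crawling the data. Since the arrival of the big-data era the demand for data volume has grown even larger, which forces crawlers to fetch more data in less time. Many websites, however, now deploy anti-crawling mechanisms, the most common of which is blocking any IP that sends too many requests within a short period. One workaround is to use a proxy server: the request goes to the proxy, the proxy requests the target site, and if an IP does get blocked it is the proxy's IP. We usually do not own that many proxy servers, but there are many proxy providers on the market. One of them is today's crawl target, MiPu Proxy (米扑代理). It sells paid proxies, but its free proxies are usable too. Our task is to crawl the IP, port, type, anonymity level, country (province/city), carrier, response time, transfer speed, verification date, and so on.

Finished project on Gitee: mipuproxy: crawling MiPu proxies with webmagic+selenium+tesseract-ocr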
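The proxy idea described in this section can be sketched with the JDK's own `java.net.Proxy`. The class `ProxyRoute` and the sample address used with it are hypothetical. The sketch only shows how a crawled IP and port would be turned into a proxy for an outgoing request.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;

/** Hypothetical helper that routes a request through a crawled proxy. */
final class ProxyRoute {

    /** Builds an HTTP proxy descriptor without doing a DNS lookup. */
    static Proxy httpProxy(String host, int port) {
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("port out of range: " + port);
        }
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    /** Prepares (but does not yet send) a request to targetUrl via the given proxy. */
    static HttpURLConnection open(String targetUrl, Proxy proxy) throws IOException {
        return (HttpURLConnection) URI.create(targetUrl).toURL().openConnection(proxy);
    }
}
```

With this in place, the target site sees the proxy's address rather than the crawler's, which is exactly why the fields collected by this project (IP, port, type) are the ones that matter most.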

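2. Technical Feasibility Analysis

        First, a screenshot of the target page:

Clearly everything except the port is fairly easy to handle. Because the port is rendered as an image, tesseract-ocr is the natural choice for recognizing it. To imitate a human visitor directly and without obstruction, selenium is used to simulate user actions. The crawler as a whole is implemented with WebMagic, which is written in Java.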
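OCR output for a short digit string is often slightly wrong, so a cleanup step is usually needed before the text can be used as a port number. The following is a minimal sketch of such a step. The class `PortText` is hypothetical, the B/s/e substitutions follow the misreadings handled later in this article's code, and the O and l substitutions are extra assumptions.

```java
/** Hypothetical cleanup for OCR text that should be a TCP port. */
final class PortText {

    /**
     * Maps look-alike letters to digits, drops every other non-digit,
     * and validates the range. Returns -1 when no usable port remains.
     */
    static int parsePort(String raw) {
        if (raw == null) {
            return -1;
        }
        StringBuilder digits = new StringBuilder();
        for (char c : raw.toCharArray()) {
            switch (c) {
                case 'B': digits.append('8'); break;   // 8080 read as B080
                case 's': digits.append('5'); break;
                case 'e': digits.append('2'); break;
                case 'O':                              // assumption: not in the article
                case 'o': digits.append('0'); break;
                case 'l': digits.append('1'); break;   // assumption: not in the article
                default:
                    if (c >= '0' && c <= '9') {
                        digits.append(c);
                    }
            }
        }
        if (digits.length() == 0 || digits.length() > 5) {
            return -1;
        }
        int port = Integer.parseInt(digits.toString());
        return (port >= 1 && port <= 65535) ? port : -1;
    }
}
```

Returning a sentinel instead of throwing lets the caller decide whether to skip the row or retry. Note that a character-by-character mapping cannot tell a misread "8" from a spurious "B" in front of a two-digit port, which is the case the article's own code handles with a length check.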

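3. Implementation

        The project as a whole is organized as a SpringBoot project, as shown in the figure below:

The layers are clearly separated: dao is the data access layer, entity holds the database entities, service is the service layer, and task is the webmagic task layer.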

The Maven coordinates of the packages this project needs:

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
        <java.version>1.8</java.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
            <exclusions>
                <exclusion>
                    <groupId>org.springframework.boot</groupId>
                    <artifactId>spring-boot-starter-tomcat</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-devtools</artifactId>
            <scope>runtime</scope>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-configuration-processor</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jpa</artifactId>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
        </dependency>
        <!--
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>23.0</version>
        </dependency>
        -->
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>0.7.4</version>
        </dependency>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-extension</artifactId>
            <version>0.7.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
        </dependency>
        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-java</artifactId>
            <version>3.141.59</version>
        </dependency>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-selenium</artifactId>
            <version>0.7.4</version>
        </dependency>
        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-chrome-driver</artifactId>
        </dependency>
        <dependency>
            <groupId>net.sourceforge.tess4j</groupId>
            <artifactId>tess4j</artifactId>
            <version>4.5.4</version>
            <exclusions>
                <exclusion>
                    <groupId>net.java.dev.jna</groupId>
                    <artifactId>jna</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>net.sourceforge.lept4j</groupId>
                    <artifactId>lept4j</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>net.java.dev.jna</groupId>
            <artifactId>jna</artifactId>
            <version>4.4.0</version>
        </dependency>
        <dependency>
            <groupId>net.sourceforge.lept4j</groupId>
            <artifactId>lept4j</artifactId>
            <version>1.5.0</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>druid-spring-boot-starter</artifactId>
            <version>1.2.6</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
                <configuration>
                    <excludes>
                        <exclude>
                            <groupId>org.projectlombok</groupId>
                            <artifactId>lombok</artifactId>
                        </exclude>
                    </excludes>
                </configuration>
            </plugin>
        </plugins>
    </build>

The dao-layer interface:

package icu.smile.proxy.dao;

import icu.smile.proxy.entity.ProxyMiPu;
import org.springframework.data.jpa.repository.JpaRepository;

/**
 * Data access layer.
 *
 * @author starrysky
 * @since 2021/6/7
 */
public interface MiPuDao extends JpaRepository<ProxyMiPu, Long> {
}

The entity class used to interact with the database:

package icu.smile.proxy.entity;

import lombok.Data;
import lombok.experimental.Accessors;

import javax.persistence.*;

/**
 * Entity wrapper for one proxy record.
 *
 * @author starrysky
 * @since 2021/6/6
 */
@Entity
@Table(name = "proxy_mipu")
@Data
@Accessors(chain = true)
public class ProxyMiPu {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;
    private String ip;
    private Integer port;
    private String type;
    private String anonymous;
    private String location;
    private String operator;
    private String responseTime;
    private String transmissionTime;
    private String verificationTime;
}

The service layer:

package icu.smile.proxy.service;

import icu.smile.proxy.entity.ProxyMiPu;

import java.util.List;

/**
 * Service layer.
 *
 * @author starrysky
 * @since 2021/6/7
 */
public interface MiPuService {

    void save(ProxyMiPu proxyMiPu);

    List<ProxyMiPu> findAll(ProxyMiPu proxyMiPu);

    void saveAll(List<ProxyMiPu> entityList);
}

package icu.smile.proxy.service.impl;

import icu.smile.proxy.dao.MiPuDao;
import icu.smile.proxy.entity.ProxyMiPu;
import icu.smile.proxy.service.MiPuService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.domain.Example;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Propagation;
import org.springframework.transaction.annotation.Transactional;

import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;
import java.util.List;
import java.util.logging.Logger;

/**
 * Service implementation.
 *
 * @author starrysky
 * @since 2021/6/7
 */
@Service
public class MiPuServiceImpl implements MiPuService {

    @Autowired
    private MiPuDao miPuDao;

    @PersistenceContext
    protected EntityManager entityManager;

    private static final Logger LOGGER = Logger.getLogger(MiPuServiceImpl.class.getName());

    /** Saves the record only if no row with the same ip and port exists yet. */
    @Override
    @Transactional
    public void save(ProxyMiPu proxyMiPu) {
        ProxyMiPu miPum = new ProxyMiPu();
        miPum.setIp(proxyMiPu.getIp());
        miPum.setPort(proxyMiPu.getPort());
        List<ProxyMiPu> list = this.findAll(miPum);
        if (list.size() == 0) {
            this.miPuDao.saveAndFlush(proxyMiPu);
        }
    }

    /** Saves a whole batch without the duplicate check that save() performs. */
    @Override
    @Transactional(propagation = Propagation.REQUIRED)
    public void saveAll(List<ProxyMiPu> entityList) {
        miPuDao.saveAll(entityList);
    }

    /** Query by example: every non-null field of the probe is used as a filter. */
    @Override
    public List<ProxyMiPu> findAll(ProxyMiPu proxyMiPu) {
        Example<ProxyMiPu> example = Example.of(proxyMiPu);
        return this.miPuDao.findAll(example);
    }
}
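In the service above, `save` checks for an existing (ip, port) pair while `saveAll` does not. One way to avoid duplicates inside a single batch is to filter the list in memory before saving it. The sketch below shows that idea with a hypothetical `Endpoint` class standing in for `ProxyMiPu`. It is an illustration, not part of the original project.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory de-duplication by (ip, port). */
final class ProxyDedup {

    /** Minimal stand-in for the ProxyMiPu entity. */
    static final class Endpoint {
        final String ip;
        final int port;

        Endpoint(String ip, int port) {
            this.ip = ip;
            this.port = port;
        }
    }

    /** Keeps the first occurrence of every ip:port pair, preserving input order. */
    static List<Endpoint> distinct(List<Endpoint> input) {
        Map<String, Endpoint> byKey = new LinkedHashMap<>();
        for (Endpoint e : input) {
            byKey.putIfAbsent(e.ip + ":" + e.port, e);
        }
        return new ArrayList<>(byKey.values());
    }
}
```

This only removes duplicates within one batch. Rows already stored in the database would still need the query-by-example check from `save`, or a unique index on (ip, port).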

The webmagic task classes:

package icu.smile.proxy.task;

import org.apache.commons.lang3.StringUtils;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.logging.Logger;

/**
 * Crawler page processor.
 *
 * @author starrysky
 * @since 2021/6/6
 */
@Component
public class MiPuPageProcessor implements PageProcessor {

    private static final Logger LOGGER = Logger.getLogger(MiPuPageProcessor.class.getName());
    private static String cmdPrefix = "tesseract ";
    private static String cmdSuffix = " stdout";

    private Site site = Site.me()
            .setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36")
            .setDomain("proxy.mimvp.com")
            .addCookie("UMIMVPSESSID", "hct11388aottkqvlgkoecsqna6")
            .addCookie("Hm_lvt_51e3cc975b346e7705d8c255164036b3", "1622948301")
            .addCookie("Hm_lpvt_51e3cc975b346e7705d8c255164036b3", "1622949348")
            .setCharset("UTF-8")
            .setTimeOut(5000)
            .setRetrySleepTime(1000)
            .setRetryTimes(3);

    /**
     * Main processing method: extracts every column and queues follow-up pages.
     *
     * @param page the downloaded page
     */
    @Override
    public void process(Page page) {
        final List<String> proxyTabes = proxyTabes(page);
        page.addTargetRequests(proxyTabes);
        page.putField("ip", page.getHtml().xpath("//td[@class='free-proxylist-tbl-proxy-ip']/text()").all());
        page.putField("port", proxyPort(page));
        page.putField("type", page.getHtml().xpath("//td[@class='free-proxylist-tbl-proxy-type']/text()").all());
        page.putField("anonymous", page.getHtml().xpath("//td[@class='free-proxylist-tbl-proxy-anonymous']/text()").all());
        page.putField("location", fixLocation(page));
        page.putField("operator", fixOperator(page));
        page.putField("responseTime", page.getHtml().xpath("//td[@class='free-proxylist-tbl-proxy-pingtime']/@title").all());
        page.putField("transmissionTime", page.getHtml().xpath("//td[@class='free-proxylist-tbl-proxy-transfertime']/@title").all());
        page.putField("verificationTime", page.getHtml().xpath("//td[@class='free-proxylist-tbl-proxy-checkdtime']/text()").all());
        page.addTargetRequests(fixSmileUrl(page));
        page.addTargetRequests(fixListnav(page));
    }

    /**
     * Extracts the URLs of the three proxy categories from the page.
     *
     * @param page the downloaded page
     * @return the category URLs
     */
    public List<String> proxyTabes(Page page) {
        List<String> tabes = new ArrayList<>();
        for (String value : page.getHtml().css("div.free-proxytype-tabs").xpath("//a/@href").all()) {
            // The first 23 characters of the page URL are the scheme and host.
            tabes.add(page.getUrl().toString().substring(0, 23) + value);
        }
        return tabes;
    }

    /**
     * Extracts the images that hold the proxy ports and converts them to numbers.
     *
     * @param page the downloaded page
     * @return the recognized ports
     */
    public List<Integer> proxyPort(Page page) {
        List<Integer> protoPort = new ArrayList<>();
        for (String value : page.getHtml().xpath("//td[@class='free-proxylist-tbl-proxy-port']//img/@src").all()) {
            protoPort.add(convertImageInteger(page.getUrl().toString().substring(0, 23) + value));
        }
        return protoPort;
    }

    /**
     * Recognizes the port number in one image by calling tesseract.
     *
     * @param url URL of the image that shows the port
     * @return the port number, 0 if OCR output is unusable, null on an I/O error
     */
    public synchronized Integer convertImageInteger(String url) {
        try {
            Process process = Runtime.getRuntime().exec(cmdPrefix + url + cmdSuffix);
            String text;
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()))) {
                text = reader.readLine();
            }
            // Strip punctuation introduced by OCR noise.
            text = text.replaceAll("[,.!@#$%^&*]", "");
            if (text.contains("B")) {
                if (text.length() != 3) {
                    // 8080 recognized as B080.
                    text = text.replace("B", "8");
                } else {
                    // A two-digit port with a spurious B in front, e.g. B80.
                    // This rule is risky and will be improved later.
                    text = text.replace("B", "");
                }
            } else {
                text = text.replace("s", "5").replace("e", "2");
            }
            return Integer.parseInt(text);
        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception e) {
            LOGGER.info("OCR recognition failed; the port will be filled with 0.");
            return 0;
        }
        return null;
    }

    /** Cleans up the location text by removing parentheses and whitespace. */
    public List<String> fixLocation(Page page) {
        List<String> location = new ArrayList<>();
        for (String loc : page.getHtml().xpath("//td[@class='free-proxylist-tbl-proxy-country']/text()").all()) {
            location.add(loc.replace("(", "").replace(")", "").trim());
        }
        return location;
    }

    /**
     * Fixes the carrier description.
     *
     * @param page the downloaded page
     * @return the carrier descriptions
     */
    public List<String> fixOperator(Page page) {
        List<String> operator = new ArrayList<>();
        for (String oper : page.getHtml().xpath("//td[@class='free-proxylist-tbl-proxy-isp']/text()").all()) {
            operator.add(oper == null ? "暂无运营商" : oper);
        }
        return operator;
    }

    /**
     * Builds the URLs of the paginated list pages.
     *
     * @param page the downloaded page
     * @return the list page URLs
     */
    public List<String> fixListnav(Page page) {
        List<String> listnv = new ArrayList<>();
        final List<String> hrefs = page.getHtml().css("div#listnav").css("ul").xpath("//li//a/@href").all();
        // If the pager contains "...", generate the list from the numbers in the first and last href.
        if (page.getHtml().css("div#listnav").css("ul").xpath("//li/text()").all().contains("...")) {
            // Split the first href on "=", e.g. 【/freesecret?proxy,in_hp&sort=&pag,1】
            final String[] origin = hrefs.get(0).split("=");
            // Everything except the trailing page id.
            final String[] urlBody = Arrays.copyOf(origin, origin.length - 1);
            // The first id, e.g. 1.
            int firstId = Integer.parseInt(origin[origin.length - 1]);
            final String[] lastElement = hrefs.get(hrefs.size() - 1).split("=");
            // The last id, e.g. 48.
            int lastId = Integer.parseInt(lastElement[lastElement.length - 1]);
            // Generate every URL between firstId and lastId.
            for (int i = firstId; i <= lastId; i++) {
                listnv.add("/" + StringUtils.join(urlBody, "=") + "=" + i);
            }
            return listnv;
        }
        // No "..." means there are few pages, so take the URLs directly.
        for (String url : hrefs) {
            listnv.add("/" + url);
        }
        return listnv;
    }

    public List<String> fixSmileUrl(Page page) {
        List<String> smileurl = new ArrayList<>();
        for (String url : page.getHtml().xpath("//div[@class='free-httptype-tabs']//a/@href").all()) {
            smileurl.add("/" + url);
        }
        return smileurl;
    }

    @Override
    public Site getSite() {
        return site;
    }
}
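The page processor builds absolute URLs with `substring(0, 23)` and expands the pager by splitting hrefs on "=". A more general way to do both, sketched here with only the JDK, is to resolve hrefs against the page URL and to cut at the last "=". The class `PageLinks` and the sample URLs used with it are hypothetical, and the sketch assumes the page number is the last query parameter.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helpers for turning pager hrefs into crawlable URLs. */
final class PageLinks {

    /** Resolves an href against the page URL instead of slicing a fixed-length prefix. */
    static String absolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    /** Expands "...=first" through "...=last" into every page URL in between. */
    static List<String> expand(String firstHref, String lastHref) {
        int cut = firstHref.lastIndexOf('=');
        String prefix = firstHref.substring(0, cut + 1);          // everything up to and including "="
        int first = Integer.parseInt(firstHref.substring(cut + 1));
        int last = Integer.parseInt(lastHref.substring(lastHref.lastIndexOf('=') + 1));
        List<String> pages = new ArrayList<>();
        for (int i = first; i <= last; i++) {
            pages.add(prefix + i);
        }
        return pages;
    }
}
```

Resolving against the page URL keeps working if the host name changes length, which the fixed prefix of 23 characters does not.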

package icu.smile.proxy.task;

import icu.smile.proxy.entity.ProxyMiPu;
import icu.smile.proxy.service.MiPuService;
import lombok.extern.java.Log;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

/**
 * Persistence layer.
 *
 * @author starrysky
 * @since 2021/6/6
 */
@Component
@Log
public class MiPuPipeline implements Pipeline {

    @Autowired
    private MiPuService miPuService;

    private volatile static int curIndex = 0;
    private static final Logger LOGGER = Logger.getLogger(MiPuPipeline.class.getName());

    @Override
    public void process(ResultItems resultItems, Task task) {
        final List<String> ip = resultItems.get("ip");
        final List<Integer> port = resultItems.get("port");
        final List<String> type = resultItems.get("type");
        final List<String> anonymous = resultItems.get("anonymous");
        final List<String> location = resultItems.get("location");
        final List<String> operator = resultItems.get("operator");
        final List<String> responseTime = resultItems.get("responseTime");
        final List<String> transmissionTime = resultItems.get("transmissionTime");
        final List<String> verificationTime = resultItems.get("verificationTime");
        final List<ProxyMiPu> entityList = new ArrayList<>();
        if (ip.size() == 0 || port.size() == 0) {
            return;
        }
        for (int i = curIndex; i <= ip.size() - 1; i++) {
            // Create a new entity for every row. Reusing a single instance would
            // fill the list with references to one object holding the last row.
            final ProxyMiPu proxyMiPu = new ProxyMiPu();
            proxyMiPu.setIp(ip.get(i))
                    .setPort(port.get(i))
                    .setType(type.size() == 0 ? "页面无类型描述" : type.get(i))
                    .setAnonymous(anonymous.size() == 0 ? "页面无描述" : anonymous.get(i))
                    .setLocation(location.size() == 0 ? "页面无地址描述" : location.get(i))
                    .setOperator(operator.size() == 0 ? "页面无运营商描述" : operator.get(i))
                    .setResponseTime(responseTime.size() == 0 ? "页面无响应时间描述" : responseTime.get(i))
                    .setTransmissionTime(transmissionTime.size() == 0 ? "页面无传输时间描述" : transmissionTime.get(i))
                    .setVerificationTime(verificationTime.size() == 0 ? "页面无验证时间描述" : verificationTime.get(i));
            entityList.add(proxyMiPu);
            LOGGER.info(proxyMiPu.toString());
        }
        miPuService.saveAll(entityList);
        curIndex = 0;
    }
}
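The pipeline turns several parallel columns (ip, port, type, and so on) into one entity per row. The sketch below isolates that "zip" step with a hypothetical `Row` class in place of `ProxyMiPu`. It assumes that rows with a missing port should be skipped and that a short column should fall back to a placeholder, neither of which is stated in the original.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical column-to-row conversion. */
final class ColumnZip {

    /** Minimal stand-in for the ProxyMiPu entity. */
    static final class Row {
        final String ip;
        final int port;
        final String type;

        Row(String ip, int port, String type) {
            this.ip = ip;
            this.port = port;
            this.type = type;
        }
    }

    /** Returns column[i], or the fallback when the column is shorter than i + 1. */
    static String cell(List<String> column, int i, String fallback) {
        return (column != null && i < column.size()) ? column.get(i) : fallback;
    }

    /** Builds one new Row per index, skipping rows whose port could not be recognized. */
    static List<Row> zip(List<String> ips, List<Integer> ports, List<String> types) {
        int n = Math.min(ips.size(), ports.size());
        List<Row> rows = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            Integer port = ports.get(i);
            if (port == null) {
                continue;                     // OCR produced nothing usable for this row
            }
            rows.add(new Row(ips.get(i), port, cell(types, i, "n/a")));
        }
        return rows;
    }
}
```

Bounding the loop by the shorter of the two required columns avoids an `IndexOutOfBoundsException` when the XPath queries return lists of different lengths.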

The SpringBoot startup class:

package icu.smile;

import icu.smile.proxy.task.MiPuPageProcessor;
import icu.smile.proxy.task.MiPuPipeline;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.ConfigurableApplicationContext;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.downloader.selenium.SeleniumDownloader;
import us.codecraft.webmagic.scheduler.BloomFilterDuplicateRemover;
import us.codecraft.webmagic.scheduler.QueueScheduler;

/**
 * Crawls MiPu Proxy.
 *
 * @author starrysky
 * @since 2021/6/6
 */
@SpringBootApplication
public class MiPuProxyApplication {

    private static String URL = "";

    public static void main(String[] args) {
        final ConfigurableApplicationContext ctx = SpringApplication.run(MiPuProxyApplication.class, args);
        // The property name is spelled exactly as in the original code.
        System.setProperty("selenuim_config", "src/main/resources/config.ini");
        Spider.create(new MiPuPageProcessor())
                .addUrl(URL)
                .setDownloader(new SeleniumDownloader())
                .addPipeline(ctx.getBean(MiPuPipeline.class))
                .setScheduler(new QueueScheduler().setDuplicateRemover(new BloomFilterDuplicateRemover(10000000)))
                .thread(1)
                .runAsync();
    }
}

Files under resources:

application.yml:

server:
  port: 8849
spring:
  datasource:
    driver-class-name: com.mysql.cj.jdbc.Driver
    url: jdbc:mysql://47.111.237.28:3388/mipuproxy?characterEncoding=utf8&useSSL=false&serverTimezone=Asia/Shanghai
    username: root
    password: root
    type: com.alibaba.druid.pool.DruidDataSource
    # Connection pool settings
    druid:
      # Number of connections created at startup
      initial-size: 5
      # Minimum number of idle connections
      min-idle: 5
      # Maximum number of connections in the pool
      max-active: 20
      # Maximum wait time when acquiring a connection
      max-wait: 60000
      # Interval between checks for idle connections to close, in milliseconds
      time-between-eviction-runs-millis: 60000
      # Minimum time a connection stays in the pool
      min-evictable-idle-time-millis: 30000
      validation-query: SELECT 1 FROM DUAL
      test-while-idle: true
      test-on-borrow: true
      test-on-return: false
      # Whether to cache preparedStatements (PSCache). The official advice is to turn it off for MySQL; personally I suggest turning it on if you want the SQL firewall
      pool-prepared-statements: true
      max-pool-prepared-statement-per-connection-size: 20
      # Filters for monitoring statistics. Without them the monitoring page cannot show SQL statistics; 'wall' is the firewall filter
      filter:
        stat:
          merge-sql: true
          slow-sql-millis: 5000
      # Basic monitoring settings
      web-stat-filter:
        enabled: true
        url-pattern: /*
        # URLs excluded from statistics
        exclusions: "*.js,*.gif,*.jpg,*.png,*.css,*.ico,/druid/*"
        session-stat-enable: true
        session-stat-max-count: 100
      stat-view-servlet:
        enabled: true
        url-pattern: /druid/*
        reset-enable: true
        # Login name and password for the monitoring page
        login-username: admin
        login-password: admin
        # IPs allowed to access
        allow: 127.0.0.1
        # IPs denied access
        #deny: 192.168.1.100
      default-auto-commit: true
  jpa:
    database: mysql
    show-sql: true
    open-in-view: false
    hibernate:
      naming:
        physical-strategy: org.hibernate.boot.model.naming.PhysicalNamingStrategyStandardImpl
  devtools:
    restart:
      enabled: true

selenium.ini:

driver=chrome
chrome_exec_path=/usr/local/bin/chromedriver
safari_driver_loglevel=DEBUG
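The file above is a flat list of key=value pairs, so it can be inspected with `java.util.Properties`. The sketch below assumes that format and uses a hypothetical class `IniPeek`. It illustrates reading such a file and does not claim to be how webmagic-selenium loads it.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical reader for a flat key=value configuration text. */
final class IniPeek {

    /** Parses key=value lines into a Properties object. */
    static Properties parse(String text) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(text));
        } catch (IOException e) {
            // A StringReader does not fail in practice; rethrow unchecked just in case.
            throw new UncheckedIOException(e);
        }
        return properties;
    }
}
```

For example, `IniPeek.parse(text).getProperty("driver")` returns the configured browser name for the file shown above, which is a quick way to confirm that the file the application points at contains the expected keys.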


Published: 2024-02-17 19:05:54
Article link: https://www.elefans.com/category/jswz/34/1695159.html
Tags: webmagic, selenium, tesseract
