Crawling and Processing Example Sentences for 150,708 English Words

Updated: 2024-10-25 17:21:38


       As before, the content is crawled from the Bing online dictionary.

       The implementation mainly uses BeautifulSoup's select and w3lib.html's remove_tags. The code itself is simple; the main task is determining where the target example sentences sit in the page source, i.e. the following tag paths:

div > div > div > div > div.se_n_d
div > div > div > div > div > div > span.b_regtxt
div > div > div > div > div > div > a.p1-8.b_regtxt
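A tag path like the ones above can also be recovered programmatically rather than read off an IDE breadcrumb bar. The following is a minimal stdlib sketch (my illustration, not the author's code) that walks a made-up HTML fragment and records the open-tag path to every element carrying a target class:

```python
# Minimal sketch: recover a selector-style tag path with the stdlib HTML parser.
from html.parser import HTMLParser

class PathRecorder(HTMLParser):
    """Records the open-tag path to every element with a target class."""
    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.stack = []   # currently open tags, e.g. ['div', 'div']
        self.paths = []   # selector-style paths of matching elements

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get('class', '').split()
        if self.target_class in classes:
            self.paths.append(' > '.join(self.stack + [f'{tag}.{self.target_class}']))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

# A made-up fragment imitating the nesting described above (not real Bing markup).
html = '<div><div><span class="b_regtxt">He subcontracts the work.</span></div></div>'
rec = PathRecorder('b_regtxt')
rec.feed(html)
print(rec.paths[0])  # div > div > span.b_regtxt
```

This ignores void elements (no end tag), which is acceptable for a quick structural survey of a saved page.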

 

       Locating this by eye is difficult; PyCharm can help, as follows:

       1) First look up a word in the Bing online dictionary and confirm that Bing provides its pronunciation, definitions and example sentences (words missing any of these will not do);

       2) View the page source and copy all of it into a local HTML file;

       3) Open the HTML file created in step 2) in PyCharm;

       4) In PyCharm's menu bar choose Code, then Reformat Code, as shown in Figure 1;

 

Figure 1 The Reformat Code action

 

       5) Now, wherever the cursor is placed in the HTML file, the tag path at the cursor's position is shown at the bottom, as shown in Figure 2;

 

Figure 2 Tag path

 

       6) (Reportedly there is a feature that extracts the tag path directly, but I could not find it, so instead:) note down the tag path shown at the bottom and put it in the appropriate place in the Python program. Note that you do not need the full path PyCharm shows at the bottom, only the part starting from the beginning of the code block under the cursor; for the "program code" in Figure 2, that starting position is shown in Figure 4.

 

Figure 3 Cursor position

 

Figure 4 Where to start recording the tag path

 

       Next, prepare the data: put the files containing the words whose example sentences are to be crawled into a designated folder, as shown in Figure 5, and fill the corresponding path into the code.

 

Figure 5 Word files to crawl example sentences for

 

       The Python code used is shown below.

 

import urllib.request
import re
from bs4 import BeautifulSoup
from w3lib.html import remove_tags

# Matches URLs and bare domains, so lines containing them can be discarded.
myRule5 = r'((http|ftp|https)://)?(([a-zA-Z0-9\._-]+\.[a-zA-Z]{2,6})|([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}))(:[0-9]{1,4})*(/[a-zA-Z0-9\&%_\./-~-]*)?'
compile_name5 = re.compile(myRule5, re.M)
# Matches lines containing a '|' separator (navigation fragments).
myRule6 = r'.*?\|.*?'
compile_name6 = re.compile(myRule6, re.M)


def readData(dataPath, fileName):
    file = open(dataPath + fileName, 'r', encoding='UTF-8')
    nameList = file.readlines()
    return nameList


def grab(dataPath, fileName, resultPath, style):
    nameList = readData(dataPath, fileName)
    for word in nameList:
        # The dictionary base URL preceding "=" was elided in the original post.
        url = "=" + str(word).replace("\n", "")
        resp = urllib.request.urlopen(url)
        # Read the page source
        text = resp.read()
        soup = BeautifulSoup(text, style)
        print("word = " + str(word))
        myList = open(resultPath + str(word).replace("\n", "") + ".txt", mode='w', encoding='utf-8')
        # numbers = soup.select('div > div > div > div > div.se_n_d')
        # acronyms = soup.select('div > div > div > div > div > div > span.b_regtxt')
        # English_Words = soup.select('div > div > div > div > div > div > a.p1-8.b_regtxt')
        all = soup.select('div > div > div > div > div > div > *.b_regtxt')
        delete_small_letter_acronym = True
        i = 0
        for node in all:
            used = []
            plain = remove_tags(str(node))
            op = re.findall(myRule5, str(plain))
            qp = re.findall(myRule6, str(plain))
            # Keep only candidates that end in punctuation, start in lowercase
            # and contain neither a URL nor a '|' separator.
            if (plain[-1] != ' ' and plain[-1].isalnum() == False and plain[0].islower() == True and (len(op) == 0 and len(qp) == 0)):
                if (len(plain) >= 20 and delete_small_letter_acronym == False and plain.capitalize() not in used):
                    print(plain.capitalize() + "23", file=myList)
                    used.append(plain.capitalize())
                    i += 1
            if (plain[-1] != ' ' and len(plain) >= 20 and plain[0] != ' ' and plain[-1].isalnum() == False and plain not in used and plain.capitalize() not in used):
                if (plain[0].islower() == True and plain.capitalize() not in used):
                    print(plain.capitalize() + "24", file=myList)
                    used.append(plain.capitalize())
                else:
                    print(plain + "25", file=myList)
                    used.append(plain)
                delete_small_letter_acronym = False
                i += 1
        myList.close()


def is_Chinese(word):
    for ch in word:
        if '\u4e00' <= ch <= '\u9fff':
            return True
    return False


if __name__ == '__main__':
    dataPath = "/home/non_alphabetical/"
    fileName = "result13.txt"
    resultPath = "/home/Word_Meanings/result_13/"
    style = 'lxml'
    grab(dataPath, fileName, resultPath, style)

 

       After uploading the code to the server, I used screen to run multiple crawler processes in parallel to speed up the crawl, as shown in Figure 6.

 

Figure 6 16 crawler processes running in parallel
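Running 16 processes presupposes splitting the word list into 16 roughly equal pieces, one input file per screen session. A sketch of that split (a hypothetical helper, not the author's script):

```python
# Sketch: distribute a word list over n chunks whose sizes differ by at most one.
def split_chunks(words, n):
    size, extra = divmod(len(words), n)
    chunks, start = [], 0
    for i in range(n):
        # The first `extra` chunks absorb the remainder, one word each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(words[start:end])
        start = end
    return chunks

words = [f'word{i}' for i in range(100)]
chunks = split_chunks(words, 16)
print([len(c) for c in chunks])  # [7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6]
```

Each chunk would then be written to its own result file (e.g. result13.txt) and fed to one crawler process.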

 

 

Figure 7 Server status while the 16 crawler processes were running

 

       The resulting files are shown in Figure 8.

 

Figure 8 Crawled example-sentence files

 

       To make the files easy to import into an Oracle database, the text needs a little post-processing into the following format:

 

13386	subcontracts	Total subcontracting requirements enterprises must sign in subcontracting transactions with subcontracts .<br>要求总分包企业必须在分包交易中依法签订分包合同。<br>The Material Overhead Rate will be applied to the cost of materials, equipment and subcontracts.<br>材料间接价格包含在材料、仪器和转包合同之中。<br>The persons accepting the subcontracts shall have corresponding qualifications and may not subcontract it again to other persons.<br>接受分包的人应当具备相应的资格条件,并不得再次分包。<br>If, thereafter, their subcontracts are for some reason reduced, such firms can face potentially crippling fixed expenses.<br>如果其后由于某种原因其合同量减少,它们将会遇到潜在的财政危机。<br>The contract for arranging claim to residue is composed of two core subcontracts.<br>所以从契约角度来探讨剩余索取权的不足就抓住了研究的核心。<br>Localities to establish as soon as surveying, design, construction, and supervision units contracts and subcontracts for the record system.<br>各地要尽快建立起勘察、设计、施工、监理单位的合同及分包合同备案制度。<br>A car rental company subcontracts out the repair and maintenance of its fleet, and focuses on renting.<br>同样,一个汽车租赁公司也会把修理和维护自己车队的工作转包给其它公司,只专注于租赁业务本身。<br>Subcontracts the whole of the works or assigns the contract without the required agreement.<br>未按要求经过许可便擅自将整个工程分包出去或转让合同。<br>Strengthen subcontracts for management.<br>Carrying out the control and management of main contracts and subcontracts.<br>负责公司工程合同以及工程分包合同治理工作。<br>

 

       That is: the word's id, the word itself and its example sentences all on one line, separated by tabs, with a tab also appended at the end of the line. The following code does the processing; its mainPath variable is the common parent folder of the crawled example-sentence files.

 

# coding=utf-8
import datetime
import os
import re

myRule = r'\S$'
compile_name = re.compile(myRule, re.M)
# Strips the trailing "23"/"24"/"25" markers the crawler appended to each line.
myRule2 = r'23$|24$|25$'
compile_name2 = re.compile(myRule2, re.M)

if __name__ == '__main__':
    mainPath = "E://Document/English_Learning_Materials/Crawler/Word_Example_Sentences/"
    resultPath = "E://Document/English_Learning_Materials/Crawler/Word_Example_Sentences_Modified/"
    timeLogPath = "E://Document/English_Learning_Materials/Crawler/Word_Example_Sentences_Modified/timeLog/"
    starttime = datetime.datetime.now()
    accessTot = 0
    for cnt in range(0, 21):
        path = mainPath + "result_" + str(cnt) + "/"
        nameList = os.listdir(path)
        for fileName in nameList:
            print("accessTot: " + str(accessTot + 1))
            file = open(path + fileName, 'r', encoding='UTF-8').readlines()
            newFileName = "iCAN.csv"
            newFile = open(resultPath + newFileName, 'a', encoding='UTF-8')
            line = ""
            line += str(accessTot + 1) + "\t"
            line += str(fileName).replace(".txt", "")
            line += "\t"
            exampleSentences = ""
            lineTot = len(file)
            i = 0
            for st in file:
                st1 = re.sub(myRule2, "", str(st))
                # Note: i is never incremented, so every sentence (including the
                # last) gets a trailing '<br>', as in the sample record above.
                if (i != lineTot - 1):
                    newLine = str(st1).strip()
                    exampleSentences += newLine + '<br>'
                else:
                    newLine = str(st1).strip()
                    exampleSentences += newLine
            exampleSentences += "\t\n"
            line += exampleSentences
            newFile.write(line)
            newFile.close()
            accessTot += 1
    endtime = datetime.datetime.now()
    timeLogFile = open(timeLogPath + "totalRunningTime.log", 'w', encoding='UTF-8')
    print((endtime - starttime).seconds, file=timeLogFile)
    print((endtime - starttime).seconds)
    timeLogFile.close()

 

       The program merges all words and their example sentences into one file; the resulting .csv file is shown in Figure 9. Notably, the program above took far longer to run locally (324 seconds) than on the server (17 seconds), as shown in Figures 10 and 11. The local and server machine specs were described in the previous post; the server's Python version is 3.6.9, as shown in Figure 12.

 

Figure 9 Processed example-sentence file

 

Figure 10 Local running time of the merging code

 

Figure 11 Server running time of the merging code

 

Figure 12 Server Python version

 

       The resulting .csv file was imported into Oracle with the PL/SQL tool. First, create the table with the following command:

 

CREATE TABLE vocabulary_List_Words_Example_Sentences(
word_ID number NOT NULL PRIMARY KEY,
single_Word VARCHAR(100) NOT NULL,
Exmaple_Sentences VARCHAR(20000) NOT NULL
);

 

       Then import the .csv file into the new table with the Text Importer mentioned in the previous post.

       Querying the table after the import gives the result shown in Figure 13.

 

Figure 13 Querying the example-sentence table

 

Figure 14 Total number of words that have example sentences

 

       Next, merge the pronunciation-and-definition table from the previous post (vocabulary_List_words_pronunciations_meanings) with the example-sentence table obtained here (vocabulary_List_Words_Example_Sentences); both tables are assumed to already exist and be populated.

       Create a new table to hold each word's pronunciation, definitions and example sentences:

 

CREATE TABLE vocabulary_List_words_meanings_Example_Sentences(
word_ID number NOT NULL PRIMARY KEY,
single_Word VARCHAR(100) NOT NULL,
word_Meanings VARCHAR(5000) NOT NULL
);

 

       The new table has the same structure as the pronunciation-and-definition table. The idea is to import the pronunciation/definition data first, add a column, and then merge in the word/example-sentence table (the two existing tables could also be merged directly; this experiment takes the former approach). The commands used:

 

alter table vocabulary_List_words_meanings_Example_Sentences add Exmaple_Sentences VARCHAR(20000);

 

merge into vocabulary_List_words_meanings_Example_Sentences
using vocabulary_List_Words_Example_Sentences
on (vocabulary_List_words_meanings_Example_Sentences.single_Word = vocabulary_List_Words_Example_Sentences.single_Word)
when matched then UPDATE set vocabulary_List_words_meanings_Example_Sentences.Exmaple_Sentences = vocabulary_List_Words_Example_Sentences.Exmaple_Sentences;
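In spirit, the MERGE is the following dict update (my analogy in Python, not SQL): for every target row whose single_Word also appears in the source table, copy over the source's Exmaple_Sentences; unmatched rows are left untouched.

```python
# Dict analogy of MERGE ... WHEN MATCHED THEN UPDATE (illustration only).
target = {  # word -> example sentences (None = not yet filled)
    'subcontracts': None,
    'zyzzyva': None,   # hypothetical word with no crawled sentences
}
source = {'subcontracts': 'Total subcontracting requirements ...'}

for word in target:
    if word in source:            # "when matched then update"
        target[word] = source[word]

print(target['subcontracts'][:5], target['zyzzyva'])  # Total None
```

Rows with no match keep their NULL, which is exactly why the new column cannot be declared NOT NULL.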

 

       The merged table's query result is shown in Figure 15.

 

Figure 15 Querying the merged definition, pronunciation and example-sentence table

 

       Evidently not every word has example sentences, which is why the Exmaple_Sentences column added to vocabulary_List_words_meanings_Example_Sentences is not declared NOT NULL. The total record count after the merge is shown in Figure 16.

 

Figure 16 Record count after the merge

 

       Still 150,708, as expected. The table may contain duplicate records; query them with:

 

select * from vocabulary_List_words_meanings_Example_Sentences
where single_word in (
    select single_word from vocabulary_List_words_meanings_Example_Sentences
    group by single_word having count(single_word) > 1
);

 

       The result is shown in Figure 17.

 

Figure 17 Duplicate records in the merged pronunciation, definition and example-sentence table

 

Figure 18 Number of duplicate records in the merged table

 

       In fact the duplicates were not introduced by the merge; the original pronunciation-and-definition table already contained some redundant records. Remove the duplicates with:

 

DELETE from vocabulary_List_words_meanings_Example_Sentences
WHERE (single_word) IN (
    SELECT single_word FROM vocabulary_List_words_meanings_Example_Sentences
    GROUP BY single_word HAVING COUNT(single_word) > 1
)
AND ROWID NOT IN (
    SELECT MIN(ROWID) FROM vocabulary_List_words_meanings_Example_Sentences
    GROUP BY single_word HAVING COUNT(*) > 1
);
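The DELETE keeps, among rows sharing a single_word, only the one with the smallest ROWID. A stdlib sketch of the same keep-first deduplication (my illustration; list index plays ROWID's role):

```python
# Keep-first deduplication: earlier index stands in for the smaller ROWID.
def dedupe_keep_first(rows):
    """rows: list of (single_word, payload) pairs; keeps the first of each word."""
    seen, kept = set(), []
    for word, payload in rows:
        if word not in seen:
            seen.add(word)
            kept.append((word, payload))
    return kept

rows = [('abandon', 'v1'), ('zebra', 'x'), ('abandon', 'v2')]
print(dedupe_keep_first(rows))  # [('abandon', 'v1'), ('zebra', 'x')]
```

Which duplicate survives is arbitrary from the data's point of view (ROWID reflects physical storage), but for redundant rows with identical content that does not matter.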

 

       After deduplication, querying the duplicate records and their count again gives Figures 19 and 20.

 

Figure 19 Duplicate records in the merged table after deduplication

 

Figure 20 Number of duplicate records in the merged table after deduplication

 

       Finally, download links for the resulting files (both are archives; I set the download cost to 0 points, though there is no guarantee CSDN will keep it that way):

       SQL format

       TSV format

       The files are described in Table 1; each is available in both TSV and SQL format.

 

Table 1 Export details of the three word pronunciation/definition/example-sentence tables

File name | Description | Records
duplicated_Word_Meanings_Pronounciations | word/pronunciation/definition table, with duplicates | 150708
duplicated_Word_Example_Sentences | word/example-sentence table, with duplicates | 99756
duplicated_word_pronounciations_meanings_example_sentences | word/pronunciation/definition/example-sentence table, with duplicates | 150708
unduplicated_Word_Meanings_Pronounciations | word/pronunciation/definition table, deduplicated | 144790
unduplicated_Word_Example_Sentences | word/example-sentence table, deduplicated | 99756
unduplicated_word_pronounciations_meanings_example_sentences | word/pronunciation/definition/example-sentence table, deduplicated | 144790

 

       One closing aside: different ways of querying (or updating) a database can differ enormously in running time. Below are the two different commands used to update the word pronunciation/definition/example-sentence table (the table and column names differ slightly because I renamed them later; approach 1 was left running on my machine overnight); their running times are shown in Figures 21 and 22. The lesson: always understand what the database is actually doing.

 

--Approach 1
update vocabulary_List_words_meanings_Example_Sentences set
vocabulary_List_words_meanings_Example_Sentences.exmaple_sentences = (
select Exmaple_Sentences from vocabulary_List_Words_Example_Sentences
where vocabulary_List_words_meanings_Example_Sentences.single_word=vocabulary_List_Words_Example_Sentences.single_word);

 

--Approach 2
merge into vocabulary_List_words_meanings_Example_Sentences
using vocabulary_List_Words_Example_Sentences
on (vocabulary_List_words_meanings_Example_Sentences.single_Word = vocabulary_List_Words_Example_Sentences.single_Word)
when matched then UPDATE set vocabulary_List_words_meanings_Example_Sentences.Exmaple_Sentences = vocabulary_List_Words_Example_Sentences.Exmaple_Sentences;

 

Figure 21 Running time of approach 1

 

Figure 22 Running time of approach 2

 

       Approach 1 took 7437.104 s ≈ 123.95 min ≈ 2.07 h, while approach 2 took 36.148 s.
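The roughly 200× gap is consistent with the correlated subquery in approach 1 being re-evaluated once per target row (roughly O(n·m) without an index on single_word), while the MERGE in approach 2 can join both tables in a single pass (roughly O(n + m)). A pure-Python analogy of the two access patterns (my illustration, not a claim about Oracle's actual execution plans):

```python
# Per-row scan vs. one-time hash lookup: the shape of the two strategies.
def update_correlated(target, source_rows):
    # Analogue of the correlated subquery: linear scan of the source per row.
    for row in target:
        for word, sents in source_rows:
            if word == row['word']:
                row['sents'] = sents
                break

def update_hashed(target, source_rows):
    # Analogue of a hash join: build the lookup once, then probe it per row.
    lookup = dict(source_rows)
    for row in target:
        if row['word'] in lookup:
            row['sents'] = lookup[row['word']]

target = [{'word': f'w{i}', 'sents': None} for i in range(1000)]
source = [(f'w{i}', f's{i}') for i in range(1000)]
update_hashed(target, source)
print(target[999]['sents'])  # s999
```

Both produce the same result; only the amount of repeated scanning differs, which is what the 7437 s vs. 36 s measurement reflects.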

 

Published: 2024-02-25 22:08:49
Link: https://www.elefans.com/category/jswz/34/1700445.html