Text Summarization with the TextRank Algorithm (Python Code)


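Text summarization is an application of natural language processing (NLP), and as artificial intelligence advances it is likely to have a large impact on daily life. We live in an age of information overload: reading every new article, document and book published each day would take an enormous amount of time, so an algorithm that extracts the key information from an article is a real time-saver. Fortunately, the technology already exists. Have you come across the mobile app inshorts? It is a news app that condenses news articles into 60-word summaries. That is exactly what this article covers: automatic text summarization.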

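Automatic text summarization attracted attention as early as the 1950s. In the late 1950s Hans Peter Luhn published a research paper titled "The Automatic Creation of Literature Abstracts", which used features such as word frequency and phrase frequency to extract important sentences from a text for summarization.
Another important piece of work was done by Harold P. Edmundson in the late 1960s. It used cues such as the presence of cue words, words appearing in the article title, and sentence position to extract significant sentences for the summary. Since then, much important and exciting research has been published on the challenges of automatic text summarization.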

Below is a worked example that uses the TextRank algorithm to summarize tennis articles.

I. TextRank Algorithm Flow

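We use a set of tennis articles for this hands-on summarization exercise: several articles are taken as input and a single bullet-point summary is produced. Multi-domain text summarization is not covered here, but you can try it yourself after finishing the article.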

Dataset: https://s3-ap-south-1.amazonaws/av-blog-media/wp-content/uploads/2018/10/tennis_articles_v4.csv

III. Python Code

1. Import libraries

import numpy as np
import pandas as pd
import nltk
nltk.download('punkt') # one time execution
import re

2. Read and inspect the data

df = pd.read_csv("tennis_articles_v4.csv")
df.head()

3. Split the text into sentences

from nltk.tokenize import sent_tokenize
sentences = []
for s in df['article_text']:
  sentences.append(sent_tokenize(s))

sentences = [y for x in sentences for y in x] # flatten list

4. Download the GloVe word embeddings

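GloVe word embeddings are vector representations of words. These embeddings will be used to create vectors for our sentences. We could also build sentence features with bag-of-words or TF-IDF, but those methods ignore word order, and the number of features is usually quite large. We will use the pre-trained Wikipedia 2014 + Gigaword 5 GloVe vectors; the download is 822 MB.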

!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip

5. Extract the word embeddings (word vectors)

# Extract word vectors
word_embeddings = {}
f = open('glove.6B.100d.txt', encoding='utf-8')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    word_embeddings[word] = coefs
f.close()
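
The loop above relies on the GloVe text layout, where each line holds a token followed by its space-separated vector components. The following is a minimal sketch of the same parsing step on two made-up lines held in memory, so it can be run without the 822 MB download. The helper name load_embeddings, the sample tokens and the 3-dimensional vectors are assumptions for illustration and are not part of the original code.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout: a token followed by its
# space-separated vector components (3 dimensions here instead of 100).
sample = io.StringIO(
    "tennis 0.1 0.2 0.3\n"
    "court 0.4 0.5 0.6\n"
)

def load_embeddings(handle):
    """Parse GloVe-style lines from an open text handle into a dict (hypothetical helper)."""
    embeddings = {}
    for line in handle:
        values = line.split()
        if not values:
            continue  # skip blank lines
        # First field is the word, the rest are the vector components.
        embeddings[values[0]] = np.asarray(values[1:], dtype="float32")
    return embeddings

toy_embeddings = load_embeddings(sample)
print(toy_embeddings["court"])
```

With the real file, the same helper could be called inside a with open('glove.6B.100d.txt', encoding='utf-8') block, which also closes the file automatically.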

6. Text preprocessing

We apply some basic cleaning to the text so that noise in the data affects the summary as little as possible.

# remove punctuation, numbers and special characters
# (regex=True is needed: pandas 2.0+ treats the pattern as a literal string by default)
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)

# convert everything to lowercase
clean_sentences = [s.lower() for s in clean_sentences]
nltk.download('stopwords') # one time execution
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
# function to remove stopwords from a list of words
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new
# remove stopwords from the sentences
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]

7. Vector representation of sentences

sentence_vectors = []
for i in clean_sentences:
  if len(i) != 0:
    v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
  else:
    v = np.zeros((100,))
  sentence_vectors.append(v)
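
To make the averaging step concrete, here is a small sketch that builds one sentence vector from toy 3-dimensional embeddings. The embedding values, the helper name sentence_vector and the example sentence are invented for illustration; only the averaging rule (unknown words count as zero vectors, and 0.001 is added to the word count) follows the code above.

```python
import numpy as np

DIM = 3  # toy dimensionality; the article uses 100-dimensional GloVe vectors

# Hypothetical embeddings, invented for this illustration only.
toy_embeddings = {
    "federer": np.array([1.0, 0.0, 0.0]),
    "wins": np.array([0.0, 1.0, 0.0]),
}

def sentence_vector(sentence, embeddings, dim=DIM):
    """Average the word vectors of a cleaned sentence; unknown words count as zeros."""
    words = sentence.split()
    if not words:
        return np.zeros(dim)
    total = sum(embeddings.get(w, np.zeros(dim)) for w in words)
    # The small constant mirrors the article's guard against division by zero.
    return total / (len(words) + 0.001)

# "match" is not in the toy dictionary, so it contributes a zero vector.
vec = sentence_vector("federer wins match", toy_embeddings)
print(vec)
```

Because out-of-vocabulary words are averaged in as zeros, a sentence with many unknown words ends up with a shorter vector; cosine similarity is insensitive to that scaling, which is one reason the approach still works reasonably in the next step.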

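8. Create the similarity matrix

To measure how similar the sentences are to one another, we use cosine similarity. We first create an empty similarity matrix and then fill it with the cosine similarity of each pair of sentences.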

# similarity matrix
sim_mat = np.zeros([len(sentences), len(sentences)])
from sklearn.metrics.pairwise import cosine_similarity
for i in range(len(sentences)):
  for j in range(len(sentences)):
    if i != j:
      sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
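
The nested loop above calls cosine_similarity once per sentence pair, which can be slow for a large number of sentences. As an optional alternative, the sketch below computes the whole matrix in one step with NumPy by normalising each vector and taking dot products. The function name cosine_similarity_matrix and the toy vectors are assumptions for illustration; this is not the original author's code.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity with a zeroed diagonal (hypothetical helper)."""
    mat = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero vectors get similarity 0 instead of NaN
    unit = mat / norms
    sim = unit @ unit.T  # dot products of unit vectors are cosine similarities
    np.fill_diagonal(sim, 0.0)  # a sentence is not compared with itself
    return sim

# Toy 2-dimensional "sentence vectors"; the last one is empty (all zeros).
toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
toy_sim = cosine_similarity_matrix(toy_vectors)
print(np.round(toy_sim, 3))
```

With the article's data, the call would be cosine_similarity_matrix(sentence_vectors), and the result could be used in place of sim_mat.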

9. Apply the TextRank algorithm

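Here we convert the similarity matrix sim_mat into a graph. The nodes of the graph represent sentences, and the edges carry the similarity scores between sentences. We then run the PageRank algorithm on this graph to rank the sentences.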

import networkx as nx
nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)
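
For readers who want to see what the ranking step does, the following is a rough sketch of PageRank as a power iteration written directly in NumPy. It is a simplified illustration, not the networkx implementation: the function name pagerank_scores, the tolerance, the iteration limit and the toy matrix are assumptions, and the damping factor 0.85 is chosen to match the default alpha of nx.pagerank.

```python
import numpy as np

def pagerank_scores(sim, damping=0.85, tol=1e-8, max_iter=100):
    """Power-iteration PageRank on a non-negative similarity matrix (illustrative sketch)."""
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    row_sums = sim.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    # Row-normalise edge weights; nodes with no edges jump uniformly to any node.
    transition = np.where(row_sums > 0, sim / safe_sums, 1.0 / n)
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # Each node keeps a base score and receives weighted votes from its neighbours.
        updated = (1 - damping) / n + damping * (transition.T @ scores)
        if np.abs(updated - scores).sum() < tol:
            return updated
        scores = updated
    return scores

# Toy similarity matrix: sentence 0 is strongly similar to both other sentences.
toy_sim = [
    [0.0, 0.9, 0.8],
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
]
toy_scores = pagerank_scores(toy_sim)
print(toy_scores)
```

In this toy graph, sentence 0 receives the highest score because it is the one most similar to the others, which is the intuition behind selecting top-ranked sentences as the summary.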

#Summary Extraction
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)
# Extract top 10 sentences as the summary
for i in range(10):
  print(ranked_sentences[i][1])

Output

When I'm on the courts or when I'm on the court playing, I'm a competitor and I want to beat every single person 
whether they're in the locker room or across the net.So I'm not the one to strike up a conversation about the 
weather and know that in the next few minutes I have to go and try to win a tennis match.

Major players feel that a big event in late November combined with one in January before the Australian Open will 
mean too much tennis and too little rest.

Speaking at the Swiss Indoors tournament where he will play in Sundays final against Romanian qualifier Marius 
Copil, the world number three said that given the impossibly short time frame to make a decision, he opted out of 
any commitment.

"I felt like the best weeks that I had to get to know players when I was playing were the Fed Cup weeks or the 
Olympic weeks, not necessarily during the tournaments.

Currently in ninth place, Nishikori with a win could move to within 125 points of the cut for the eight-man event 
in London next month.

He used his first break point to close out the first set before going up 3-0 in the second and wrapping up the 
win on his first match point.

The Spaniard broke Anderson twice in the second but didn't get another chance on the South African's serve in the 
final set.

"We also had the impression that at this stage it might be better to play matches than to train.

The competition is set to feature 18 countries in the November 18-24 finals in Madrid next year, and will replace 
the classic home-and-away ties played four times per year for decades.

Federer said earlier this month in Shanghai in that his chances of playing the Davis Cup were all but non-existent.

The full code and data are available on GitHub: https://github/prateekjoshi565/textrank_text_summarization


