Natural Language Processing (7): AG_NEWS News Classification Task (TORCHTEXT)


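Natural Language Processing notes: table of contents

About the news topic classification task: given the text of a news report as input, a model predicts which news category it most likely belongs to. This is a typical text classification problem. We assume the categories are mutually exclusive, so each text belongs to exactly one category.

This example is taken from the official PyTorch tutorial TEXT CLASSIFICATION WITH THE TORCHTEXT LIBRARY, with full comments and plain-language explanations added.

The example is divided into the following nine steps.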

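Step 1: Access to the raw dataset iterators

About the AG_NEWS dataset:

AG_NEWS is a news corpus covering four news categories: World, Sports, Business, and Sci/Tec.

AG_NEWS contains 120000 training samples (train.csv) and 7600 test samples (test.csv). Each category has 30000 training samples and 1900 test samples.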

import torch
from torchtext.datasets import AG_NEWS
train_iter = AG_NEWS(split='train')

This returns an iterator over the training set. Its contents can be inspected as follows:

next(train_iter)
>>> (3, "Wall St. Bears Claw Back Into the Black (Reuters) Reuters -
Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green
again.")

next(train_iter)
>>> (3, 'Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private
investment firm Carlyle Group,\\which has a reputation for making well-timed
and occasionally\\controversial plays in the defense industry, has quietly
placed\\its bets on another part of the market.')

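Step 2: Prepare data processing pipelines

Before training, the news data has to be processed: the text is tokenized and a vocabulary (vocab) is built.

get_tokenizer provides the tokenizer, and build_vocab_from_iterator builds a vocabulary from an iterator.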

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

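tokenizer = get_tokenizer('basic_english')  # basic English tokenizer
train_iter = AG_NEWS(split='train')  # iterator over the training data

# Generator that yields the token list of each text
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

# Build the vocabulary
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
# Set the default index: a word that is not in the vocabulary maps to the index of "<unk>", which is 0
vocab.set_default_index(vocab["<unk>"])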

vocab(['here', 'is', 'an', 'example'])
>>> [475, 21, 30, 5286]
print(vocab(["haha", "hehe", "xixi"]))
>>> [0, 0, 0]

Next, build the pipelines from the tokenizer and the vocabulary:

text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) - 1

text_pipeline('here is an example')
>>> [475, 21, 30, 5286]
label_pipeline('10')
>>> 9
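To make the text pipeline concrete, here is a minimal pure-Python sketch of the same idea: tokenize, look up each token in a vocabulary, and fall back to the `<unk>` index for unknown words. The names `toy_tokenizer`, `build_toy_vocab`, and `toy_text_pipeline`, the three-sentence corpus, and the tie-breaking order are hypothetical simplifications and are not torchtext's actual implementation.

```python
from collections import Counter

def toy_tokenizer(text):
    # Hypothetical simplification: lowercase and split on whitespace only.
    return text.lower().split()

def build_toy_vocab(token_lists, specials=("<unk>",)):
    # Count every token, then assign indices: specials first,
    # followed by tokens in descending frequency (ties broken alphabetically).
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    by_freq = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    ordered = list(specials) + [tok for tok, _ in by_freq if tok not in specials]
    return {tok: idx for idx, tok in enumerate(ordered)}

corpus = [
    "Stocks rally as markets open",
    "Markets close lower",
    "Team wins the final",
]
toy_vocab = build_toy_vocab(toy_tokenizer(s) for s in corpus)
unk_index = toy_vocab["<unk>"]

def toy_text_pipeline(text):
    # Unknown tokens fall back to the index of "<unk>".
    return [toy_vocab.get(tok, unk_index) for tok in toy_tokenizer(text)]

ids = toy_text_pipeline("Markets rally haha")
print(ids)
```

The out-of-vocabulary word "haha" maps to index 0, mirroring what vocab.set_default_index does in the real pipeline.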

Step 3: Generate data batch and iterator

from torch.utils.data import DataLoader
# Use the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# collate_batch is passed to DataLoader and turns a list of samples into one batch
def collate_batch(batch):
    # Lists holding the labels and texts; offsets holds the start position of each text
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        # Append the length of each sample to offsets
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    # The cumulative sum of the lengths gives the start offset of each text
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label_list.to(device), text_list.to(device), offsets.to(device)

train_iter = AG_NEWS(split='train')
dataloader = DataLoader(train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch)

cumsum() computes the cumulative sum of the elements of an array. For example, with NumPy:

>>> import numpy as np
>>> a = [1, 2, 3, 4, 5, 6, 7]
>>> np.cumsum(a)
array([ 1,  3,  6, 10, 15, 21, 28])
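The offsets logic in collate_batch can be traced on a toy batch. The sketch below uses only the standard library (itertools.accumulate stands in for the tensor cumsum), and the token ids are made-up values chosen for illustration.

```python
from itertools import accumulate

# Hypothetical token ids for a batch of three texts of lengths 3, 2 and 4.
token_ids_per_sample = [[4, 8, 15], [16, 23], [42, 7, 7, 1]]

# Same bookkeeping as collate_batch: start with [0], then append each length.
offsets = [0]
for ids in token_ids_per_sample:
    offsets.append(len(ids))

# Drop the last length and take the cumulative sum to get start positions.
offsets = list(accumulate(offsets[:-1]))

# All texts are concatenated into one flat sequence.
flat = [tok for ids in token_ids_per_sample for tok in ids]

# The offsets are enough to cut the flat sequence back into the original texts.
bounds = offsets + [len(flat)]
recovered = [flat[start:end] for start, end in zip(bounds, bounds[1:])]

print(offsets)
print(recovered == token_ids_per_sample)
```

Each offset marks where one text begins inside the concatenated sequence, which is the form nn.EmbeddingBag expects for variable-length inputs.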

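Step 4: Define the model

Define the neural network model. It consists of an nn.EmbeddingBag layer followed by a fully connected (linear) layer for classification; there is no additional hidden layer. With its default mode, "mean", nn.EmbeddingBag averages the embeddings of all tokens in each text.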

from torch import nn

class TextClassificationModel(nn.Module):

    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)
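As a rough illustration of what the embedding layer computes in its default "mean" mode, the following pure-Python sketch averages the embedding rows of each bag, using a flat id list plus offsets. The function name `embedding_bag_mean` and the tiny weight table are hypothetical, and the sketch assumes every bag is non-empty.

```python
def embedding_bag_mean(weight, flat_ids, offsets):
    # weight: list of embedding rows, one row per vocabulary index
    # flat_ids: all token ids of the batch concatenated into one list
    # offsets: start position of each text (bag) inside flat_ids
    bounds = list(offsets) + [len(flat_ids)]
    dim = len(weight[0])
    output = []
    for start, end in zip(bounds, bounds[1:]):
        rows = [weight[i] for i in flat_ids[start:end]]
        # Average the rows dimension by dimension (assumes the bag is non-empty).
        output.append([sum(row[d] for row in rows) / len(rows) for d in range(dim)])
    return output

# Hypothetical 4-word vocabulary with 2-dimensional embeddings.
weight = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
flat_ids = [1, 2, 3, 1]   # two texts: [1, 2] and [3, 1]
offsets = [0, 2]

bags = embedding_bag_mean(weight, flat_ids, offsets)
print(bags)
```

Each text is reduced to a single fixed-size vector regardless of its length, which is why the linear layer that follows can take embed_dim inputs.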

Step 5: Initiate an instance

The AG_NEWS dataset has four labels, so the number of classes is four:

1 : World
2 : Sports
3 : Business
4 : Sci/Tec

Instantiate the model:

train_iter = AG_NEWS(split='train')
num_class = len(set([label for (label, text) in train_iter]))  # number of classes
vocab_size = len(vocab)  # vocabulary size
emsize = 64  # embedding dimension
model = TextClassificationModel(vocab_size, emsize, num_class).to(device)

Step 6: Define functions to train the model and evaluate results

import time

def train(dataloader):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 500
    start_time = time.time()

    for idx, (label, text, offsets) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(text, offsets)
        loss = criterion(predicted_label, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches '
                  '| accuracy {:8.3f}'.format(epoch, idx, len(dataloader),
                                              total_acc/total_count))
            total_acc, total_count = 0, 0
            start_time = time.time()

def evaluate(dataloader):
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            predicted_label = model(text, offsets)
            loss = criterion(predicted_label, label)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc/total_count

Gradient clipping with torch.nn.utils.clip_grad_norm_() should be applied after loss.backward() and before optimizer.step().

Note that it is used only during training, not during validation or testing.
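The idea behind norm-based gradient clipping can be sketched in plain Python: compute the overall L2 norm of the gradients and, if it exceeds the threshold, scale every gradient by the same factor. This is a simplified stand-in operating on a flat list of numbers; the helper name `clip_by_total_norm` and the example gradients are hypothetical, and the small eps term is an assumption made to avoid division by zero.

```python
import math

def clip_by_total_norm(grads, max_norm, eps=1e-6):
    # Overall L2 norm across all gradient values.
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Scale factor is capped at 1.0, so small gradients are left untouched.
    clip_coef = min(1.0, max_norm / (total_norm + eps))
    return [g * clip_coef for g in grads], total_norm

# Hypothetical gradients with norm 5.0, clipped to the 0.1 used in train().
clipped, norm_before = clip_by_total_norm([3.0, 4.0], max_norm=0.1)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)

# Gradients already below the threshold pass through unchanged.
small, _ = clip_by_total_norm([0.03, 0.04], max_norm=0.1)
print(small)
```

Because all gradients are scaled by one common factor, the direction of the update is preserved and only its magnitude is limited.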

Step 7: Split the dataset and run the model

Split the training set: 95% for training and 5% for validation, using torch.utils.data.dataset.random_split.

to_map_style_dataset converts the dataset from iterable style to map style, so it can be indexed directly.
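The sizes implied by this split can be worked out by hand. The sketch below assumes the 120000 training samples stated earlier and a batch size of 64, and assumes the last, partially filled batch is kept.

```python
import math

total_train = 120000      # AG_NEWS training samples, as stated earlier
batch_size = 64           # assumed batch size for training

num_train = int(total_train * 0.95)    # samples kept for training
num_valid = total_train - num_train    # samples held out for validation

# The final partial batch is kept, so round up.
batches_per_epoch = math.ceil(num_train / batch_size)

print(num_train, num_valid, batches_per_epoch)
```

The resulting number of batches per epoch agrees with the 1782 batches reported in the training log of this step.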

from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset
# Hyperparameters
EPOCHS = 10 # epoch
LR = 5  # learning rate
BATCH_SIZE = 64 # batch size for training

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)

total_accu = None

train_iter, test_iter = AG_NEWS()
train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)

num_train = int(len(train_dataset) * 0.95)
split_train_, split_valid_ = \
    random_split(train_dataset, [num_train, len(train_dataset) - num_train])

train_dataloader = DataLoader(split_train_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
valid_dataloader = DataLoader(split_valid_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                             shuffle=True, collate_fn=collate_batch)

for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(train_dataloader)
    accu_val = evaluate(valid_dataloader)
    if total_accu is not None and total_accu > accu_val:
        scheduler.step()
    else:
        total_accu = accu_val
    print('-' * 59)
    print('| end of epoch {:3d} | time: {:5.2f}s | valid accuracy {:8.3f} '
          .format(epoch, time.time() - epoch_start_time, accu_val))
    print('-' * 59)

Output:

| epoch   1 |   500/ 1782 batches | accuracy    0.689
| epoch   1 |  1000/ 1782 batches | accuracy    0.856
| epoch   1 |  1500/ 1782 batches | accuracy    0.876
-----------------------------------------------------------
| end of epoch   1 | time:  8.17s | valid accuracy    0.882
-----------------------------------------------------------
| epoch   2 |   500/ 1782 batches | accuracy    0.897
| epoch   2 |  1000/ 1782 batches | accuracy    0.904
| epoch   2 |  1500/ 1782 batches | accuracy    0.900
-----------------------------------------------------------
| end of epoch   2 | time:  8.39s | valid accuracy    0.893
-----------------------------------------------------------
| epoch   3 |   500/ 1782 batches | accuracy    0.914
| epoch   3 |  1000/ 1782 batches | accuracy    0.916
| epoch   3 |  1500/ 1782 batches | accuracy    0.913
-----------------------------------------------------------
| end of epoch   3 | time:  8.44s | valid accuracy    0.903
-----------------------------------------------------------
| epoch   4 |   500/ 1782 batches | accuracy    0.924
| epoch   4 |  1000/ 1782 batches | accuracy    0.923
| epoch   4 |  1500/ 1782 batches | accuracy    0.924
-----------------------------------------------------------
| end of epoch   4 | time:  8.43s | valid accuracy    0.908
-----------------------------------------------------------
| epoch   5 |   500/ 1782 batches | accuracy    0.932
| epoch   5 |  1000/ 1782 batches | accuracy    0.930
| epoch   5 |  1500/ 1782 batches | accuracy    0.926
-----------------------------------------------------------
| end of epoch   5 | time:  8.37s | valid accuracy    0.903
-----------------------------------------------------------
| epoch   6 |   500/ 1782 batches | accuracy    0.941
| epoch   6 |  1000/ 1782 batches | accuracy    0.943
| epoch   6 |  1500/ 1782 batches | accuracy    0.941
-----------------------------------------------------------
| end of epoch   6 | time:  8.14s | valid accuracy    0.908
-----------------------------------------------------------
| epoch   7 |   500/ 1782 batches | accuracy    0.944
| epoch   7 |  1000/ 1782 batches | accuracy    0.942
| epoch   7 |  1500/ 1782 batches | accuracy    0.944
-----------------------------------------------------------
| end of epoch   7 | time:  8.15s | valid accuracy    0.907
-----------------------------------------------------------
| epoch   8 |   500/ 1782 batches | accuracy    0.943
| epoch   8 |  1000/ 1782 batches | accuracy    0.943
| epoch   8 |  1500/ 1782 batches | accuracy    0.945
-----------------------------------------------------------
| end of epoch   8 | time:  8.15s | valid accuracy    0.907
-----------------------------------------------------------
| epoch   9 |   500/ 1782 batches | accuracy    0.943
| epoch   9 |  1000/ 1782 batches | accuracy    0.944
| epoch   9 |  1500/ 1782 batches | accuracy    0.945
-----------------------------------------------------------
| end of epoch   9 | time:  8.15s | valid accuracy    0.907
-----------------------------------------------------------
| epoch  10 |   500/ 1782 batches | accuracy    0.943
| epoch  10 |  1000/ 1782 batches | accuracy    0.944
| epoch  10 |  1500/ 1782 batches | accuracy    0.945
-----------------------------------------------------------
| end of epoch  10 | time:  8.15s | valid accuracy    0.907
-----------------------------------------------------------
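The training loop calls scheduler.step() only when the validation accuracy drops below the best value seen so far, and each step multiplies the learning rate by gamma. A small simulation of that rule is sketched below; the function `simulate_lr` and the accuracy sequence are hypothetical and are not taken from the log above.

```python
def simulate_lr(valid_accuracies, lr=5.0, gamma=0.1):
    # best plays the role of total_accu in the training loop.
    best = None
    history = []
    for acc in valid_accuracies:
        if best is not None and best > acc:
            # Validation accuracy got worse: decay the learning rate.
            lr *= gamma
        else:
            # Otherwise remember the new best accuracy.
            best = acc
        history.append(lr)  # learning rate in effect after this epoch
    return history

# Hypothetical validation accuracies for five epochs.
hypothetical_acc = [0.88, 0.89, 0.90, 0.89, 0.91]
lrs = simulate_lr(hypothetical_acc)
print(lrs)
```

Once accuracy plateaus, repeated decays shrink the learning rate quickly, which would be consistent with the nearly constant accuracies in the later epochs of the log, although the log alone does not show the learning rate.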

Step 8: Evaluate the model with test dataset

Check the model's performance on the test set:

print('Checking the results of test dataset.')
accu_test = evaluate(test_dataloader)
print('test accuracy {:8.3f}'.format(accu_test))

Output:

Checking the results of test dataset.
test accuracy    0.909

Step 9: Test on a random news

Feed in an arbitrary piece of news to test the model:

ag_news_label = {1: "World",
                 2: "Sports",
                 3: "Business",
                 4: "Sci/Tec"}

def predict(text, pipeline):
    with torch.no_grad():
        text = torch.tensor(pipeline(text))
        output = model(text, torch.tensor([0]))
        return output.argmax(1).item() + 1

ex_text_str = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."

model = model.to('cpu')
res = predict(ex_text_str, text_pipeline)
print("This is a %s news" % ag_news_label[res])

Result:

This is a Sports news
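The last step of predict, turning the model's output scores into a label name, can be sketched without PyTorch. The helper `label_from_logits` and the score values are hypothetical; the "+ 1" undoes the "- 1" applied by label_pipeline, since the dataset labels run from 1 to 4 while class indices run from 0 to 3.

```python
ag_news_label = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tec"}

def label_from_logits(logits):
    # argmax: index of the largest score (0-based class index).
    class_index = max(range(len(logits)), key=lambda i: logits[i])
    # Shift back to the dataset's 1-based label before the lookup.
    return ag_news_label[class_index + 1]

# Hypothetical scores for the four classes.
hypothetical_logits = [-1.2, 3.4, 0.1, -0.5]
predicted = label_from_logits(hypothetical_logits)
print("This is a %s news" % predicted)
```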
