Stemming documents in R (text mining)

Updated: 2024-10-23 15:24:40


Question


My data is a txt file that looks like this:
words number_doc
overwiew 1
client 1
store 1
marge 1
price 2
stock 2
economics 2

The document numbers are sorted in ascending order. For each document I want all the words that belong to it. At the moment the words sit in a single column, but I need all the words of a document in a TextDocument (from the tm package, because some functions in that package require it). I did it as follows:

 data <- read.table("poging.txt", header = TRUE)
 data

 doc <- c()
 # Paste all the words of each document together
 doc[1] <- paste(data[1:4, 1], collapse = ' ')  # rows 1-4 belong to document 1
 doc[2] <- paste(data[5:7, 1], collapse = ' ')  # rows 5-7 belong to document 2

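 #Make a data.frame of it
 #(recent tm versions expect the columns to be named doc_id and text)
 doc_df <- data.frame(doc_id = c("1", "2"), text = doc, stringsAsFactors = FALSE)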

 #Install package
 install.packages("tm")
 library(tm)

 #Make a DataframeSource of it so that each row is treated as a document
 ds <- DataframeSource(doc_df)
 inspect(VCorpus(ds))

 #Now I want to stem for example document number 1
 stemDocument(ds[[1]])

But using ds[[1]] as the argument doesn't work: it can't find document number 1. Can someone help me?

The examples in the tm package use the data set crude. I want my data to be in the same format as crude.


Recommended answer

stemDocument() is meant to be used with a TextDocument, not with a source object such as the DataframeSource here. Use the source to create a corpus first; you can then extract the documents from the corpus.

ds <- DataframeSource(doc_df)
corpus <- VCorpus(ds)
stemDocument(corpus[[1]])
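
The manual paste() calls in the question can also be generalised so that the row ranges do not have to be typed by hand. The following is only a minimal sketch, assuming the same two-column layout as poging.txt: the inline data frame is a stand-in for the file, and VectorSource is swapped in for DataframeSource to avoid any column-naming requirements. It also compares the result with crude, which the question mentions as the target format.

```r
library(tm)

# Stand-in for read.table("poging.txt", header = TRUE); same columns as in the question
data <- data.frame(
  words = c("overwiew", "client", "store", "marge", "price", "stock", "economics"),
  number_doc = c(1, 1, 1, 1, 2, 2, 2),
  stringsAsFactors = FALSE
)

# Group the words by document number, then paste each group into one string
words_by_doc <- split(data$words, data$number_doc)
doc <- vapply(words_by_doc, paste, character(1), collapse = " ")

# VectorSource treats every element of a character vector as one document
corpus <- VCorpus(VectorSource(unname(doc)))
inspect(corpus)

# Compare the document class with the documents in tm's crude data set
data("crude")
class(corpus[[1]])
class(crude[[1]])
```

If both class() calls report a TextDocument subclass, the documents can be passed to the same tm functions as those in crude.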

Note that stemDocument() returns a new document and does not update the corpus in place. So if you want to do anything with the output, be sure to save it somewhere.
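
One way to keep the stemmed output is sketched below. It assumes the SnowballC package is installed (stemDocument() relies on it), and the two-document corpus is a hypothetical stand-in built from the question's words; the variable names stemmed_doc1 and corpus_stemmed are made up for illustration.

```r
library(tm)  # stemDocument() additionally needs the SnowballC package installed

# Hypothetical two-document corpus standing in for the one from the question
corpus <- VCorpus(VectorSource(c("overwiew client store marge",
                                 "price stock economics")))

# Keep the stemmed copy of a single document in its own variable
stemmed_doc1 <- stemDocument(corpus[[1]])
content(stemmed_doc1)

# The document inside the corpus still holds the unstemmed text
content(corpus[[1]])

# To stem every document, tm_map() returns a new corpus that can be stored
corpus_stemmed <- tm_map(corpus, stemDocument)
content(corpus_stemmed[[2]])
```

Assigning the result of tm_map() to a new name leaves the original corpus available for comparison; assigning it back to corpus would overwrite the unstemmed version instead.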


Published: 2023-04-30 05:17:13

Article link: https://www.elefans.com/category/jswz/34/1390197.html