Problem description
I am creating a corpus from a dataframe. I pass it as a VectorSource because there is only one column I want to use as the text source. This works fine, but I need the document ids within the corpus to match the document ids from the dataframe. The document ids are stored in a separate column in the original dataframe.
# Example data frame: document ids in one column, text in another
df <- data.frame(
  ids = c(1, 3, 5, 7, 8, 10),
  textColumn = c("text", "lots of text", "too much text", "where will it end", "give peas a chance", "help"),
  stringsAsFactors = FALSE
)
library("tm")
library("lsa")
# Build the corpus from the text column only
corpus <- Corpus(VectorSource(df[["textColumn"]]))
Running this code creates a corpus, but the document ids run from 1 to 6. Is there any way to create the corpus with the document ids 1, 3, 5, 7, 8, 10?
Recommended answer
Well, one simple but not very elegant way to assign your ids to your documents afterward could be the following:
# Overwrite each document's "ID" attribute with the matching id from the data frame
for (i in seq_along(corpus)) {
  attr(corpus[[i]], "ID") <- df$ids[i]
}
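The loop above sets an "ID" attribute directly on each document. In more recent tm releases, document metadata is, as far as I know, read and written through meta() rather than through attributes, so the same idea might be sketched as below. This is a minimal sketch and not the original answerer's code: the use of VCorpus instead of Corpus, the "id" tag, and the doc_ids helper variable are assumptions made for illustration.

```r
library(tm)

# Same example data as in the question, built directly as a data frame
df <- data.frame(
  ids = c(1, 3, 5, 7, 8, 10),
  textColumn = c("text", "lots of text", "too much text",
                 "where will it end", "give peas a chance", "help"),
  stringsAsFactors = FALSE
)

# Assumption: VCorpus is used so that each element is a full document
# object carrying its own metadata
corpus <- VCorpus(VectorSource(df$textColumn))

# Copy the ids from the data frame into each document's "id" metadata field
for (i in seq_along(corpus)) {
  meta(corpus[[i]], "id") <- as.character(df$ids[i])
}

# Collect the ids back out of the corpus to check the assignment
doc_ids <- vapply(seq_along(corpus),
                  function(i) meta(corpus[[i]], "id"),
                  character(1))
```

Comparing doc_ids with as.character(df$ids) is a quick way to confirm that the ids line up. Newer tm versions also appear to provide DataframeSource, which expects columns named doc_id and text and would avoid the loop entirely; check the documentation of your installed version before relying on it.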