How to split text into two meaningful words in R

Updated: 2024-10-28 00:23:51

Problem description

My data frame df has a text column called 'problem_note_text'. Here is a sample of the text:

SSCIssue: Note Dispenser Failureperformed checks / dispensor failure / asked the stores to take the note dispensor out and set it back / still error message says front door is open / hence CE attn reqContact details - Olivia taber 01159063390 / 7am-11pm

    library(stringr)  # provides str_replace_all()

    df$problem_note_text <- tolower(df$problem_note_text)
    df$problem_note_text <- tm::removeNumbers(df$problem_note_text)
    # replace double spaces with a single space
    df$problem_note_text <- str_replace_all(df$problem_note_text, "  ", " ")
    # replace punctuation with a space
    df$problem_note_text <- str_replace_all(df$problem_note_text, pattern = "[[:punct:]]", " ")
    df$problem_note_text <- tm::removeWords(x = df$problem_note_text, tm::stopwords(kind = "english"))
    # all_words() is assumed to come from the qdap package
    Words <- qdap::all_words(df$problem_note_text, begins.with = NULL)

I now have a data frame containing a list of words, but it includes run-together words like

"Failureperformed"

which need to be split into two meaningful words, like

"Failure" "performed".

How do I do this? The words data frame also contains entries like

"im", "h"

which do not make sense and have to be removed. I do not know how to achieve this.

Recommended answer

Given a list of English words, you can do this fairly simply by checking every possible two-way split of the word against the list. I'll use the first Google hit I found for a word list, which contains about 70k lower-case words:

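    # Download the word list (one word per line, read into column V1)
    wl <- read.table("http://www-personal.umich.edu/~jlawler/wordlist")$V1

    check.word <- function(x, wl) {
      x <- tolower(x)
      nc <- nchar(x)
      # Column y of parts holds the two halves of x when cut after character y
      parts <- sapply(1:(nc-1), function(y) c(substr(x, 1, y), substr(x, y+1, nc)))
      # Keep only the splits where both halves appear in the word list
      parts[,parts[1,] %in% wl & parts[2,] %in% wl]
    }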
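To make the "every possible split" step concrete, here is a minimal sketch that only enumerates the candidate halves, without any dictionary lookup. The helper name `all_splits` is hypothetical and not part of the answer. It also guards against words shorter than two characters, where `1:(nc-1)` would count down from 1 to 0 rather than give an empty sequence.

```r
# Hypothetical helper (not part of the answer): list every two-way split of a word.
all_splits <- function(x) {
  nc <- nchar(x)
  # A word shorter than 2 characters has no split; return an empty 2-row matrix.
  if (nc < 2) return(matrix(character(0), nrow = 2))
  cuts <- seq_len(nc - 1)                  # cut positions 1 .. nc - 1
  rbind(substring(x, 1, cuts),             # row 1: left halves
        substring(x, cuts + 1, nc))        # row 2: right halves
}

parts <- all_splits("nowhere")
dim(parts)   # 2 rows (left half, right half), 6 columns (one per cut position)
parts[, 3]   # the cut after the third letter: "now" "here"
```

check.word builds the same kind of matrix and then keeps only the columns whose two entries are both found in the word list.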

This sometimes works:

    check.word("screenunable", wl)
    # [1] "screen" "unable"
    check.word("nowhere", wl)
    #      [,1]    [,2]
    # [1,] "no"    "now"
    # [2,] "where" "here"
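The "nowhere" result shows that a word can have more than one valid split, and such a word may also be a real word in its own right. The following is a rough sketch of one way to settle on a single answer. It makes two assumptions that are not in the answer above: a token already in the word list is left alone, and among several valid splits the one whose shorter half is longest wins. The helper `resolve_word` and the tiny `toy_wl` list are hypothetical stand-ins.

```r
# Tiny stand-in word list (assumption; the answer uses a ~70k-word list)
toy_wl <- c("screen", "unable", "no", "now", "where", "here", "nowhere")

# Hypothetical helper: return the token itself, one chosen split, or nothing.
resolve_word <- function(x, wl) {
  x <- tolower(x)
  if (x %in% wl) return(x)                     # already a known word: keep as is
  nc <- nchar(x)
  if (nc < 2) return(character(0))             # too short to split
  cuts <- seq_len(nc - 1)
  left <- substring(x, 1, cuts)
  right <- substring(x, cuts + 1, nc)
  keep <- which(left %in% wl & right %in% wl)  # cut positions with two known halves
  if (length(keep) == 0) return(character(0))  # no valid split found
  # Assumed tie-break: prefer the split whose shorter half is longest
  best <- keep[which.max(pmin(nchar(left[keep]), nchar(right[keep])))]
  c(left[best], right[best])
}

resolve_word("nowhere", toy_wl)                       # "nowhere" (known word)
resolve_word("nowhere", setdiff(toy_wl, "nowhere"))   # "now" "here"
resolve_word("screenunable", toy_wl)                  # "screen" "unable"
```

The tie-break is only a heuristic. With word-frequency data available, choosing the split with the most common halves would probably be a better rule.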

But it also sometimes fails when the relevant words aren't in the word list (in this case "sensor" was missing):

    check.word("sensoradvise", wl)
    #     
    # [1,]
    # [2,]
    "sensor" %in% wl
    # [1] FALSE
    "advise" %in% wl
    # [1] TRUE
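One possible way around the missing-word problem is to add domain-specific terms to the word list before looking anything up. The same lookup can also drop tokens such as "im" and "h" from the question, on the assumption that they are neither in the list nor splittable into listed words. The sketch below uses a small stand-in list; `base_wl`, `extra_terms` and `clean_tokens` are hypothetical names, and it takes the first valid split it finds.

```r
# Stand-in for the downloaded word list (assumption for illustration)
base_wl <- c("failure", "performed", "advise", "note", "door")
# Hypothetical domain vocabulary missing from the general list
extra_terms <- c("sensor", "dispenser")
wl_ext <- union(base_wl, extra_terms)

# Hypothetical helper: keep known words, split run-together words,
# and drop tokens that are neither.
clean_tokens <- function(tokens, wl) {
  out <- lapply(tolower(tokens), function(x) {
    if (x %in% wl) return(x)                    # known word: keep
    nc <- nchar(x)
    if (nc < 2) return(character(0))            # single letters not in wl: drop
    cuts <- seq_len(nc - 1)
    left <- substring(x, 1, cuts)
    right <- substring(x, cuts + 1, nc)
    hit <- which(left %in% wl & right %in% wl)
    if (length(hit) == 0) return(character(0))  # not splittable: drop
    c(left[hit[1]], right[hit[1]])              # first valid split
  })
  unlist(out)
}

cleaned <- clean_tokens(c("failureperformed", "sensoradvise", "im", "h", "door"), wl_ext)
cleaned
# "failure" "performed" "sensor" "advise" "door"
```

Dropping every token that is not in the list is aggressive: names, abbreviations and misspellings (such as "dispensor" in the sample text) disappear too, so it is worth checking what gets removed before relying on the result.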

Published: 2023-10-08 09:56:13

Link: https://www.elefans.com/category/jswz/34/1472285.html