Extract the words that differ between two sentences

Updated: 2024-10-11 11:16:25

I have a very large data frame with two columns called sentence1 and sentence2. I am trying to make a new column with the words that differ between two sentences, for example:

    sentence1 = c("This is sentence one", "This is sentence two", "This is sentence three")
    sentence2 = c("This is the sentence four", "This is the sentence five", "This is the sentence six")
    df = as.data.frame(cbind(sentence1, sentence2))

My data frame has the following structure:

    ID  sentence1               sentence2
    1   This is sentence one    This is the sentence four
    2   This is sentence two    This is the sentence five
    3   This is sentence three  This is the sentence six

And my expected result is:

    ID  sentence1    sentence2    Expected_Result
    1   This is ...  This is ...  one the four
    2   This is ...  This is ...  two the five
    3   This is ...  This is ...  three the six

In R, I tried to split the sentences and then get the elements that differ between the resulting lists, something like:

    df$split_Sentence1 <- strsplit(df$sentence1, split=" ")
    df$split_Sentence2 <- strsplit(df$sentence2, split=" ")
    df$Dif <- setdiff(df$split_Sentence1, df$split_Sentence2)

But this approach does not work when applying setdiff...

In Python, I tried NLTK, tokenizing each sentence first and then trying to extract the difference between the two token lists, something like:

    from nltk.tokenize import word_tokenize

    df['tokensS1'] = df.sentence1.apply(lambda x: word_tokenize(x))
    df['tokensS2'] = df.sentence2.apply(lambda x: word_tokenize(x))

At this point I cannot find a function that gives me the result I need.

I hope you can help me. Thanks

Best answer

Here's an R solution.

I've created an exclusiveWords function that finds the words appearing in only one of the two sentences and returns a 'sentence' made up of those words. I've wrapped it in Vectorize() so that it works on all rows of the data.frame at once.

    df = as.data.frame(cbind(sentence1, sentence2), stringsAsFactors = FALSE)

    exclusiveWords <- function(x, y){
      # Split each sentence into its words
      x <- strsplit(x, " ")[[1]]
      y <- strsplit(y, " ")[[1]]
      u <- union(x, y)
      # Words missing from x (only in y), then words missing from y (only in x)
      u <- union(setdiff(u, x), setdiff(u, y))
      return(paste0(u, collapse = " "))
    }

    # Vectorize so the function accepts whole columns
    exclusiveWords <- Vectorize(exclusiveWords)

    df$result <- exclusiveWords(df$sentence1, df$sentence2)
    df
    #                sentence1                 sentence2        result
    # 1   This is sentence one This is the sentence four  the four one
    # 2   This is sentence two This is the sentence five  the five two
    # 3 This is sentence three  This is the sentence six the six three
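For the Python side of the question, the same idea can be sketched with plain sets. This is not part of the original answer: the function name `exclusive_words` is hypothetical, it splits on whitespace instead of using NLTK's tokenizer, and it lists the words found only in the first sentence before those found only in the second, which matches the order in the asker's expected result instead of the order the R function produces.

```python
def exclusive_words(sentence1: str, sentence2: str) -> str:
    """Return the words that occur in only one of the two sentences."""
    # Assumption: whitespace splitting is good enough; swap in a tokenizer if not.
    words1 = sentence1.split()
    words2 = sentence2.split()
    set1, set2 = set(words1), set(words2)

    # dict.fromkeys keeps first-appearance order and drops repeated words.
    only_in_1 = [w for w in dict.fromkeys(words1) if w not in set2]
    only_in_2 = [w for w in dict.fromkeys(words2) if w not in set1]
    return " ".join(only_in_1 + only_in_2)


sentence1 = ["This is sentence one", "This is sentence two", "This is sentence three"]
sentence2 = ["This is the sentence four", "This is the sentence five", "This is the sentence six"]

# Apply the function pairwise, one row at a time.
results = [exclusive_words(a, b) for a, b in zip(sentence1, sentence2)]
print(results)  # ['one the four', 'two the five', 'three the six']
```

With a pandas data frame, the same list comprehension over `zip(df.sentence1, df.sentence2)` could be assigned to a new column. Like the R version, this treats each sentence as a set, so a word repeated within one sentence is reported at most once.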

Published: 2023-07-19 06:42:00
Source: https://www.elefans.com/category/jswz/34/1174830.html