Extract the words that differ between two sentences

I have a very large data frame with two columns called sentence1 and sentence2. I am trying to create a new column containing the words that differ between the two sentences in each row, for example:

    sentence1 = c("This is sentence one", "This is sentence two", "This is sentence three")
    sentence2 = c("This is the sentence four", "This is the sentence five", "This is the sentence six")
    df = as.data.frame(cbind(sentence1, sentence2))

My data frame has the following structure:

    ID  sentence1               sentence2
    1   This is sentence one    This is the sentence four
    2   This is sentence two    This is the sentence five
    3   This is sentence three  This is the sentence six

And my expected result is:

    ID  sentence1    sentence2    Expected_Result
    1   This is ...  This is ...  one the four
    2   This is ...  This is ...  two the five
    3   This is ...  This is ...  three the six

In R I tried splitting the sentences and then taking the elements that differ between the resulting lists, something like:

    df$split_Sentence1 <- strsplit(df$sentence1, split=" ")
    df$split_Sentence2 <- strsplit(df$sentence2, split=" ")
    df$Dif <- setdiff(df$split_Sentence1, df$split_Sentence2)

But this approach does not work when applying setdiff...

In Python I tried NLTK, tokenizing the sentences first with the aim of then extracting the difference between the two token lists, something like:

    from nltk.tokenize import word_tokenize

    df['tokensS1'] = df.sentence1.apply(lambda x: word_tokenize(x))
    df['tokensS2'] = df.sentence2.apply(lambda x: word_tokenize(x))

At this point I cannot find a function that gives me the result I need.

I hope you can help me. Thanks

Best answer

Here's an R solution.

I've created an exclusiveWords function that finds the words appearing in only one of the two sentences and returns a 'sentence' made up of those words. I've wrapped it in Vectorize() so that it works on all rows of the data.frame at once.

    # Keep the columns as character vectors so strsplit() can be applied
    df = as.data.frame(cbind(sentence1, sentence2), stringsAsFactors = FALSE)

    exclusiveWords <- function(x, y){
        x <- strsplit(x, " ")[[1]]
        y <- strsplit(y, " ")[[1]]
        u <- union(x, y)
        # Words missing from x, followed by words missing from y
        u <- union(setdiff(u, x), setdiff(u, y))
        return(paste0(u, collapse = " "))
    }

    exclusiveWords <- Vectorize(exclusiveWords)

    df$result <- exclusiveWords(df$sentence1, df$sentence2)
    df
    #                sentence1                 sentence2        result
    # 1   This is sentence one This is the sentence four  the four one
    # 2   This is sentence two This is the sentence five  the five two
    # 3 This is sentence three  This is the sentence six the six three
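Since the question also asks about Python, here is a minimal sketch of the same idea in plain Python. It is not part of the R answer above: the function name exclusive_words is made up for illustration, and it assumes that splitting on whitespace is an acceptable tokenization (no NLTK). Unlike the R version, it lists the words found only in the first sentence before those found only in the second, which happens to match the order in the expected result.

```python
def exclusive_words(sentence1, sentence2):
    """Return the words that appear in only one of the two sentences."""
    words1 = sentence1.split()
    words2 = sentence2.split()
    set1, set2 = set(words1), set(words2)
    # dict.fromkeys drops repeated words while keeping first-seen order
    only_in_1 = [w for w in dict.fromkeys(words1) if w not in set2]
    only_in_2 = [w for w in dict.fromkeys(words2) if w not in set1]
    return " ".join(only_in_1 + only_in_2)


sentence1 = ["This is sentence one", "This is sentence two", "This is sentence three"]
sentence2 = ["This is the sentence four", "This is the sentence five", "This is the sentence six"]

# Compare the sentences pairwise, one pair per row
results = [exclusive_words(a, b) for a, b in zip(sentence1, sentence2)]
print(results)
```

With a pandas data frame, the same list comprehension could be run over zip(df.sentence1, df.sentence2) and the resulting list assigned to a new column.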
