I have a very large data frame with two columns called sentence1 and sentence2. I am trying to make a new column containing the words that differ between the two sentences, for example:

    sentence1 = c("This is sentence one", "This is sentence two", "This is sentence three")
    sentence2 = c("This is the sentence four", "This is the sentence five", "This is the sentence six")
    df = as.data.frame(cbind(sentence1, sentence2))

My data frame has the following structure:

    ID  sentence1               sentence2
    1   This is sentence one    This is the sentence four
    2   This is sentence two    This is the sentence five
    3   This is sentence three  This is the sentence six

And my expected result is:

    ID  sentence1    sentence2    Expected_Result
    1   This is ...  This is ...  one the four
    2   This is ...  This is ...  two the five
    3   This is ...  This is ...  three the six

In R I was trying to split the sentences and then get the elements that differ between the lists, something like:

    df$split_Sentence1 <- strsplit(df$sentence1, split=" ")
    df$split_Sentence2 <- strsplit(df$sentence2, split=" ")
    df$Dif <- setdiff(df$split_Sentence1, df$split_Sentence2)

But this approach does not work when applying setdiff...

In Python I was trying to apply NLTK, getting the tokens first and then extracting the difference between the two lists, something like:

    from nltk.tokenize import word_tokenize
    df['tokensS1'] = df.sentence1.apply(lambda x: word_tokenize(x))
    df['tokensS2'] = df.sentence2.apply(lambda x: word_tokenize(x))

At this point I cannot find a function that gives me the result I need.

I hope you can help me. Thanks
Best answer
Here's an R solution.
I've created an exclusiveWords function that finds the unique words between the two sets, and returns a 'sentence' made up of those words. I've wrapped it in Vectorize() so that it works on all rows of the data.frame at once.
    # sentence1 and sentence2 are the character vectors defined in the question
    df = as.data.frame(cbind(sentence1, sentence2), stringsAsFactors = FALSE)

    exclusiveWords <- function(x, y){
        x <- strsplit(x, " ")[[1]]
        y <- strsplit(y, " ")[[1]]
        u <- union(x, y)
        # words found only in y, followed by words found only in x
        u <- union(setdiff(u, x), setdiff(u, y))
        return(paste0(u, collapse = " "))
    }

    # Vectorize() lets the function accept whole columns instead of single strings
    exclusiveWords <- Vectorize(exclusiveWords)

    df$result <- exclusiveWords(df$sentence1, df$sentence2)
    df
    #                sentence1                 sentence2        result
    # 1   This is sentence one This is the sentence four  the four one
    # 2   This is sentence two This is the sentence five  the five two
    # 3 This is sentence three This is the sentence six the six three
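Since the question also asks about Python, here is a rough sketch of the same idea in plain Python. It is not part of the original answer: the function name `exclusive_words` and the whitespace-only tokenization (instead of NLTK's `word_tokenize`) are assumptions made for illustration. Unlike the R version above, it lists the words unique to the first sentence before those unique to the second, which matches the ordering in the question's expected result.

```python
def exclusive_words(first: str, second: str) -> str:
    """Return the words that occur in only one of the two sentences.

    Hypothetical helper: splits on whitespace only, so punctuation
    stays attached to words and the comparison is case-sensitive.
    """
    first_words = first.split()
    second_words = second.split()
    first_set, second_set = set(first_words), set(second_words)

    # dict.fromkeys() drops repeated words while keeping first-seen order
    only_first = [w for w in dict.fromkeys(first_words) if w not in second_set]
    only_second = [w for w in dict.fromkeys(second_words) if w not in first_set]
    return " ".join(only_first + only_second)


# The example rows from the question
sentence1 = ["This is sentence one", "This is sentence two", "This is sentence three"]
sentence2 = ["This is the sentence four", "This is the sentence five", "This is the sentence six"]

results = [exclusive_words(a, b) for a, b in zip(sentence1, sentence2)]
print(results)
```

With a pandas data frame, the same list comprehension should work on the columns directly, for example `df['result'] = [exclusive_words(a, b) for a, b in zip(df.sentence1, df.sentence2)]`, assuming both columns hold plain strings.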