Merge function causing duplicates when NA is present

Merging gives me a spuriously large data frame, with duplicates induced by the NAs, even though the two constituent data frames have essentially identical content. What I'm after is a merged data frame where any missing ID gets its own row.

Here's a worked example with two nearly identical data frames, just with NA in different positions. In normal usage, these would be ID columns, with larger data frames associated with them.

    df1 <- c("LJUL1994I", "GMAY1994J", NA, "WJUN1994A")
    df2 <- c("LJUL1994I", NA, "GMAY1994J", "WJUN1994A")

What I would like is the matching to work like this:

    LJUL1994I LJUL1994I
    GMAY1994J GMAY1994J
    WJUN1994A WJUN1994A
    <NA>      <NA>

But, what I get is this...

    merge(df1,df2)
               x         y
    1  LJUL1994I LJUL1994I
    2  GMAY1994J LJUL1994I
    3       <NA> LJUL1994I
    4  WJUN1994A LJUL1994I
    5  LJUL1994I      <NA>
    6  GMAY1994J      <NA>
    7       <NA>      <NA>
    8  WJUN1994A      <NA>
    9  LJUL1994I GMAY1994J
    10 GMAY1994J GMAY1994J
    11      <NA> GMAY1994J
    12 WJUN1994A GMAY1994J
    13 LJUL1994I WJUN1994A
    14 GMAY1994J WJUN1994A
    15      <NA> WJUN1994A
    16 WJUN1994A WJUN1994A

I get the same output if I fiddle with the settings (e.g. all=TRUE, incomparables=NA).

Sorting and cbinding the dataframes is a brittle solution, as I want to extend this to situations where the ID columns differ in length, and may have differing numbers of NAs.

Base R solutions are preferred, but I'll take package-based solutions if they're more elegant.

Best answer

The reason your merge looks funny is that you are passing in character vectors rather than data.frames. Those character vectors are coerced to data.frames, and each resulting data.frame gets a different column name (x and y, as in your output). When you merge two data.frames with no overlapping column names, there is nothing to join on, so you get the Cartesian product (a cross join): every row of one paired with every row of the other, 4 x 4 = 16 rows here.
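A minimal sketch of that coercion, using the two vectors from the question. The names `ids1`, `ids2`, and `crossed` are illustrative only and do not appear in the original post.

```r
# The two ID vectors from the question (plain character vectors, not data frames).
ids1 <- c("LJUL1994I", "GMAY1994J", NA, "WJUN1994A")
ids2 <- c("LJUL1994I", NA, "GMAY1994J", "WJUN1994A")

# merge() coerces each vector to a one-column data frame; the columns end up
# named after merge()'s own arguments, "x" and "y".
crossed <- merge(ids1, ids2)
names(crossed)

# No column name is shared, so there is nothing to join on and every row of
# one input is paired with every row of the other.
nrow(crossed) == length(ids1) * length(ids2)
```

Giving both inputs a column with the same name is what lets merge() join on it instead of pairing every row with every other row.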

You can use merge here, but merge likes to collapse the shared column rather than duplicate it. Since you know the two columns match, you can copy the key column before merging. Here's one attempt:

    df1 <- data.frame(a = c("LJUL1994I", "GMAY1994J", NA, "WJUN1994A"))
    df2 <- data.frame(a = c("LJUL1994I", NA, "GMAY1994J", "WJUN1994A"))
    merge(df1, cbind(df2, b = df2$a), all = TRUE)
    #           a         b
    # 1 GMAY1994J GMAY1994J
    # 2 LJUL1994I LJUL1994I
    # 3 WJUN1994A WJUN1994A
    # 4      <NA>      <NA>
    # 5      <NA>      <NA>

Here I duplicated the "a" column as "b" in df2 before merging, so that the result has two columns.
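The same idea might extend to ID columns of different lengths, which the question asks about. The sketch below is an assumption-laden variation rather than part of the original answer: it keeps a copy of the key on both sides, and the column names `id`, `from_df1`, `from_df2` and the shortened second ID set are hypothetical.

```r
# Hypothetical inputs: the question's IDs, with one ID dropped from the second
# set so the two columns differ in length.
df1 <- data.frame(id = c("LJUL1994I", "GMAY1994J", NA, "WJUN1994A"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = c("LJUL1994I", NA, "GMAY1994J"),
                  stringsAsFactors = FALSE)

# Keep a copy of the key on each side so both copies survive the join.
df1$from_df1 <- df1$id
df2$from_df2 <- df2$id

# all = TRUE keeps IDs that appear in only one of the two inputs.
merged <- merge(df1, df2, by = "id", all = TRUE)

# Look only at rows with a real ID: an ID missing from df2 shows NA in from_df2.
known <- merged[!is.na(merged$id), ]
known
```

How rows with an NA key pair up depends on merge()'s matching rules (see its `incomparables` argument), so those rows are left out of the inspection here.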
