I am trying to work out how to filter some observations from a large dataset using dplyr and grepl. I am not wedded to grepl if other solutions would be more optimal.
Take this sample df:
    df1 <- data.frame(fruit = c("apple", "orange", "xapple", "xorange",
                                "applexx", "orangexx", "banxana", "appxxle"),
                      group = c("A", "B"))
    df1
    #      fruit group
    # 1    apple     A
    # 2   orange     B
    # 3   xapple     A
    # 4  xorange     B
    # 5  applexx     A
    # 6 orangexx     B
    # 7  banxana     A
    # 8  appxxle     B

I want to filter out the rows where fruit begins with 'x' or ends with 'xx'.
I have managed to work out how to get rid of everything that contains 'x' or 'xx' anywhere, but not how to restrict the match to the beginning or the end of the string. Here is how to get rid of everything with 'xx' inside (not just ending with it):

    df1 %>% filter(!grepl("xx", fruit))
    #     fruit group
    # 1   apple     A
    # 2  orange     B
    # 3  xapple     A
    # 4 xorange     B
    # 5 banxana     A

This obviously 'erroneously' (from my point of view) filtered out 'appxxle'.
I have never fully got to grips with regular expressions. I've been trying to modify code such as: grepl("^(?!x).*$", df1$fruit, perl = TRUE) to try and make it work within the filter command, but am not quite getting it.
Expected output:
    #     fruit group
    # 1   apple     A
    # 2  orange     B
    # 3 banxana     A
    # 4 appxxle     B

I'd like to do this inside dplyr if possible.
Solution

I didn't understand your second regex, but this more basic regex seems to do the trick:

    df1 %>% filter(!grepl("^x|xx$", fruit))
    #     fruit group
    # 1   apple     A
    # 2  orange     B
    # 3 banxana     A
    # 4 appxxle     B

And I assume you know this, but you don't have to use dplyr here at all:

    df1[!grepl("^x|xx$", df1$fruit), ]
    #     fruit group
    # 1   apple     A
    # 2  orange     B
    # 7 banxana     A
    # 8 appxxle     B

The regex is looking for strings that start with x OR end with xx. The ^ and $ are regex anchors for the beginning and end of the string respectively. | is the OR operator. We're negating the result of grepl with the !, so we keep the strings that don't match the regex.
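To unpack the pattern a bit more, here is a minimal base R sketch that tests the two halves of the alternation separately and compares them with two alternatives: fixed-string checks via startsWith() and endsWith(), and a lookaround pattern in the spirit of the one from the question. The variable names are only illustrative, stringsAsFactors = FALSE is my addition so that fruit is a character vector on older R versions too, and the lookaround pattern is my guess at what the question's regex was aiming for, not something taken from the original code.

```r
# Same sample data as in the question.
# Assumption: stringsAsFactors = FALSE added so startsWith()/endsWith()
# receive a character vector regardless of the R version.
df1 <- data.frame(
  fruit = c("apple", "orange", "xapple", "xorange",
            "applexx", "orangexx", "banxana", "appxxle"),
  group = c("A", "B"),
  stringsAsFactors = FALSE
)

# The two halves of "^x|xx$", tested on their own.
starts_with_x <- grepl("^x", df1$fruit)   # TRUE for "xapple", "xorange"
ends_with_xx  <- grepl("xx$", df1$fruit)  # TRUE for "applexx", "orangexx"

# The combined pattern from the answer: keep rows matching neither half.
keep_regex <- !grepl("^x|xx$", df1$fruit)

# Alternative without regular expressions: fixed prefix/suffix tests.
keep_fixed <- !(startsWith(df1$fruit, "x") | endsWith(df1$fruit, "xx"))

# Alternative with lookarounds (requires perl = TRUE):
#   ^(?!x)    the first character must not be "x"
#   (?<!xx)$  the string must not end in "xx"
keep_look <- grepl("^(?!x).*(?<!xx)$", df1$fruit, perl = TRUE)

# All three logical vectors select the same rows (1, 2, 7 and 8).
result <- df1[keep_fixed, ]
```

Any of these logical tests should also work inside dplyr, for example df1 %>% filter(!startsWith(fruit, "x"), !endsWith(fruit, "xx")), since filter() combines comma-separated conditions with AND.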