I am trying to work out how to filter some observations from a large dataset using dplyr and grepl. I am not wedded to grepl if another solution would work better.
Take this example df:
    df1 <- data.frame(fruit = c("apple", "orange", "xapple", "xorange",
                                "applexx", "orangexx", "banxana", "appxxle"),
                      group = c("A", "B"))
    df1

    #     fruit group
    #1    apple     A
    #2   orange     B
    #3   xapple     A
    #4  xorange     B
    #5  applexx     A
    #6 orangexx     B
    #7  banxana     A
    #8  appxxle     B

I want to filter out every row where fruit begins with 'x' or ends with 'xx'.
I have managed to work out how to get rid of everything that contains 'x' or 'xx', but not how to restrict this to strings that begin or end with it. Here is how to get rid of everything with 'xx' anywhere inside (not just at the end):
    df1 %>% filter(!grepl("xx", fruit))

    #    fruit group
    #1   apple     A
    #2  orange     B
    #3  xapple     A
    #4 xorange     B
    #5 banxana     A
This obviously 'erroneously' (from my point of view) filtered 'appxxle'.
I have never fully got to grips with regular expressions. I've been trying to modify code such as grepl("^(?!x).*$", df1$fruit, perl = TRUE) to make it work within the filter command, but I am not quite getting it.
Expected output:

    #    fruit group
    #1   apple     A
    #2  orange     B
    #3 banxana     A
    #4 appxxle     B
I'd like to do this inside dplyr if possible.
Recommended answer

I didn't understand your second regex, but this more basic regex seems to do the trick:
    df1 %>% filter(!grepl("^x|xx$", fruit))

    #     fruit group
    # 1   apple     A
    # 2  orange     B
    # 3 banxana     A
    # 4 appxxle     B
And I assume you know this, but you don't have to use dplyr here at all:
    df1[!grepl("^x|xx$", df1$fruit), ]

    #     fruit group
    # 1   apple     A
    # 2  orange     B
    # 7 banxana     A
    # 8 appxxle     B
The regex is looking for strings that start with x OR end with xx. The ^ and $ are regex anchors for the beginning and ending of the string respectively. | is the OR operator. We're negating the results of grepl with the ! so we're finding strings that don't match what's inside the regex.
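To make the pieces of that pattern easier to see, here is a small sketch that checks each half of the alternation separately and confirms that together they behave like the combined regex. It also shows two alternatives that are not part of the answer above and are offered only as options: a lookahead pattern in the style of the one attempted in the question, and a regex-free version using base R's startsWith() and endsWith(). The names starts_x, ends_xx, combined, lookahead and kept are made up for this illustration, and stringsAsFactors = FALSE is added on the assumption that you may be on an R version older than 4.0, where fruit would otherwise be a factor.

```r
library(dplyr)

df1 <- data.frame(
  fruit = c("apple", "orange", "xapple", "xorange",
            "applexx", "orangexx", "banxana", "appxxle"),
  group = c("A", "B"),
  stringsAsFactors = FALSE  # assumption: keep fruit as character on R < 4.0
)

# Each half of the alternation on its own
starts_x <- grepl("^x", df1$fruit)   # TRUE where fruit begins with "x"
ends_xx  <- grepl("xx$", df1$fruit)  # TRUE where fruit ends with "xx"

# The combined pattern flags exactly the rows flagged by either half
combined <- grepl("^x|xx$", df1$fruit)
identical(combined, starts_x | ends_xx)

# Lookahead variant: matches strings that neither start with "x"
# nor end with "xx", so no negation with ! is needed
lookahead <- grepl("^(?!x)(?!.*xx$)", df1$fruit, perl = TRUE)
identical(lookahead, !combined)

# Regex-free alternative inside dplyr
kept <- df1 %>%
  filter(!startsWith(fruit, "x"), !endsWith(fruit, "xx"))
kept$fruit
# "apple" "orange" "banxana" "appxxle"
```

Passing several conditions to filter() combines them with AND, which is why the two negated checks give the same rows as negating the single OR pattern.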