How to replace square brackets with curly brackets using R's regex?

Updated: 2024-10-25 23:26:05

Problem description

Because of conversions between pandoc-citeproc and LaTeX, I'd like to replace this

[@Fotheringham1981]

with this

\cite{Fotheringham1981}.

The reproducible example below shows what goes wrong when each bracket is handled separately: the closing bracket of ordinary indexing code is replaced as well.

x <- c("[@Fotheringham1981]", "df[1,2]")
x1 <- gsub("\\[@", "\\\\cite{", x)
x2 <- gsub("\\]", "\\}", x1)
x2[1] # good
## [1] "\\cite{Fotheringham1981}"
x2[2] # bad
## [1] "df[1,2}"

I have seen a similar issue solved for C#, but not with R's Perl-style regex. Any ideas?

It should be able to handle long documents, e.g.

old_rmd <- "$p = \alpha e^{\beta d}$ [@Wilson1971] and $p = \alpha d^{\beta}$ [@Fotheringham1981]."
new_rmd1 <- gsub("\\[@([^\\]]*)\\]", "\\\\cite{\\1}", old_rmd, perl = T)
new_rmd2 <- gsub("\\[@([^]]*)]", "\\\\cite{\\1}", old_rmd)
new_rmd1
## "$p = \alpha e^{\beta d}$ \\cite{Wilson1971} and $p = \alpha d^{\beta}$\n \\cite{Fotheringham1981}."
new_rmd2
## [1] "$p = \alpha e^{\beta d}$ \\cite{Wilson1971} and $p = \alpha d^{\beta}$\n\\cite{Fotheringham1981}."

Recommended answer

You can use

gsub("\\[@([^]]*)]", "\\\\cite{\\1}", x)

See the IDEONE demo.

Regex breakdown:

\\[@ - a literal [@ symbol sequence
([^]]*) - a capture group 1 that matches 0 or more occurrences of any symbol but a ] (note that if ] appears at the beginning of a character class, it does not need escaping)
] - a literal ] symbol
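To make the breakdown concrete, here is a minimal sketch that applies the pattern to the two test strings from the question. The variable name res is only illustrative.

```r
# The two inputs from the question: a citation and ordinary indexing code.
x <- c("[@Fotheringham1981]", "df[1,2]")

# "\\[@"     literal "[@"
# "([^]]*)"  capture group 1: everything up to the next "]"
# "]"        literal "]"
res <- gsub("\\[@([^]]*)]", "\\\\cite{\\1}", x)

# cat() shows the actual characters (a single backslash before "cite").
cat(res, sep = "\n")
# \cite{Fotheringham1981}
# df[1,2]
```

Because the pattern requires the leading [@, the indexing expression df[1,2] is left untouched.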

You do not need perl=T with this pattern because the ] inside the character class is not escaped. If you do escape it, as in [^\\]], the pattern needs perl=T: in the default TRE engine a backslash inside a bracket expression is treated as a literal character, not as an escape.
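The difference between the two spellings can be checked directly. The sketch below compares the unescaped pattern under the default engine with the escaped pattern under perl = TRUE, then runs the escaped pattern without perl = TRUE so you can inspect the result on your own installation. The variable names are illustrative.

```r
x <- c("[@Fotheringham1981]", "df[1,2]")

# Unescaped "]" at the start of the class, default (TRE) engine.
unescaped <- gsub("\\[@([^]]*)]", "\\\\cite{\\1}", x)

# Escaped "]" inside the class, PCRE engine.
escaped_pcre <- gsub("\\[@([^\\]]*)\\]", "\\\\cite{\\1}", x, perl = TRUE)

# Both spellings give the same result.
print(identical(unescaped, escaped_pcre))
# [1] TRUE

# The escaped spelling under the default engine: print it and compare
# it with the two results above.
escaped_tre <- gsub("\\[@([^\\]]*)\\]", "\\\\cite{\\1}", x)
print(escaped_tre)
```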

Also, I believe we should escape only what must be escaped. If there is a way to avoid backslash hell, we should take it. Thus, you can even use

gsub("[[]@([^]]*)]", "\\\\cite{\\1}", x)

Here is another demo.
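As a quick check, the following sketch confirms that the bracket-expression spelling [[] and the backslash spelling \\[ produce the same output on the question's test vector. The variable names are illustrative.

```r
x <- c("[@Fotheringham1981]", "df[1,2]")

# Literal "[" written with a backslash escape.
with_backslash <- gsub("\\[@([^]]*)]", "\\\\cite{\\1}", x)

# Literal "[" written as a one-character bracket expression.
with_class <- gsub("[[]@([^]]*)]", "\\\\cite{\\1}", x)

print(identical(with_backslash, with_class))
# [1] TRUE
```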

Why a TRE-based regex seems preferable to PCRE here:

In R 2.10.0 and later, the default regex engine is a modified version of Ville Laurikari's TRE engine [source]. The library's author states that the time spent matching grows linearly with the length of the input text, while memory requirements stay almost constant (tens of kilobytes). The matching algorithm used in TRE is also said to have predictable and modest memory consumption and a quadratic worst-case time in the length of the regular expression. That is why it seems best to rely on TRE rather than PCRE when dealing with larger documents.
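Whether this holds for a particular document is easy to measure. The sketch below builds a synthetic long document (the paragraph text and the repeat count of 5000 are arbitrary assumptions) and times the same substitution under both engines with system.time. It does not assert which engine is faster; timings depend on the machine, the R version, and the input.

```r
# Synthetic test document: one paragraph repeated many times.
paragraph <- "Distance decay [@Wilson1971] and a power law [@Fotheringham1981]."
long_doc <- paste(rep(paragraph, 5000), collapse = "\n")

pattern <- "\\[@([^]]*)]"
replacement <- "\\\\cite{\\1}"

# Default engine (TRE).
t_tre <- system.time(out_tre <- gsub(pattern, replacement, long_doc))

# PCRE engine.
t_pcre <- system.time(out_pcre <- gsub(pattern, replacement, long_doc, perl = TRUE))

# Both engines should produce the same text.
print(identical(out_tre, out_pcre))

cat("TRE elapsed (s): ", t_tre[["elapsed"]], "\n")
cat("PCRE elapsed (s):", t_pcre[["elapsed"]], "\n")
```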