Regular expression to parse combined log with strange user agent string

Updated: 2024-10-25 08:15:40
I use the regular expression below to parse the combined log format:

^(?P<client>\S+) (?P<identd>\S+) (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] "(?P<method>[A-Z]+) (?P<request>[^ "]+)? (?P<version>HTTP/[0-9.]+)" (?P<status>[0-9]{3}) (?P<size>[0-9]+|-) "(?P<referrer>[^"]*)" "{1,3}(?P<useragent>[^"]*)"{1,3} "(?P<cookie>[^"]*)"

It works fine for most of the log, but over time more and more user agents have shown up with a stray " inside, which makes the regular expression fail. I have copied some of the problematic user agents below. I would be glad if someone could help fix my regular expression so that it also handles these cases:

"Mozilla/5.0 (Linux; U; Android 4.1.1; tr-tr; PIRANHA BUSINESS TAB 7"" Build/JRO03C) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Safari/534.30 GSA/2.0.0.392829"

"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; xhcueef7$#$% fjidf87jcnuFfFJH6@@jjfidjcu%09348%""=""IEAK)"

Best answer

It looks like quotes are escaped by doubling them. That is, if a " character is immediately followed by another ", the pair should be treated as a single " in the value of the user agent.

Try changing this:

"{1,3}(?P<useragent>[^"]*)"{1,3}

To this:

"(?P<useragent>(?:[^"]""|[^"])*)"

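To check the change end to end, here is a minimal Python sketch that puts the new user-agent fragment into the full pattern from the question and runs it against the two problematic user agents. The rest of the log line (IP address, timestamp, request path, status, size) is made up for illustration, and make_line and parse_line are hypothetical helper names, not part of the original question or answer.

```python
import re

# Everything around the user-agent field is copied from the question's pattern.
PREFIX = (
    r'^(?P<client>\S+) (?P<identd>\S+) (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<request>[^ "]+)? (?P<version>HTTP/[0-9.]+)" '
    r'(?P<status>[0-9]{3}) (?P<size>[0-9]+|-) "(?P<referrer>[^"]*)" '
)
SUFFIX = r' "(?P<cookie>[^"]*)"'

OLD_UA = r'"{1,3}(?P<useragent>[^"]*)"{1,3}'        # fragment from the question
NEW_UA = r'"(?P<useragent>(?:[^"]""|[^"])*)"'       # fragment from the answer

OLD_RE = re.compile(PREFIX + OLD_UA + SUFFIX)
NEW_RE = re.compile(PREFIX + NEW_UA + SUFFIX)

# The two problematic user agents quoted in the question (outer quotes removed).
UA_ANDROID = (
    'Mozilla/5.0 (Linux; U; Android 4.1.1; tr-tr; PIRANHA BUSINESS TAB 7"" '
    'Build/JRO03C) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 '
    'Safari/534.30 GSA/2.0.0.392829'
)
UA_MSIE = (
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; '
    '.NET CLR 1.1.4322; .NET CLR 2.0.50727; '
    'xhcueef7$#$% fjidf87jcnuFfFJH6@@jjfidjcu%09348%""=""IEAK)'
)


def make_line(user_agent):
    """Build a sample log line; every field except the user agent is invented."""
    return (
        '192.0.2.10 - - [29/Jul/2023:20:13:00 +0000] '
        '"GET /index.html HTTP/1.1" 200 512 "-" '
        f'"{user_agent}" "-"'
    )


def parse_line(line):
    """Return the named fields as a dict, or None if the line does not match."""
    match = NEW_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Collapse each doubled quote back into the single quote it stands for.
    fields['useragent'] = fields['useragent'].replace('""', '"')
    return fields


for ua in (UA_ANDROID, UA_MSIE):
    line = make_line(ua)
    print('old pattern matches:', OLD_RE.match(line) is not None)
    print('new user agent     :', parse_line(line)['useragent'])
```

One limitation to keep in mind: the alternative [^"]"" only recognizes a doubled quote that directly follows a non-quote character, so a user agent that begins with "" would not match. A pattern such as (?:""|[^"])* is a common way to cover that case, though it has not been checked against the real log here.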

Published: 2023-07-29 20:13:00
Link: https://www.elefans.com/category/jswz/34/1319428.html