I use the regular expression below to parse the combined log format:
^(?P<client>\S+) (?P<identd>\S+) (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] "(?P<method>[A-Z]+) (?P<request>[^ "]+)? (?P<version>HTTP/[0-9.]+)" (?P<status>[0-9]{3}) (?P<size>[0-9]+|-) "(?P<referrer>[^"]*)" "{1,3}(?P<useragent>[^"]*)"{1,3} "(?P<cookie>[^"]*)"
It works fine for most of the log, but over time more and more user agents have turned up with stray " characters inside them, which makes the regular expression fail. I have copied some of the problematic user agents below. I would be glad if someone could help fix my regular expression so that it also handles these cases:
"Mozilla/5.0 (Linux; U; Android 4.1.1; tr-tr; PIRANHA BUSINESS TAB 7"" Build/JRO03C) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Safari/534.30 GSA/2.0.0.392829"
"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; xhcueef7$#$% fjidf87jcnuFfFJH6@@jjfidjcu%09348%""=""IEAK)"
Best answer
It looks like quotes are escaped with quotes, i.e. if a " character is immediately followed by another ", the pair should be treated as a single " in the value of the user agent.
Try changing this:
"{1,3}(?P<useragent>[^"]*)"{1,3}
To this:
"(?P<useragent>(?:[^"]""|[^"])*)"
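To check the suggested change, here is a minimal Python sketch that drops the new user-agent group into the pattern from the question and runs it against the two problematic user agents. Everything around the user agent in the sample lines (the 192.0.2.1 address, timestamp, request, status, size, referrer and cookie) is made up for illustration, as are the helper names make_line and parse_useragent. Turning the doubled "" back into a single " afterwards is an assumption based on the reading that quotes are escaped with quotes.

```python
import re

# Pattern from the question, with only the user-agent group replaced
# as suggested in the answer.
LOG_PATTERN = re.compile(
    r'^(?P<client>\S+) (?P<identd>\S+) (?P<userid>\S+) '
    r'\[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<request>[^ "]+)? (?P<version>HTTP/[0-9.]+)" '
    r'(?P<status>[0-9]{3}) (?P<size>[0-9]+|-) '
    r'"(?P<referrer>[^"]*)" '
    r'"(?P<useragent>(?:[^"]""|[^"])*)" '
    r'"(?P<cookie>[^"]*)"'
)

# The two problematic user agents quoted in the question.
UA1 = ('Mozilla/5.0 (Linux; U; Android 4.1.1; tr-tr; PIRANHA BUSINESS TAB 7"" '
       'Build/JRO03C) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 '
       'Safari/534.30 GSA/2.0.0.392829')
UA2 = ('Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; '
       '.NET CLR 1.1.4322; .NET CLR 2.0.50727; '
       'xhcueef7$#$% fjidf87jcnuFfFJH6@@jjfidjcu%09348%""=""IEAK)')


def make_line(useragent):
    """Build a log line around a user agent (hypothetical surrounding fields)."""
    return ('192.0.2.1 - - [10/Oct/2013:13:55:36 +0000] '
            '"GET /index.html HTTP/1.1" 200 2326 "-" '
            f'"{useragent}" "-"')


def parse_useragent(line):
    """Return the user agent with doubled quotes unescaped, or None if no match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    # Assumption: "" inside the field stands for one literal " character.
    return match.group('useragent').replace('""', '"')


if __name__ == '__main__':
    for ua in (UA1, UA2):
        print(parse_useragent(make_line(ua)))
```

One caveat: because the first alternative pairs the doubled quote with the character before it, a user agent that starts with "" would still fail to match. If that ever shows up in the log, (?:""|[^"])* is a possible alternative for the inner group.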