Parse email header from text using PCRE regular expression

I need to parse (split) a text file containing emails exported from Outlook. I am splitting it using preg_split with the PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE flags.

My goal is to capture the message header section with the regular expression, i.e. starting from the "From:" line and ending at the blank line before the message body.

Constraints:

- Multilingual field names expected
- Number of header fields varies (CC, BCC, Attachments)
- Some fields may be on more than one line (To, CC, BCC, Subject, Attachments)

The text file is pre-processed: runs of spaces and tabs are replaced with a single space, and leading and trailing spaces are removed.
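The pre-processing step described above might look roughly like the sketch below. The helper name clean_export and the exact rules are assumptions for illustration, not the asker's code, and the sketch assumes LF line endings (with CRLF input, a space before the "\r" would not be stripped).

```php
<?php
// Hypothetical helper: name and rules are assumptions, not the asker's code.
function clean_export(string $raw): string
{
    // Collapse each run of spaces and tabs into a single space.
    $text = preg_replace('/[ \t]+/', ' ', $raw);
    // Drop the single space left at the start or end of each line
    // (the m flag makes ^ and $ match at every line boundary).
    return preg_replace('/^ | $/m', '', $text);
}

$raw = "  From:\tBlack,   Jack (LA)  \nSent:  Monday \n \nDear George,";
echo clean_export($raw);
```

Note that a line holding only spaces becomes truly empty here, which matters later because the header pattern relies on a blank line to mark the end of the header block.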

I have been at it the whole day and cannot get the last part to work. It works on the gskinner regex testing page (http://regexr.com?36v27), but not in PHP.

Subject:

From: Black, Jack (LA)
Sent: Monday, October 28, 2013 6:36 PM
To: George, Jackson (London); DCS.CC.DARWIN (Australia)
Cc: Bar, Foo (Istanbul); Ex, Reg (Istanbul); Smith, John (Istanbul); Rambo, John J. (Gaziantep); Matrix, John (Phuket)
Subject: RE: PREVENTIVE AND CORRECTIVE ACTIONS / FOOBAR

Dear George, venenatis imperdiet quam. Proin a egestas nunc, et mattis elit. In hac habitasse platea dictumst. Nulla dolor nibh, tempus ut neque eu, tempus fermentum mauris. Mauris nec ipsum nec sapien commodo scelerisque ut eu urna. Pellentesque eu neque in enim adipiscing faucibus. Sed interdum arcu et sem mollis iaculis. Duis euismod laoreet ligula lacinia dapibus. Vestibulum ullamcorper malesuada metus at malesuada. Nullam enim elit, auctor vehicula orci eget, imperdiet feugiat odio. Etiam dapibus sagittis sem a varius. Nulla sit amet convallis mi, sit amet rutrum ipsum. In libero lectus, mattis at dui eu. Thank you and best regards, Jack B. Black (Mr) Operations Manager (GGD) FU Supervisor (R34, R57) Phone: +1112212212 (local 1111) Mobile: +12 121.111.11.12

From: George, Jackson (UK)
Sent: Monday, October 28, 2013 5:57 PM
To: DCS.CC.DARWIN (Australia) Bar, Foo (Istanbul); Ex, Reg (Istanbul); Smith, John (Istanbul); Rambo, John J. (Gaziantep); Matrix, John (Phuket)
Subject: PREVENTIVE AND CORRECTIVE ACTIONS / FOOBAR

Dear Colleagues, ermentum. Duis ipsum quam, bibendum a risus nec, tincidunt fringilla lectus. Nunc vel dictum massa, et cursus nunc. Mauris tincidunt felis eget justo congue volutpat. Nulla condimentum accumsan elementum. Integer commodo, lorem eu pharetra suscipit, ligula. Best Regards. SDFD srfgGD Field coordinator (GGD) Customer Representative sds dfsd sdfgsef sdfsd sgzdfgdfg fgfg gdfg Footer text etc sdfdfdf dfgsdfgsdfgsdfg Phone : +90 212 368 40 00 (ext:3814)

Regex:

preg_match(
    '/                          # delimiter
    (                           # capturing group start
    [\ A-Z][a-z]+:.+$.+$\R      # From: field
    [A-Z][a-z]+:.+\R            # Sent: fields
    [A-Z][a-z]+:.+\R            # To: field (1st line)
    (?:.+\R)+                   # any additional header lines, before blank line (To, CC, BCC, Subject, Attachments)
    )                           # capturing group end
                                # delimiter + modifiers
    /x', $text_clean, $matches);
echo '<b>Matches: '.count($matches).'</b>';
print_r($matches);

I am having a problem getting the additional header lines:

(?:.+\R)+ # any additional header lines...

Any help is appreciated.

Best answer

The simplest approach is to use preg_match_all with a lazy quantifier:

preg_match_all('/^From.*?\R\R/ims', $mails, $matches);
print_r($matches);
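As a rough illustration of how that pattern behaves, here is a minimal sketch run against a shortened stand-in for the export. The sample text, the LF line endings and the $headers variable are assumptions made for the example.

```php
<?php
// Shortened stand-in for the export, assuming LF line endings and a truly
// empty line between each header block and its body.
$mails = "From: Black, Jack (LA)\n"
       . "Sent: Monday, October 28, 2013 6:36 PM\n"
       . "To: George, Jackson (London);\n"
       . "DCS.CC.DARWIN (Australia)\n"
       . "Subject: RE: PREVENTIVE AND CORRECTIVE ACTIONS / FOOBAR\n"
       . "\n"
       . "Dear George,\n"
       . "Thank you and best regards,\n"
       . "\n"
       . "From: George, Jackson (UK)\n"
       . "Sent: Monday, October 28, 2013 5:57 PM\n"
       . "To: DCS.CC.DARWIN (Australia)\n"
       . "Subject: PREVENTIVE AND CORRECTIVE ACTIONS / FOOBAR\n"
       . "\n"
       . "Dear Colleagues,\n";

// m: ^ matches at every line start; s: the dot also matches newlines;
// .*? is lazy, so each match ends at the first pair of consecutive line breaks.
preg_match_all('/^From.*?\R\R/ims', $mails, $matches);

// Trim the two trailing line breaks from each header block.
$headers = array_map('rtrim', $matches[0]);
print_r($headers);
```

Two caveats with this sketch: a body line that happens to begin with "From" would also start a match, and the pattern depends on the separator line being completely empty.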

Thanks everyone for the input; however, I figured it out using my own method. A few points still confuse me, but the working solution is further below.

Why does preg_match return the first result twice instead of two matches? (http://www.ideone.com/Xj6aaF)

(?:.+\R)+ The dot seems to match any character AND NO CHARACTERS, which is why it kept missing the blank lines. I find that strange: isn't + supposed to be the one-or-more quantifier?
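One possible explanation, offered as an assumption since the thread does not confirm the file's line endings: + does require at least one character, but if the export uses Windows (CRLF) line endings, a "blank" line still contains a "\r". Under PCRE's usual LF newline convention the dot matches "\r", so .+ is satisfied by it and \R then consumes the "\n". A minimal sketch:

```php
<?php
// Assumption: CRLF line endings, so an "empty" line is "\r\n", not just "\n".
$crlf = "To: a\r\nSubject: b\r\n\r\nBody text\r\n";

// With the LF newline convention (assumed here), "." accepts a lone "\r".
$dotMatchesCr = preg_match('/./', "\r");

// .+ takes the "\r" of the blank line and \R takes the "\n", so the
// repetition runs through the blank line and on into the body.
preg_match('/(?:.+\R)+/', $crlf, $loose);

// \S rejects "\r" because it is whitespace, so the repetition stops
// at the blank line.
preg_match('/(?:\S.+\R)+/', $crlf, $strict);

var_dump($dotMatchesCr, $loose[0] === $crlf, $strict[0]);
```

A leftover space on the separator line would have the same effect on .+, which is another reason a pattern beginning with \S behaves differently.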

Anyway, when I changed my regex pattern to (?:\S.+\R)+, it did what I wanted using preg_split.
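The final pattern is not shown in full, so the following is only a sketch of how such a split could work. The pattern is a simplified reconstruction, and the sample text and variable names are assumptions.

```php
<?php
// Illustrative sample with LF line endings; not the asker's real data.
$text = "From: Black, Jack (LA)\n"
      . "Sent: Monday\n"
      . "To: George, Jackson (London);\n"
      . "DCS.CC.DARWIN (Australia)\n"
      . "Subject: RE: FOOBAR\n"
      . "\n"
      . "Dear George,\nbody one\n"
      . "\n"
      . "From: George, Jackson (UK)\n"
      . "Sent: Sunday\n"
      . "To: DCS.CC.DARWIN (Australia)\n"
      . "Subject: FOOBAR\n"
      . "\n"
      . "Dear Colleagues,\nbody two\n";

// Simplified reconstruction of the header pattern (an assumption).
$pattern = '/
    ^(                      # capture the header block so preg_split returns it
    [A-Z][a-z]+:.+\R        # From: field
    [A-Z][a-z]+:.+\R        # Sent: field
    [A-Z][a-z]+:.+\R        # To: field, first line
    (?:\S.+\R)+             # further header lines, up to the blank line
    )
/mx';

// DELIM_CAPTURE keeps the captured header blocks in the result;
// NO_EMPTY drops the empty piece before the first header.
$parts = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
print_r($parts);
```

With this input the result alternates header block, body, header block, body. Because the field names are matched generically, three consecutive "Word:" lines inside a body (a signature with several "Phone:"-style lines, for example) could be mistaken for a header.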


Though technically my problem is solved, I would love for someone to explain the two points above.
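On the first point, a likely reading (the linked snippet could not be checked here) is that preg_match stops after the first match and fills its array with the whole match at index 0 followed by each capturing group. When a single group wraps the entire pattern, indexes 0 and 1 hold the same text, so count($matches) is 2 even though only one match was found. A minimal sketch with an assumed, simplified pattern:

```php
<?php
// Illustrative sample with LF line endings.
$text = "From: A\nTo: B\n\nbody\n\nFrom: C\nTo: D\n\nbody\n";

// preg_match: first match only. $one[0] is the full match and $one[1] is
// capturing group 1, which spans the whole pattern, so both are identical.
preg_match('/(From:.+\R(?:\S.+\R)+)/', $text, $one);

// preg_match_all: every match. $all[0] lists the full matches and
// $all[1] lists what group 1 captured in each of them.
preg_match_all('/(From:.+\R(?:\S.+\R)+)/', $text, $all);

print_r($one);
print_r($all[0]);
```

Counting count($all[0]) rather than count($matches) gives the number of header blocks found.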
