Regex for capturing smallest group

Updated: 2024-10-23 22:34:04

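I am trying to capture the ID of a PDF Page object that looks like this:

4 0 obj << /Type /Page / ... >> endobj

The ID is the number at the start of the declaration, the 'ID' in 'ID 0 obj'. The problem is that my file has multiple objects, so the following pattern matches from the first object declaration up to the first instance of a Page object:

preg_match_all("/([0-9]+) 0 obj.+?\/Page[ \n]*?\//s", $input_lines, $output_array);

Here is a sample of my file if you want to try it out; you will see that multiple objects include the word 'Page':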

%PDF-1.3
%¦¦¦¦
1 0 obj
<< /Type /Catalog /AcroForm << /Fields [12 0 R 13 0 R] /NeedAppearances false /SigFlags 3 /Version /1.7 /Pages 3 0 R /Names << >> /ViewerPreferences << /Direction /L2R >> /PageLayout /SinglePage /PageMode /UseNone /OpenAction [0 0 R /FitH null] /DR << /Font << /F1 14 0 R >> >> /DA (/F1 0 Tf 0 g) /Q 0 >> /Perms << /DocMDP 11 0 R >> /Outlines 2 0 R /Pages 3 0 R >>
endobj
2 0 obj
<< /Type /Outlines /Count 0 >>
endobj
3 0 obj
<< /Type /Pages /Count 2 /Kids [ 4 0 R 6 0 R ] >>
endobj
4 0 obj
<< /Type /Page /Parent 3 0 R /Resources << /Font << /F1 9 0 R >> /ProcSet 8 0 R >> /MediaBox [0 0 612.0000 792.0000] /Contents 5 0 R >>
endobj
5 0 obj
<< /Length 1074 >>
stream
2 J
BT
0 0 0 rg
/F1 0027 Tf
57.3750 722.2800 Td
( A Simple PDF File ) Tj
ET
BT
/F1 0010 Tf

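What should I change to make it not greedy?

EDIT: Clarifications

I forgot to mention that I need to capture all of the Page object IDs. Since some people told me to use a more specific regex, I should say that the sample is not a formal specification of how objects are built, and the form below is also possible. As you can see, the spaces are not mandatory, and there can be multiple tags before the '/Type /Page' tag.

Example:

4 0 obj << /UselessTag/Type/Page/ ... >> endobj

There are also tags called Pages, PageLayout and SinglePage, and I don't want to capture them.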

Best answer

You may use

'~^(\d+) 0 obj(?:(?!^\d+ 0 obj$).)*?\/Type\s*\/Page\s.*?endobj$~sm'

See the regex demo
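
As a rough illustration, the sketch below runs both the pattern from the question and the pattern from this answer against a shortened sample. The sample layout (every object declaration and every endobj on its own line, LF line endings) is an assumption, and the variable names are made up for the example. It also shows why making the quantifier lazy is not enough: a lazy quantifier only shortens a match at its right end, while the match still starts at the leftmost 'N 0 obj' it can find.

```php
<?php
// Shortened, hypothetical sample; the line layout is an assumption.
$pdf = <<<'PDF'
1 0 obj
<<
/Type /Catalog
/Pages 3 0 R
/PageLayout /SinglePage
>>
endobj
3 0 obj
<<
/Type /Pages
/Count 2
/Kids [ 4 0 R 6 0 R ]
>>
endobj
4 0 obj
<<
/Type /Page
/Parent 3 0 R
/Contents 5 0 R
>>
endobj
5 0 obj
<<
/Length 0
>>
endobj
6 0 obj
<<
/Type /Page
/Parent 3 0 R
>>
endobj
PDF;

// Pattern from the question: lazy, but free to run across object boundaries.
$naive = '/([0-9]+) 0 obj.+?\/Page[ \n]*?\//s';
preg_match_all($naive, $pdf, $m);
$naiveIds = $m[1];   // IDs of the objects where each match started

// Pattern from the answer: the tempered token refuses to cross a new
// "N 0 obj" line, so each match stays inside a single object.
$pattern = '~^(\d+) 0 obj(?:(?!^\d+ 0 obj$).)*?\/Type\s*\/Page\s.*?endobj$~sm';
preg_match_all($pattern, $pdf, $m);
$pageIds = $m[1];

print_r($naiveIds);  // 1 and 5: the wrong objects
print_r($pageIds);   // 4 and 6: the two Page objects
```

One caveat, stated as an assumption about the input: with \r\n line endings, $ does not match before the \r under PCRE's usual default newline setting, so normalizing line endings first (for example with str_replace("\r\n", "\n", $pdf)) may be needed.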

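Details:

^ - start-of-line anchor (the m modifier makes ^ match at the start of each line, not only at the start of the whole string)
(\d+) 0 obj - 1 or more digits (captured into Group 1), then a space, 0, a space and the substring obj
(?:(?!^\d+ 0 obj$).)*? - a tempered greedy token that matches any char (.) that does not start a ^\d+ 0 obj$ pattern, as few times as possible
\/Type\s*\/Page\s - /Type, 0+ whitespace chars (replace \s with \h to match only horizontal whitespace), /Page and then one whitespace char
.*? - any 0+ chars (including newlines, because of the s modifier), as few as possible, up to the first occurrence of
endobj - the literal endobj that is followed by
$ - the end of a line.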
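
The pattern requires a whitespace character after /Page, so it would skip the compact form from the question's edit ('/UselessTag/Type/Page/ ...'), where /Page is followed directly by another slash. A possible variant, which is not part of the original answer, replaces that \s with the negative lookahead (?![A-Za-z]). This still rejects /Pages, /PageLayout and /PageMode. Treating "not followed by an ASCII letter" as the end of a name is a simplification, and the sample below is hypothetical.

```php
<?php
// Hypothetical sample: object 4 uses the compact form from the question,
// object 6 uses the spaced form; object 3 is a /Pages node, not a page.
$pdf = <<<'PDF'
3 0 obj
<< /Type/Pages/Count 2/Kids [4 0 R 6 0 R] >>
endobj
4 0 obj
<< /UselessTag/Type/Page/Parent 3 0 R >>
endobj
6 0 obj
<<
/Type /Page
/Parent 3 0 R
>>
endobj
PDF;

// Pattern from the answer: needs whitespace right after /Page.
$original = '~^(\d+) 0 obj(?:(?!^\d+ 0 obj$).)*?\/Type\s*\/Page\s.*?endobj$~sm';

// Variant (an assumption, not the answer's pattern): /Page must simply
// not be followed by another letter.
$variant = '~^(\d+) 0 obj(?:(?!^\d+ 0 obj$).)*?\/Type\s*\/Page(?![A-Za-z]).*?endobj$~sm';

preg_match_all($original, $pdf, $m);
$originalIds = $m[1];   // misses object 4

preg_match_all($variant, $pdf, $m);
$variantIds = $m[1];    // finds objects 4 and 6

print_r($originalIds);
print_r($variantIds);
```

Both patterns still assume that each object declaration sits alone on its own line and that endobj ends a line, as in the answer.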

Published: 2023-08-04 10:44:00

Article link: https://www.elefans.com/category/jswz/34/1414816.html