Getting at a substring of determined format, but undetermined content

I am scraping a website (unfortunately in Dutch). I extracted the snippet below:

gewezen op het beroep in cassatie van de Staatssecretaris van Financiën tegen de uitspraak van het Gerechtshof Arnhem-Leeuwarden van 5 juli 2016, nr. 15/01196, op het door [X] te [Z] (hierna: belanghebbende) ingestelde hoger beroep tegen een uitspraak van de Rechtbank Gelderland (AWB 14/7184)

I want to get the date (5 juli 2016) and the case number (nr. 15/01196). Since I am scraping thousands of pages, I can't match against an exact string. The date could be any date in this format, and the number could be anything. The format of the date is always the same; note that the month name is in Dutch. The format of the number is either XX/XXXX or XX/XXXXX, and there can also be extra letters between 'nr' and the number. The number is sometimes between brackets/parentheses and sometimes between commas, as in the example above.

So the output should be two lists that look like this:

date=[5 juli 2016] casenr=[nr. 15/01196] (or 15/01196)

In the above example you see another set of numbers with a similar format (AWB 14/7184). However, I know for a fact that the number I need is always the first one in this format to be mentioned. The date is also the only date mentioned in any of the snippets.

Is there a way to get this output based on such loose conditions? If they were always between commas, would it be easier?

Best answer

You could use regex for this.

    import re

    text = """gewezen op het beroep in cassatie van de Staatssecretaris van Financiën tegen de uitspraak van het Gerechtshof Arnhem-Leeuwarden van 5 juli 2016, nr. 15/01196, op het door [X] te [Z] (hierna: belanghebbende) ingestelde hoger beroep tegen een uitspraak van de Rechtbank Gelderland (AWB 14/7184)"""

    # Assuming the number always follows the date.
    # Group 1: day, month name, year. Group 2: the first digits/digits pair after it.
    m = re.search(r"(\d+\s+[a-z]+\s+\d+).*?(\d+/\d+)", text, re.I)
    if m:
        print(m.groups())  # ('5 juli 2016', '15/01196')
        print(m.group(1))  # 5 juli 2016
        print(m.group(2))  # 15/01196
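If the number does not always follow the date, the two could also be searched for independently. The following is only a sketch under the conditions stated in the question: the month is spelled out in Dutch, the year has four digits, and the wanted case number is the first XX/XXXX or XX/XXXXX in the snippet. The names `DUTCH_MONTHS`, `DATE_RE`, `CASE_RE`, `extract` and `SAMPLE` are made up for illustration.

```python
import re

# Dutch month names as they would be spelled in the date.
DUTCH_MONTHS = (
    "januari", "februari", "maart", "april", "mei", "juni",
    "juli", "augustus", "september", "oktober", "november", "december",
)

# Day (1-2 digits), a Dutch month name, a four-digit year (assumed).
DATE_RE = re.compile(
    r"\b\d{1,2}\s+(?:" + "|".join(DUTCH_MONTHS) + r")\s+\d{4}\b",
    re.IGNORECASE,
)

# Two digits, a slash, then four or five digits: XX/XXXX or XX/XXXXX.
# The lookarounds stop the pattern from matching inside a longer run of digits.
CASE_RE = re.compile(r"(?<!\d)\d{2}/\d{4,5}(?!\d)")


def extract(snippet):
    """Return (date, case_number) for one snippet; either may be None."""
    date_match = DATE_RE.search(snippet)
    case_match = CASE_RE.search(snippet)  # search() returns the first match
    date = date_match.group(0) if date_match else None
    case = case_match.group(0) if case_match else None
    return date, case


SAMPLE = (
    "gewezen op het beroep in cassatie van de Staatssecretaris van Financiën "
    "tegen de uitspraak van het Gerechtshof Arnhem-Leeuwarden van 5 juli 2016, "
    "nr. 15/01196, op het door [X] te [Z] (hierna: belanghebbende) ingestelde "
    "hoger beroep tegen een uitspraak van de Rechtbank Gelderland (AWB 14/7184)"
)

# Build the two lists asked for, one entry per scraped snippet.
dates, case_numbers = [], []
for snippet in [SAMPLE]:
    date, case = extract(snippet)
    dates.append(date)
    case_numbers.append(case)

print(dates)         # ['5 juli 2016']
print(case_numbers)  # ['15/01196']
```

Because the case number is matched by its own shape, it makes no difference whether it sits between commas or parentheses, or whether extra letters follow 'nr'. Snippets with no match yield `None`, which keeps the two lists aligned with the pages they came from.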
