How to remove 'BODY' tag in parsed xml text?
I'm a novice programmer. I ran into a problem while parsing some XML files with Python 3 and BeautifulSoup4. The parsed text comes out as:

"BODY { MARGIN: 0px; FONT-FAMILY: Malgun Gothic; COLOR: #000000; FONT-SIZE: 10pt}P { LINE-HEIGHT: 1.2; MARGIN-TOP: 0px; MARGIN-BOTTOM: 0px}LI { LINE-HEIGHT: 1.2; MARGIN-TOP: 0px; MARGIN-BOTTOM: 0px} blar - blar - blar "

Here 'blar - blar - blar' is the text I actually want to extract.
How can I remove the unwanted leading text from that string?
Best answer
I'd use a regex for this. If you can pin down the format of the string you want a bit more precisely, you could write a nicer regex.

    import re

    text = (
        "BODY { MARGIN: 0px; FONT-FAMILY: Malgun Gothic; COLOR: #000000; FONT-SIZE: 10pt}"
        "P { LINE-HEIGHT: 1.2; MARGIN-TOP: 0px; MARGIN-BOTTOM: 0px}"
        "LI { LINE-HEIGHT: 1.2; MARGIN-TOP: 0px; MARGIN-BOTTOM: 0px} blar - blar - blar"
    )

    # Skip the first three "...}" blocks (the CSS rules); the second capture
    # group holds everything after them. strip() drops the leading space.
    print(re.findall(r"(?:(?:(.*?)}){3})(.*)", text)[0][1].strip())

Here's a regex101 for you to look at:
https://regex101.com/r/m0Q3hL/1
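As a rough sketch of the "nicer regex" idea, the pattern below does not hard-code three blocks. It assumes the unwanted text is always a run of CSS rules of the form `selector { declarations }` at the start of the string, with no nested braces. The helper name `strip_leading_css` is made up for illustration and is not part of the original answer.

```python
import re

# One or more leading "selector { declarations }" blocks, with optional
# whitespace around them. Assumes the blocks contain no nested braces.
LEADING_CSS = re.compile(r"^\s*(?:[^{}]+\{[^{}]*\}\s*)+")


def strip_leading_css(text):
    """Remove CSS rule blocks from the start of text, however many there are."""
    return LEADING_CSS.sub("", text).strip()


sample = (
    "BODY { MARGIN: 0px; FONT-FAMILY: Malgun Gothic; COLOR: #000000; FONT-SIZE: 10pt}"
    "P { LINE-HEIGHT: 1.2; MARGIN-TOP: 0px; MARGIN-BOTTOM: 0px}"
    "LI { LINE-HEIGHT: 1.2; MARGIN-TOP: 0px; MARGIN-BOTTOM: 0px} blar - blar - blar"
)

print(strip_leading_css(sample))
```

A string with no leading CSS blocks is returned unchanged apart from trimmed whitespace. If the CSS text comes from a `<style>` element in the parsed markup (an assumption, since the question does not show the input file), removing that element with BeautifulSoup before extracting the text may avoid the problem altogether.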