Python RegEx: skipping the first few characters?

Question

Hey, I have a fairly basic question about regular expressions. I want to return just the text inside (and including) the body tags. I know the following isn't right, because it also matches all the characters before the opening body tag. How would you go about skipping those?

x = re.match('(.*<body).*?(</body>)', fileString)

Thanks!

Answer 1

I don't know Python, but here's a quick example thrown together using Beautiful Soup, which I often see recommended for Python HTML parsing.

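from bs4 import BeautifulSoup  # Beautiful Soup 4 (the bs4 package)

soup = BeautifulSoup(fileString, "html.parser")

# soup.body is the first <body> element, tags included; str(bodyTag) gives its markup.
bodyTag = soup.body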

That will (in theory) deal with all the complexities of HTML, which is very difficult with pure regex-based answers, because it's not what regex was designed for.
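To illustrate one of those complexities: a literal <body> pattern stops matching as soon as the opening tag carries an attribute or different capitalization. A minimal sketch, where the sample string and the loosened pattern are illustrative assumptions and not part of the original answer:

```python
import re

# Hypothetical sample: the opening tag has an attribute and upper-case letters.
html = '<html><BODY class="main">\nhello\n</body></html>'

# The literal pattern used in the regex answers finds nothing here.
strict = re.search(r'<body>.*?</body>', html, re.DOTALL)

# A loosened pattern tolerates attributes and case, but it would still break
# on comments, scripts, or a '>' inside an attribute value.
loose = re.search(r'<body\b[^>]*>.*?</body\s*>', html, re.DOTALL | re.IGNORECASE)

print(strict)
print(loose.group(0))
```

Each special case needs another patch to the pattern, which is the point the answer makes in favor of a real parser.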

Answer 2

Here is some example code that uses regex to find all the text between <body>...</body> tags. Although this demonstrates some features of Python's re module, note that the Beautiful Soup module is very easy to use and is a better tool if you plan on parsing HTML or XML. (See below for an example of how you could parse this using BeautifulSoup.)

#!/usr/bin/env python
import re

# Here we have a string with a multiline <body>...</body>
fileString='''baz<body>foo
baby foo
baby foo
baby foo
</body><body>bar</body>'''

# re.DOTALL tells re that '.' should match any character, including newlines.
x = re.search('(<body>.*?</body>)', fileString, re.DOTALL)
for match in x.groups():
    print(match)
# <body>foo
# baby foo
# baby foo
# baby foo
# </body>

If you wish to collect all matches, you could use re.findall:

print(re.findall('(<body>.*?</body>)', fileString, re.DOTALL))
# ['<body>foo\nbaby foo\nbaby foo\nbaby foo\n</body>', '<body>bar</body>']

And if you plan to use this pattern more than once, you can pre-compile it:

pat=re.compile('(<body>.*?</body>)', re.DOTALL)
print(pat.findall(fileString))
# ['<body>foo\nbaby foo\nbaby foo\nbaby foo\n</body>', '<body>bar</body>']
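If the positions of the matches are needed as well, a compiled pattern can also be iterated with finditer. A small sketch reusing the sample string from this answer (the spans list is only an illustrative name):

```python
import re

fileString = '''baz<body>foo
baby foo
baby foo
baby foo
</body><body>bar</body>'''

pat = re.compile(r'<body>.*?</body>', re.DOTALL)

# finditer yields match objects, so each hit exposes its text and its offsets.
spans = [(m.start(), m.end(), m.group(0)) for m in pat.finditer(fileString)]
for start, end, text in spans:
    print(start, end, repr(text))
```

The offsets can be used to slice fileString directly, for example to remove or replace each body block.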

And here is how you could do it with BeautifulSoup:

#!/usr/bin/env python
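# Beautiful Soup 4 import; the original answer used the older
# "from BeautifulSoup import BeautifulSoup" (Beautiful Soup 3).
from bs4 import BeautifulSoup

fileString='''baz<body>foo
baby foo
baby foo
baby foo
</body><body>bar</body>'''
# Naming the parser explicitly keeps the result independent of which parsers are installed.
soup = BeautifulSoup(fileString, "html.parser")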
print(soup.body)
# <body>foo
# baby foo
# baby foo
# baby foo
# </body>

print(soup.findAll('body'))
# [<body>foo
# baby foo
# baby foo
# baby foo
# </body>, <body>bar</body>]

Answer 3

You cannot parse HTML with regex. HTML is not a regular language. Use an HTML parser like lxml instead.
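As a rough sketch of the parser-based approach: the answer suggests lxml, but the example below swaps in the standard library's html.parser so that it runs without extra packages. The BodyExtractor class is a hypothetical helper written for this illustration, and it makes simplifying assumptions (comments are dropped, character references come back decoded).

```python
from html.parser import HTMLParser


class BodyExtractor(HTMLParser):
    """Collects the markup of the first <body> element, tags included."""

    def __init__(self):
        super().__init__()
        self.inside = False   # True while we are between <body> and </body>
        self.done = False     # True once the first body has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'body' and not self.done:
            self.inside = True
        if self.inside:
            # get_starttag_text() returns the start tag exactly as written.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, without an end tag.
        if self.inside:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.inside:
            self.parts.append('</%s>' % tag)
            if tag == 'body':
                self.inside = False
                self.done = True

    def handle_data(self, data):
        if self.inside:
            self.parts.append(data)


fileString = 'baz<body>foo\nbaby foo\n</body><body>bar</body>'
parser = BodyExtractor()
parser.feed(fileString)
parser.close()
body = ''.join(parser.parts)
print(body)
```

Unlike the literal regex, this keeps working when the opening tag has attributes, because the tokenizing is left to the parser.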

Answer 4

x = re.match('.*(<body>.*?</body>)', fileString)

Consider minimizing HTML parsing.

Answer 5

x = re.search('(<body>.*</body>)', fileString)
x.group(1)

Less typing than the re.match answers.
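The difference between the two functions, and between the greedy .* used here and the lazy .*? used in other answers, can be seen on a small single-line sample. This is a sketch with an assumed input string, not part of the original answer:

```python
import re

fileString = 'baz<body>foo</body><body>bar</body>'

# re.match only succeeds if the pattern matches starting at position 0,
# so the leading 'baz' makes it fail.
anchored = re.match(r'<body>.*?</body>', fileString)

# re.search scans forward to the first position where the pattern matches.
lazy = re.search(r'<body>.*?</body>', fileString).group(0)

# A greedy '.*' runs on to the LAST </body> in the string.
greedy = re.search(r'<body>.*</body>', fileString).group(0)

print(anchored)  # None
print(lazy)      # <body>foo</body>
print(greedy)    # <body>foo</body><body>bar</body>
```

So re.search removes the need for a leading .* to skip the prefix, while the choice between .* and .*? only matters when the input contains more than one closing tag.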

Answer 6

Does your fileString contain multiple lines? By default '.' does not match a newline, so in that case you need to either match the newlines explicitly:

x = re.match(r"(?:.|\n)*(<body>(?:.|\n)*</body>)", fileString)

or, more simply, pass the re.DOTALL flag:

x = re.match(r".*(<body>.*</body>)", fileString, re.DOTALL)

x.groups()[0] should contain your string if x is not None.
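A short sketch of the effect of the flag and of the None check, using an assumed multi-line sample string:

```python
import re

fileString = 'baz<body>foo\nbaby foo\n</body>'
pattern = r'.*(<body>.*</body>)'

# Without re.DOTALL, '.' stops at '\n', so the multi-line body is not matched.
without_flag = re.match(pattern, fileString)

# With re.DOTALL, '.' also matches newlines.
with_flag = re.match(pattern, fileString, re.DOTALL)

# The explicit alternation matches the same text without any flag.
explicit = re.match(r'(?:.|\n)*(<body>(?:.|\n)*</body>)', fileString)

# Check for None before reading the group to avoid an AttributeError.
body = with_flag.groups()[0] if with_flag is not None else None

print(without_flag)  # None
print(repr(body))
```

Both working variants capture the same text; the flag version is shorter to read and avoids the per-character alternation.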

6 answers

Answer 1: score 9, 2009-10-25 13:32:09
Answer 2: score 2, accepted, 2009-10-25 13:18:43
Answer 3: score 0, 2009-10-25 15:50:23
Answer 4: score -1, 2009-10-25 13:18:22
Answer 5: score -1, 2009-10-25 13:25:40
Answer 6: score -1, 2009-10-25 13:41:02
