Get two or more consecutive capitalized words from html text using regex
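I'm trying to extract all sequences of two or more words in which every word starts with a capital letter. I thought that '[A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+'
would work, but it adds characters that I can't explain.
Here is the full code: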
import re
import unittest
from bs4 import BeautifulSoup
html_page = """
<html>
<body>
<table>
<tr class=tb1><td>Lorem Ipsum dolor Sit amet</td></tr>
<tr class=tb1><td>Consectetuer adipiscing elit</td></tr>
<tr><td>Aliquam Tincidunt mauris eu Risus</td></tr>
<tr><td>Vestibulum Auctor Dapibus neque</td></tr>
</table>
</body>
</html>
"""
soup = BeautifulSoup(html_page)
text = soup.get_text()
def get_sequences(page):
    # raw string so that \s reaches the regex engine unchanged
    ex = re.compile(r'[A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+')
    sequences = re.findall(ex, page)
    return sequences
print(get_sequences(text))
The desired result is ['Lorem Ipsum', 'Aliquam Tincidunt', 'Vestibulum Auctor Dapibus']
but instead I get [u'Lorem Ipsum', u'Aliquam Tincidunt', u'Risus\nVestibulum Auctor Dapibus']
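Your approach is right, but it is not targeted enough. What you are looking for is two or more consecutive capitalized words within a single line. Because \s also matches a newline, the pattern joins 'Risus' at the end of one line to 'Vestibulum' at the start of the next. So you should run the regex on each line of the text separately. This does the trick: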
def get_sequences(page):
    ex = re.compile(r'[A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+')
    sequences = []
    # match one line at a time so a sequence cannot span a line break
    for x in page.splitlines():
        sequences.extend(re.findall(ex, x))
    return sequences
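To make the difference concrete, here is a minimal sketch that runs the same pattern once over the whole text and once line by line. The `text` string is a hard-coded stand-in for what `soup.get_text()` returns (an assumption; the real output may contain extra blank lines), and the names `whole_text` and `per_line` are only illustrative.

```python
import re

# Hypothetical stand-in for the text extracted from the table by soup.get_text()
text = (
    "Lorem Ipsum dolor Sit amet\n"
    "Consectetuer adipiscing elit\n"
    "Aliquam Tincidunt mauris eu Risus\n"
    "Vestibulum Auctor Dapibus neque\n"
)

pattern = re.compile(r'[A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+')

# \s matches "\n", so on the whole text a match can run across a line break
whole_text = pattern.findall(text)

# Matching each line separately keeps every sequence inside one line
per_line = [m for line in text.splitlines() for m in pattern.findall(line)]

print(whole_text)  # ['Lorem Ipsum', 'Aliquam Tincidunt', 'Risus\nVestibulum Auctor Dapibus']
print(per_line)    # ['Lorem Ipsum', 'Aliquam Tincidunt', 'Vestibulum Auctor Dapibus']
```

The only change between the two calls is the unit of text the pattern sees, which is why splitting on lines is enough to get the desired result.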
Python code:
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"[A-Z][a-z]+\s+[A-Z][a-z]+"
test_str = ("<html>\n"
            "<body>\n"
            "<table>\n"
            "<tr class=tb1><td>Lorem Ipsum dolor Sit amet</td></tr>\n"
            "<tr class=tb1><td>Consectetuer adipiscing elit</td></tr>\n"
            "<tr><td>Aliquam Tincidunt mauris eu Risus</td></tr>\n"
            "<tr><td>Vestibulum Auctor Dapibus neque</td></tr>\n"
            "</table>\n"
            "</body>\n"
            "</html>\n")
matches = re.finditer(regex, test_str, re.MULTILINE)
# print every pair of consecutive capitalized words
for match in matches:
    print(match.group())
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Result:
Lorem Ipsum
Aliquam Tincidunt
Vestibulum Auctor
See: http://ideone.com/iQev8D
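Note that the pattern above captures exactly two words, so 'Vestibulum Auctor Dapibus' is cut to 'Vestibulum Auctor'. As a possible variation (not part of the answer above), the sketch below repeats the second word as a group and allows only spaces or tabs between words, so a match can cover three or more words without crossing a line break. The names `multi_word` and `html_rows` are hypothetical.

```python
import re

# Variation on the answer's regex: one capitalized word followed by one or
# more further capitalized words, separated by spaces or tabs only
multi_word = re.compile(r"[A-Z][a-z]+(?:[ \t]+[A-Z][a-z]+)+")

# Hypothetical sample: the four table rows from the question
html_rows = (
    "<tr class=tb1><td>Lorem Ipsum dolor Sit amet</td></tr>\n"
    "<tr class=tb1><td>Consectetuer adipiscing elit</td></tr>\n"
    "<tr><td>Aliquam Tincidunt mauris eu Risus</td></tr>\n"
    "<tr><td>Vestibulum Auctor Dapibus neque</td></tr>\n"
)

found = multi_word.findall(html_rows)
print(found)  # ['Lorem Ipsum', 'Aliquam Tincidunt', 'Vestibulum Auctor Dapibus']
```

Because `[ \t]` never matches a newline, this form does not need the text to be split into lines first.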