[英]Python Regex to findall lines contains specific type of filenames
我有一個文本文件。 我想僅在文件名是.doc或.pdf類型的文件時獲取包含文件名的行。
例如,
<TR><TD ALIGN="RIGHT">4.</TD>
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="ABC.pdf"> On Complex Analytic Manifolds</a></TD>
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=72>L. Sam</TD>
</TR>
<TR><TD ALIGN="RIGHT">5.</TD>
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="DEF.doc"> On the Geometric theory of Fields</a>*</TD>
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=72>G.K. Ram</TD>
</TR>
使用python re.findall()
我想獲得以下幾行。
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="ABC.pdf"> On Complex Analytic Manifolds</a></TD>
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="DEF.doc"> On the Geometric theory of Fields</a>*</TD>
誰能告訴我在re.findall()中定義模式的任何可擴展方式?
您可以使用此正則表達式:
(.*?<a\shref=[\"']\w+(?:\.doc|\.pdf)[\"']>.*)
輸出:
>>> html = """<TR><TD ALIGN="RIGHT">4.</TD>
... <TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="ABC.pdf"> On Complex Analytic Manifolds</a></TD>
... <TD ALIGN="LEFT" VALIGN="TOP" WIDTH=72>L. Sam</TD>
... </TR>
... <TR><TD ALIGN="RIGHT">5.</TD>
... <TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="DEF.doc"> On the Geometric theory of Fields</a>*</TD>
... <TD ALIGN="LEFT" VALIGN="TOP" WIDTH=72>G.K. Ram</TD>
... </TR>"""
>>> re.findall("(.*?<a\shref=[\"']\w+(?:\.doc|\.pdf)[\"']>.*)", html)
['<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="ABC.pdf"> On Complex Analytic Manifolds</a></TD>', '<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="DEF.doc"> On the Geometric theory of Fields</a>*</TD>']
像這樣:
>>> strs="""<TR><TD ALIGN="RIGHT">4.</TD>
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="ABC.pdf"> On Complex Analytic Manifolds</a></TD>
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=72>L. Sam</TD>
</TR>
<TR><TD ALIGN="RIGHT">5.</TD>
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="DEF.doc"> On the Geometric theory of Fields</a>*</TD>
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=72>G.K. Ram</TD>
</TR>"""
>>> [x for x in strs.splitlines() if re.search(r"[a-zA-Z0-9]+\.(pdf|doc)",x)]
['<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="ABC.pdf"> On Complex Analytic Manifolds</a></TD>',
'<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="DEF.doc"> On the Geometric theory of Fields</a>*</TD>'
]
您可以同時使用BeautifulSoup
和re
。
import BeautifulSoup
import re
lines = soup.findAll('href', text = re.compile('your regex here'), attrs = {'class' : 'text'})
使用html代碼中的上層標頭class
。
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.