Python regex for parsing an HTML document
https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue
I am trying to find the names of the companies in order of revenue. It's a bit challenging because the titles all have differently formatted tags. If anyone could come up with a solution I'd be very grateful.
An example of my problem:
I'd like to match "Wal-Mart Stores Inc." and then "Sinopec Group" and so forth in order.
<td><a href="/wiki/Wal-Mart_Stores,_Inc." title="Wal-Mart Stores, Inc."class="mw-redirect">Wal-Mart Stores, Inc.</a></td>
...further in the document...
<td style="background: #ffffcc;"><a href="/wiki/Sinopec_Group" title="Sinopec Group" class="mw-redirect">Sinopec Group</a></td>
Thanks in advance.
Capture the content of the title attribute of the <a> tags in a group. The pattern checks that the tag is in the first table cell after the ranking.
regex = /th>\n<td.*?><a .* ?title="(.*?)".*>/
It works at the moment, but it's a fairly brittle method: any change to the page markup can break it. Check the Online Regex Tester for details on the regex.
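The pattern above is written as a slash-delimited literal. A minimal sketch of how it might be used from Python follows, passing the same pattern to re as a raw string. The html excerpt is a hypothetical stand-in for the real table rows (a ranking header cell followed by the company cell); the actual Wikipedia markup may differ.

```python
import re

# Hypothetical excerpt: each row has a ranking <th> followed by the company <td>.
html = (
    '<tr>\n<th>1</th>\n'
    '<td><a href="/wiki/Wal-Mart_Stores,_Inc." title="Wal-Mart Stores, Inc."'
    'class="mw-redirect">Wal-Mart Stores, Inc.</a></td>\n</tr>\n'
    '<tr>\n<th>2</th>\n'
    '<td style="background: #ffffcc;"><a href="/wiki/Sinopec_Group" '
    'title="Sinopec Group" class="mw-redirect">Sinopec Group</a></td>\n</tr>\n'
)

# Same pattern as in the answer, without the surrounding slashes.
pattern = re.compile(r'th>\n<td.*?><a .* ?title="(.*?)".*>')

# findall returns the contents of the single capturing group, in document order.
names = pattern.findall(html)
print(names)  # ['Wal-Mart Stores, Inc.', 'Sinopec Group']
```

Because the pattern depends on a newline directly between the closing th and the opening td, it stops matching as soon as that whitespace changes.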
This can be done easily with BeautifulSoup
from bs4 import BeautifulSoup as soup
x = ['<td><a href="/wiki/Wal-Mart_Stores,_Inc." title="Wal-Mart Stores, Inc."class="mw-redirect">Wal-Mart Stores, Inc.</a></td>', '<td style="background: #ffffcc;"><a href="/wiki/Sinopec_Group" title="Sinopec Group" class="mw-redirect">Sinopec Group</a></td>']
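# Name the parser explicitly to avoid bs4's "no parser specified" warning.
tmp = [soup(y, 'html.parser').find('td').find('a') for y in x]
# Keep only anchors that actually carry a title attribute.
lst = [a['title'].strip() for a in tmp if a is not None and a.has_attr('title')]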
print(lst)
If it's a single string, then you can use
x = '''<td><a href="/wiki/Wal-Mart_Stores,_Inc." title="Wal-Mart Stores, Inc."class="mw-redirect">Wal-Mart Stores, Inc.</a></td> <td style="background: #ffffcc;"><a href="/wiki/Sinopec_Group" title="Sinopec Group" class="mw-redirect">Sinopec Group</a></td>'''
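# Parse once, then take the first <a> inside every <td>.
tmp = [y.find('a') for y in soup(x, 'html.parser').find_all('td')]
# Skip cells with no link or no title attribute.
lst = [a['title'].strip() for a in tmp if a is not None and a.has_attr('title')]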
print(lst)
If you still want to use regex, then
<td.*?<a.*? title\s*=\s*"([^"]+).*?</td>
NOTE: the match is in the first capturing group.
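As a rough sketch, that pattern could be applied with re.findall to the same single-string sample used above. The variable names here are only illustrative.

```python
import re

# The two sample cells from the question, joined into one string.
x = ('<td><a href="/wiki/Wal-Mart_Stores,_Inc." title="Wal-Mart Stores, Inc."'
     'class="mw-redirect">Wal-Mart Stores, Inc.</a></td> '
     '<td style="background: #ffffcc;"><a href="/wiki/Sinopec_Group" '
     'title="Sinopec Group" class="mw-redirect">Sinopec Group</a></td>')

# The regex from the answer; group 1 holds the value of the title attribute.
pattern = re.compile(r'<td.*?<a.*? title\s*=\s*"([^"]+).*?</td>')

# With one capturing group, findall returns just the captured titles.
titles = pattern.findall(x)
print(titles)  # ['Wal-Mart Stores, Inc.', 'Sinopec Group']
```

By default the dot does not match newlines, so for a cell whose markup spans several lines you would probably need to compile with re.DOTALL.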