Python regex expression for parsing an HTML document

Question

https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue

I am trying to extract the names of the companies in order of revenue. It's a bit challenging because the tags around the names are all formatted differently. If anyone could come up with a solution I'd be very grateful.

An example of my problem:

I'd like to match "Wal-Mart Stores, Inc." and then "Sinopec Group" and so forth, in order.

<td><a href="/wiki/Wal-Mart_Stores,_Inc." title="Wal-Mart Stores, Inc."class="mw-redirect">Wal-Mart Stores, Inc.</a></td>

...further in the document...

<td style="background: #ffffcc;"><a href="/wiki/Sinopec_Group" title="Sinopec Group" class="mw-redirect">Sinopec Group</a></td>

Thanks in advance.

Answer 1

Capture the content of the title attribute of the a tag in a group. The pattern also checks that the cell is the first table cell after the ranking.

regex = /th>\n<td.*?><a .* ?title="(.*?)".*>/

It works on the page as it currently stands, but it's a fairly brittle method. See the Online Regex Tester for a detailed breakdown of the regex.
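A minimal sketch of how the pattern above might be applied from Python, assuming every company cell sits on the line directly after a th ranking cell. SAMPLE_HTML and company_names are hypothetical names made up for illustration, and the row layout is an assumption rather than a copy of the live page.

```python
import re

# Hypothetical sample: two rows shaped like the cells in the question, each
# preceded by a <th> ranking cell (assumed layout, not taken from the page).
SAMPLE_HTML = (
    '<tr>\n<th>1</th>\n'
    '<td><a href="/wiki/Wal-Mart_Stores,_Inc." title="Wal-Mart Stores, Inc."'
    'class="mw-redirect">Wal-Mart Stores, Inc.</a></td>\n</tr>\n'
    '<tr>\n<th>2</th>\n'
    '<td style="background: #ffffcc;"><a href="/wiki/Sinopec_Group" '
    'title="Sinopec Group" class="mw-redirect">Sinopec Group</a></td>\n</tr>\n'
)

# The same pattern, written as a Python raw string instead of a /.../ literal.
PATTERN = re.compile(r'th>\n<td.*?><a .* ?title="(.*?)".*>')


def company_names(html):
    """Return the captured title attributes in document order."""
    return PATTERN.findall(html)


print(company_names(SAMPLE_HTML))
# ['Wal-Mart Stores, Inc.', 'Sinopec Group']
```

The brittleness is easy to reproduce: if the newline between the th and td cells disappears, for instance because the markup was minified, the same call returns an empty list.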

Answer 2

This can be done easily with BeautifulSoup.

from bs4 import BeautifulSoup as soup

# Each list item is one table cell taken from the page.
x = ['<td><a href="/wiki/Wal-Mart_Stores,_Inc." title="Wal-Mart Stores, Inc."class="mw-redirect">Wal-Mart Stores, Inc.</a></td>', '<td style="background: #ffffcc;"><a href="/wiki/Sinopec_Group" title="Sinopec Group" class="mw-redirect">Sinopec Group</a></td>']
# Find the first <a> inside the <td> of each fragment.
tmp = [soup(y, 'html.parser').find('td').find('a') for y in x]
# Keep the title attribute of every link that has one.
lst = [a['title'].strip() for a in tmp if a.has_attr('title')]
print(lst)

If it's a single string, then you can use

x = '''<td><a href="/wiki/Wal-Mart_Stores,_Inc." title="Wal-Mart Stores, Inc."class="mw-redirect">Wal-Mart Stores, Inc.</a></td> <td style="background: #ffffcc;"><a href="/wiki/Sinopec_Group" title="Sinopec Group" class="mw-redirect">Sinopec Group</a></td>'''
# Parse the string once, then take the first <a> of every <td>.
tmp = [y.find('a') for y in soup(x, 'html.parser').find_all('td')]
# Keep the title attribute of every link that has one.
lst = [a['title'].strip() for a in tmp if a.has_attr('title')]
print(lst)
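Where installing BeautifulSoup is not an option, the same idea (take the first link of every table cell, in document order) can probably be approximated with the standard library's html.parser. This is a swapped-in technique rather than part of the answer above, and TitleCollector and first_link_titles are hypothetical names used only for this sketch.

```python
from html.parser import HTMLParser

# The two cells from the question, joined into one string.
CELLS = (
    '<td><a href="/wiki/Wal-Mart_Stores,_Inc." title="Wal-Mart Stores, Inc."'
    'class="mw-redirect">Wal-Mart Stores, Inc.</a></td> '
    '<td style="background: #ffffcc;"><a href="/wiki/Sinopec_Group" '
    'title="Sinopec Group" class="mw-redirect">Sinopec Group</a></td>'
)


class TitleCollector(HTMLParser):
    """Collect the title of the first <a> inside each <td>, in document order."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_cell = False    # True while the parser is inside a <td>
        self._cell_done = False  # True once the cell's first link was seen

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self._in_cell, self._cell_done = True, False
        elif tag == 'a' and self._in_cell and not self._cell_done:
            self._cell_done = True
            title = dict(attrs).get('title')
            if title:
                self.titles.append(title.strip())

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_cell = False


def first_link_titles(html):
    """Feed html to a fresh TitleCollector and return the collected titles."""
    collector = TitleCollector()
    collector.feed(html)
    collector.close()
    return collector.titles


print(first_link_titles(CELLS))
# ['Wal-Mart Stores, Inc.', 'Sinopec Group']
```

Like the BeautifulSoup version, this looks only at the first link of each cell, so a second link in the same cell is ignored.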

If you still want to use regex, then

<td.*?<a.*? title\s*=\s*"([^"]+).*?</td>

NOTE: the match is in the first capturing group.

Regex Demo
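As a rough sketch, the regex above could be used from Python like this. CELLS and titles_from_cells are hypothetical names, and the re.DOTALL flag is an added assumption (it is not part of the pattern as posted) so that a cell split over several lines still matches.

```python
import re

# The two cells from the question, joined into one string.
CELLS = (
    '<td><a href="/wiki/Wal-Mart_Stores,_Inc." title="Wal-Mart Stores, Inc."'
    'class="mw-redirect">Wal-Mart Stores, Inc.</a></td> '
    '<td style="background: #ffffcc;"><a href="/wiki/Sinopec_Group" '
    'title="Sinopec Group" class="mw-redirect">Sinopec Group</a></td>'
)

# Pattern from the answer; group 1 holds the value of the title attribute.
# re.DOTALL (an assumption added here) lets "." match line breaks as well.
TITLE_RE = re.compile(r'<td.*?<a.*? title\s*=\s*"([^"]+).*?</td>', re.DOTALL)


def titles_from_cells(html):
    """Return the first capturing group of every match, in document order."""
    return TITLE_RE.findall(html)


print(titles_from_cells(CELLS))
# ['Wal-Mart Stores, Inc.', 'Sinopec Group']
```

Note that the pattern picks up any table cell containing a link with a title attribute, so on a full page it would likely need to be restricted to the company column.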

2 answers

Answer 1
0 2016-05-26 03:10:57

Answer 2
0 accepted 2016-05-26 03:16:03
