
Python re.search: How to find a substring between two strings that must also contain a specific substring

I am writing a little script to get my F@H user data from a basic HTML page.

I want to locate my username on that page and the numbers before and after it.

All the data I want is between an HTML <tr> and </tr> tag pair.

I am currently using this:

re.search(r'<tr>(.*?)</tr>', htmlstring)

I know this works for any substring, as all Google results for my question show. The difference here is that I need it only when that substring also contains a specific word.

However, that only returns the first string between those two delimiters, not all of them.

This pattern occurs hundreds of times on the page. I suspect it doesn't get them all because I'm not handling the newline characters correctly, but I'm not sure.

If it returned all of them, I could at least go through each result.group() and pick out the one that contains my username, but I can't even do that.
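For reference, the behaviour described above can be reproduced offline. This is only a minimal sketch: html_string is a made-up stand-in for the real page, and the names rows and user_rows are illustrative. It shows that "." does not match newlines unless re.DOTALL is passed, that re.search stops at the first match, and that re.findall returns every match so the rows can be filtered afterwards.

```python
import re

# Hypothetical stand-in for the downloaded page: two rows, each spread over several lines.
html_string = """<table>
<tr>
<td>1</td><td>Alice</td><td>100</td>
</tr>
<tr>
<td>2</td><td>SteveMoody</td><td>200</td>
</tr>
</table>"""

pattern = r'<tr>(.*?)</tr>'

# Without re.DOTALL, "." stops at newlines, so these multi-line rows never match.
print(re.search(pattern, html_string))  # None

# re.search returns only the first match, even with re.DOTALL.
first = re.search(pattern, html_string, flags=re.DOTALL)

# re.findall returns every non-overlapping match (here: the captured group of each row).
rows = re.findall(pattern, html_string, flags=re.DOTALL)

# Keep only the rows that mention the user name.
user_rows = [row for row in rows if 'SteveMoody' in row]
print(user_rows)
```

Filtering after re.findall is the simplest route; a single pattern that does the filtering itself needs a lookahead, which is harder to read.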

I have been fiddling with different regex expressions for ages now but, to my great frustration, can't figure out which one I need.

TL;DR - I need a re.search() pattern that finds a substring between two words, that also contains a specific word.

If I understand correctly, something like this might work:
<tr>(?:(?:(?:(?!<\/tr>).)*?)\bWORD\b(?:.*?))<\/tr>

  • <tr> matches "<tr>"
  • (?:(?:(?!<\/tr>).)*?) matches any characters, as few as possible, without running past a "</tr>"
  • \bWORD\b matches WORD as a whole word
  • (?:.*?) matches anything, as few times as possible
  • <\/tr> matches "</tr>"

Sample
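The pattern can also be tried offline in Python. The following is a minimal sketch, assuming a made-up one-line sample string (the user names and numbers are hypothetical). It compares the pattern above with a naive version that lacks the (?!<\/tr>) lookahead, to show why the lookahead matters.

```python
import re

# Hypothetical one-line sample; "." does not cross newlines unless re.DOTALL is passed.
sample = (
    '<tr><td>1</td><td>OtherUser</td><td>10</td></tr>'
    '<tr><td>2</td><td>WORD</td><td>20</td></tr>'
    '<tr><td>3</td><td>LastUser</td><td>30</td></tr>'
)

# Pattern from the answer: the lookahead refuses to step over a "</tr>".
tempered = r'<tr>(?:(?:(?:(?!<\/tr>).)*?)\bWORD\b(?:.*?))<\/tr>'
# Naive version for comparison: nothing stops ".*?" from crossing into the next row.
naive = r'<tr>.*?\bWORD\b.*?<\/tr>'

tempered_match = re.search(tempered, sample)
naive_match = re.search(naive, sample)

print(tempered_match.group(0))  # only the row that contains WORD
print(naive_match.group(0))     # starts at the first <tr> and spills across two rows
```

If the rows in the real page span several lines, flags=re.DOTALL would also be needed so that "." matches newlines.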

There are a few ways to do it, but I prefer the pandas way:


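from io import StringIO
from urllib import request

import pandas as pd  # third-party: you need to install pandas (read_html also needs an HTML parser such as lxml)

base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'

# Download the page and decode the bytes to text (assuming the page is UTF-8)
html_text = request.urlopen(url=base_url).read().decode('utf-8', errors='replace')

# read_html returns a list of DataFrames, one per matching <table>
tables = pd.read_html(StringIO(html_text), attrs={'class': 'members'})
web_df = tables[0].set_index(keys=['Name'])
# print(web_df)

user_name_to_find_in_table = 'SteveMoody'
# .loc looks the user up by the 'Name' index and keeps the column headers
user_name_df = web_df.loc[user_name_to_find_in_table]
print(user_name_df)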

Beyond that, there are plenty of other ways to do this: using just BeautifulSoup's find or CSS selectors, or maybe re, as Peter suggests.

Using BeautifulSoup's find method together with re, you can do it the following way:

import re
from urllib import request

from bs4 import BeautifulSoup as bs  # third-party: you need to install beautifulsoup4

base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'

web_request = request.urlopen(url=base_url).read()

page_soup = bs(web_request, 'lxml')  # the lxml parser must be installed as well

user_name_to_find_in_table = 'SteveMoody'

# Find the first <td> whose text contains the user name (case-insensitive),
# then climb up to its enclosing <tr>. find() returns None if nothing matches.
row_tag = page_soup.find(
    lambda t: t.name == "td"
              and re.search(re.escape(user_name_to_find_in_table), t.text, flags=re.I)
).find_parent(name="tr")

# get_text() already drops the tags; strip('tr') would eat 't'/'r' characters from the data
print(row_tag.get_text().strip())

Using BeautifulSoup and CSS selectors (no re, only BeautifulSoup):

from urllib import request

from bs4 import BeautifulSoup as bs  # third-party: you need to install beautifulsoup4

base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'

web_request = request.urlopen(url=base_url).read()

page_soup = bs(web_request, 'lxml')  # the lxml parser must be installed as well

user_name_to_find_in_table = 'SteveMoody'

# :-soup-contains() replaces the deprecated :contains() in recent soupsieve versions
row_tag = page_soup.select_one(f'tr:has(> td:-soup-contains("{user_name_to_find_in_table}"))')

# get_text() already drops the tags, so a plain strip() is enough
print(row_tag.get_text().strip())

In your case I would favor the pandas example, as you keep the headers and can easily get other stats, and it runs very quickly.

Using re:

So far, the best input is Peter's comment (Link), so I just adapted it to Python code (happy to have it edited), as this solution doesn't need any extra libraries installed.

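import re
from urllib import request

base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'

# Download the page and decode the bytes to text (assuming the page is UTF-8)
html_text = request.urlopen(url=base_url).read().decode('utf-8', errors='replace')
user_name_to_find_in_table = 'SteveMoody'

# re.escape keeps any special characters in the user name literal
re_pattern = rf'<tr>(?:(?:(?:(?!<\/tr>).)*?)\b{re.escape(user_name_to_find_in_table)}\b(?:.*?))<\/tr>'
# re.DOTALL lets "." match newlines, so rows spanning several lines are found
res = re.search(pattern=re_pattern, string=html_text, flags=re.DOTALL)

print(res.group(0) if res else 'User name not found')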
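Since the goal is the numbers around the user name, the matched row can be split into its cells with a second regex. What follows is a minimal sketch that runs offline, not a definitive implementation: find_user_row is a hypothetical helper name, sample_html is made-up markup, and the real page's cells may carry attributes or nested tags that need extra handling.

```python
import re


def find_user_row(html_text, user_name):
    """Return the text of each <td> in the first <tr> containing user_name, or None."""
    # Same idea as the pattern above: never step over a "</tr>" before the name.
    row_pattern = rf'<tr>(?:(?!</tr>).)*?\b{re.escape(user_name)}\b.*?</tr>'
    row_match = re.search(row_pattern, html_text, flags=re.DOTALL)
    if row_match is None:
        return None
    # Pull the text out of every cell of the matched row.
    cells = re.findall(r'<td[^>]*>(.*?)</td>', row_match.group(0), flags=re.DOTALL)
    return [cell.strip() for cell in cells]


# Hypothetical sample markup; the real page layout may differ.
sample_html = """
<tr>
  <td>1</td><td>SomeoneElse</td><td>999</td>
</tr>
<tr>
  <td>2</td><td>SteveMoody</td><td>12345</td>
</tr>
"""

print(find_user_row(sample_html, 'SteveMoody'))  # ['2', 'SteveMoody', '12345']
print(find_user_row(sample_html, 'NoSuchUser'))  # None
```

Returning None for a missing user avoids the AttributeError that calling .group() on a failed search would raise.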


Helpful link on using variables in a regex: stackflow
