re.findall() function returning empty list

I have the following code:

pattern = re.compile(r"^\s\s<strong>.*</strong>$")
matches = []

with open(r"spotifycharts.html", "rt") as current_file:
    content = current_file.read()

for line in content:
    matches = findall(pattern, line)

print(matches)

I have checked that the pattern works and matches the strings found in the HTML file. However, the findall() function still returns an empty list. Is there something I've done wrong here?

EDIT: An error was pointed out and I fixed it. The matches list is still empty once the code is run.

pattern = re.compile(r"^\s\s<strong>.*</strong>$")
matches = []

with open(r"spotifycharts.html", "rt") as current_file:
    content = current_file.read()

for line in content:
    if findall(pattern, line) != []:
        matches.append(findall(pattern, line))

print(matches)

Here is a shorter piece of code that produces the same problem. Hope this helps.

matches = []
with open(r"spotifycharts.html", "rt") as current_file:
    content = current_file.read()

matches = findall("^\s\s<strong>.*</strong>$", content)

print(matches)

Source HTML: view-source:https://spotifycharts.com/regional/au/daily/latest

Using regex to parse HTML is like using a baseball bat to clean someone's teeth. Baseball bats are nice tools, but they solve different problems than dental scalers.

BeautifulSoup is an HTML parsing library for Python, which you can install with pip install beautifulsoup4:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> html = requests.get("https://spotifycharts.com/regional/au/daily/latest").text
>>> bs = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid the "no parser specified" warning
>>> [e.text for e in bs.select(".chart-table-track strong")][:3]
['WAP (feat. Megan Thee Stallion)', 'Mood (feat. Iann Dior)', 'Head & Heart (feat. MNEK)']

Here we use a CSS selector ".chart-table-track strong" to extract all of the song titles (I assume that's the data you want...).
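
To see what that selector picks up without hitting the network, here is a minimal sketch on an inline snippet. The HTML below is a made-up approximation of the chart table, not a copy of the real page, so treat the markup as an assumption:

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only approximates the chart page's structure.
html = """
<p><strong>Not a track</strong></p>
<table>
  <tr>
    <td class="chart-table-track">
      <strong>WAP (feat. Megan Thee Stallion)</strong>
      <span>by Cardi B</span>
    </td>
  </tr>
  <tr>
    <td class="chart-table-track">
      <strong>Mood (feat. Iann Dior)</strong>
      <span>by 24kGoldn</span>
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# ".chart-table-track strong" selects <strong> elements that are descendants
# of any element with class "chart-table-track"; the <strong> inside <p>
# is skipped because it has no such ancestor.
titles = [e.text for e in soup.select(".chart-table-track strong")]
print(titles)
```

Because the selector keys on the element structure rather than on whitespace or line breaks, it keeps working if the page's indentation changes.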


Another approach is to use Pandas:

>>> import pandas as pd
>>> import requests # not needed if you have html5lib
>>> html = requests.get("https://spotifycharts.com/regional/au/daily/latest").text
>>> df = pd.read_html(html)[0]
>>> df[["Track", "Artist"]] = df["Track"].str.split("  by ", expand=True)
>>> df.drop(columns=df.columns[[0, 1, 2]])
                                Track  Streams      Artist
0     WAP (feat. Megan Thee Stallion)   311167     Cardi B
1              Mood (feat. Iann Dior)   295922    24kGoldn
2           Head & Heart (feat. MNEK)   190025  Joel Corry
3    Savage Love (Laxed - Siren Beat)   163776   Jawsh 685
4                         Breaking Me   150560       Topic
..                                ...      ...         ...
195                           Daisies    31092  Katy Perry
196                                21    31088      Polo G
197                     Nobody's Love    31047    Maroon 5
198        Ballin' (with Roddy Ricch)    30862     Mustard
199          Dancing in the Moonlight    30853   Toploader

[200 rows x 3 columns]

I expect that there is other stuff on the lines you're trying to match. Your expression only allows for EXACTLY a pair of begin/end tags on a line, with stuff between them, but nothing before or after them on the same line. I bet you want to use this expression:

"\s\s<strong>.*?</strong>"
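
A small sketch of the difference, run against a made-up sample string (the real page's markup and indentation may differ). It also shows two things that are not stated in this answer but are standard Python behavior: iterating over a str yields single characters rather than lines, and ^ and $ only anchor at line boundaries when re.MULTILINE is passed.

```python
import re

# Hypothetical sample text; it only imitates the kind of lines in the page.
sample = (
    '<td class="chart-table-track">\n'
    '  <strong>WAP (feat. Megan Thee Stallion)</strong>\n'
    '  <span>by Cardi B</span>\n'
    '</td>\n'
    '<td class="chart-table-track">\n'
    '    <strong>Mood (feat. Iann Dior)</strong> <span>by 24kGoldn</span>\n'
    '</td>\n'
)

anchored = r"^\s\s<strong>.*</strong>$"   # pattern from the question
unanchored = r"\s\s<strong>.*?</strong>"  # pattern suggested above

# "for line in content" over a str walks characters, not lines.
chars = [c for c in sample][:3]

# Without re.MULTILINE, ^ and $ anchor to the whole string, so nothing matches.
no_flag = re.findall(anchored, sample)

# With re.MULTILINE, only lines consisting of exactly two whitespace
# characters followed by the tag pair match; the second title is missed.
with_flag = re.findall(anchored, sample, flags=re.MULTILINE)

# Dropping the anchors and using a non-greedy .*? finds both tag pairs.
loose = re.findall(unanchored, sample)

print(chars)
print(no_flag)
print(with_flag)
print(loose)
```

On this sample, the anchored pattern finds nothing without the flag and one title with it, while the unanchored pattern finds both.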
