
re.findall() function returning empty list

I have the following code:

pattern = re.compile(r"^\s\s<strong>.*</strong>$")
matches = []

with open(r"spotifycharts.html", "rt") as current_file:
    content = current_file.read()

for line in content:
    matches = findall(pattern, line)

print(matches)

I have checked that the pattern works and matches with the strings found in the html file. However, the findall() function still returns an empty list. Is there something I've done wrong here?

EDIT: An error was pointed out and I fixed it. The matches list is still empty when the code is run.

pattern = re.compile(r"^\s\s<strong>.*</strong>$")
matches = []

with open(r"spotifycharts.html", "rt") as current_file:
    content = current_file.read()

for line in content:
    if findall(pattern, line) != []:
        matches.append(findall(pattern, line))

print(matches)

Here is a shorter snippet that reproduces the same problem. Hope this helps.

matches = []
with open(r"spotifycharts.html", "rt") as current_file:
    content = current_file.read()

matches = findall("^\s\s<strong>.*</strong>$", content)

print(matches)

Source HTML: view-source: https://spotifycharts.com/regional/au/daily/latest

Using regex to parse HTML is like using a baseball bat to clean someone's teeth. Baseball bats are nice tools but they solve different problems than dental scalers.

BeautifulSoup is a third-party HTML parsing library for Python, which you can install with pip install beautifulsoup4 :

>>> import requests
>>> from bs4 import BeautifulSoup
>>> html = requests.get("https://spotifycharts.com/regional/au/daily/latest").text
>>> bs = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid a warning
>>> [e.text for e in bs.select(".chart-table-track strong")][:3]
['WAP (feat. Megan Thee Stallion)', 'Mood (feat. Iann Dior)', 'Head & Heart (feat. MNEK)']

Here we use a CSS selector ".chart-table-track strong" to extract all of the song titles (I assume that's the data you want...).
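
If installing a third-party package is not an option, roughly the same idea can be sketched with the standard library's html.parser module. This is only an illustration, not the method used above: the StrongTextParser class and the sample markup are made up for the example, and only the chart-table-track class name comes from the selector.

```python
from html.parser import HTMLParser


class StrongTextParser(HTMLParser):
    """Collects the text of <strong> tags found inside td.chart-table-track cells."""

    def __init__(self):
        super().__init__()
        self.in_track_cell = False
        self.in_strong = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and "chart-table-track" in classes:
            self.in_track_cell = True
        elif tag == "strong" and self.in_track_cell:
            self.in_strong = True
            self.titles.append("")  # start a new title

    def handle_endtag(self, tag):
        if tag == "strong":
            self.in_strong = False
        elif tag == "td":
            self.in_track_cell = False

    def handle_data(self, data):
        if self.in_strong:
            self.titles[-1] += data


# Hypothetical sample markup; the real page may be structured differently.
sample_html = (
    '<table><tr>'
    '<td class="chart-table-track">'
    '<strong>Head &amp; Heart (feat. MNEK)</strong> <span>by Joel Corry</span>'
    '</td>'
    '<td><strong>not a track title</strong></td>'
    '</tr></table>'
)

parser = StrongTextParser()
parser.feed(sample_html)
parser.close()
titles = parser.titles
print(titles)
```

The parser decodes entities such as &amp; for you and ignores the <strong> tag outside the track cell, which a plain regex would not do.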


Another approach is to use Pandas:

>>> import pandas as pd
>>> import requests # not needed if you have html5lib
>>> html = requests.get("https://spotifycharts.com/regional/au/daily/latest").text
>>> df = pd.read_html(html)[0]
>>> df[["Track", "Artist"]] = df["Track"].str.split("  by ", expand=True)
>>> df.drop(columns=df.columns[[0, 1, 2]])
                                Track  Streams      Artist
0     WAP (feat. Megan Thee Stallion)   311167     Cardi B
1              Mood (feat. Iann Dior)   295922    24kGoldn
2           Head & Heart (feat. MNEK)   190025  Joel Corry
3    Savage Love (Laxed - Siren Beat)   163776   Jawsh 685
4                         Breaking Me   150560       Topic
..                                ...      ...         ...
195                           Daisies    31092  Katy Perry
196                                21    31088      Polo G
197                     Nobody's Love    31047    Maroon 5
198        Ballin' (with Roddy Ricch)    30862     Mustard
199          Dancing in the Moonlight    30853   Toploader

[200 rows x 3 columns]

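I expect that there is other content on the lines you're trying to match. Your expression only matches a line that consists of EXACTLY two whitespace characters followed by one pair of begin/end tags with text between them, and nothing before or after them on the same line. I bet you want to use this expression instead, which drops the anchors and uses a non-greedy .*? so that each match stops at the first closing tag:

r"\s\s<strong>.*?</strong>"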

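A minimal sketch of why the attempts in the question come back empty. The sample string below is made up to resemble the chart markup, so treat it as an assumption about the real page. Two things go wrong: iterating over the result of read() yields single characters rather than lines, and without re.MULTILINE the ^ and $ anchors only match at the start and end of the whole string.

```python
import re

# Hypothetical sample resembling the chart markup; the real page may differ.
content = (
    '<td class="chart-table-track">\n'
    "  <strong>WAP (feat. Megan Thee Stallion)</strong>\n"
    "  <span>by Cardi B</span>\n"
    "</td>\n"
    '<td class="chart-table-track">\n'
    "  <strong>Mood (feat. Iann Dior)</strong>\n"
    "  <span>by 24kGoldn</span>\n"
    "</td>\n"
)

# Iterating over a str yields single characters, not lines.
chars = [c for c in content][:3]

# Without re.MULTILINE, ^ and $ anchor to the whole string, so nothing matches.
anchored = re.findall(r"^\s\s<strong>.*</strong>$", content)

# With re.MULTILINE, ^ and $ match at every line boundary.
multiline = re.findall(r"^\s\s<strong>(.*?)</strong>$", content, re.MULTILINE)

# Dropping the anchors also works on the whole string.
unanchored = re.findall(r"\s\s<strong>(.*?)</strong>", content)

print(chars)
print(anchored)
print(multiline)
print(unanchored)
```

To keep a line-by-line loop, iterate over the file object itself or over content.splitlines() instead of over content.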