Python link scraper regex works when searching for only one extension type, but fails when matching more than one extension type
This is the test link I am using for this project: https://www.dropbox.com/sh/4cgwf2b6gk4bex4/AADtM1GDYgPDdv8QP6JdSOkba?dl=0
Now, the code below works when matching for .mp3 only (line 8), and outputs the plain links to a text file as asked.
import re
import requests
url = input('Enter URL: ')
html_content = requests.get(url).text
# Define the regular expression pattern to match links
pattern = re.compile(r'http[s]?://[^\s]+\.mp3')
# Find all matches in the HTML content
links = re.findall(pattern, html_content)
# Remove duplicates
links = list(set(links))
# Write the extracted links to a text file
with open('links.txt', 'w') as file:
    for link in links:
        file.write(link + '\n')
The issue is, the test link above contains not only .mp3 files but also .flac and .wav files.
When I change the code (line 8) to the following to scrape and return all links containing those extensions (.mp3, .flac, .wav), it outputs a text file containing just "mp3", "flac" and "wav". No links.
import re
import requests
url = input('Enter URL: ')
html_content = requests.get(url).text
# Define the regular expression pattern to match links
pattern = re.compile(r'http[s]?://[^\s]+\.(mp3|flac|wav)')
# Find all matches in the HTML content
links = re.findall(pattern, html_content)
# Remove duplicates
links = list(set(links))
# Write the extracted links to a text file
with open('links.txt', 'w') as file:
    for link in links:
        file.write(link + '\n')
I've been trying to understand where the error is. Is it in the regex, or something else? I can't figure it out.
Thank you.
That's because in your second code, the parentheses capture only the extension of the URL/file name: when a pattern contains capturing groups, re.findall returns the group contents rather than the whole match. So, one way to fix that is to add an outer capturing group, like below (read the comments):
import re
import requests
url = input('Enter URL: ')
html_content = requests.get(url).text
# Define the regular expression pattern to match links
pattern = re.compile(r'(http[s]?://[^\s]+\.(mp3|flac|wav))') # <- added outer parentheses
# Find all matches in the HTML content
links = re.findall(pattern, html_content)
# Remove duplicates
links = list(set(links))
# Write the extracted links to a text file
with open('links.txt', 'w') as file:
    for link in links:
        file.write(link[0] + '\n') # <- link[0] instead of link
Output:
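As an aside (not part of the original answer): another common fix is to make the inner group non-capturing with `(?:...)`, so `re.findall` returns the full matched URLs directly and no tuple indexing is needed. A minimal sketch, using a made-up inline HTML snippet in place of the live request:

```python
import re

# Made-up HTML snippet standing in for requests.get(url).text.
html_content = (
    '<a href="https://example.com/a/song.mp3">song</a> '
    '<a href="https://example.com/a/take.flac">take</a> '
    '<a href="https://example.com/a/voice.wav">voice</a>'
)

# With a capturing group, findall returns only the group contents:
exts = re.findall(r'http[s]?://[^\s"]+\.(mp3|flac|wav)', html_content)
print(exts)  # ['mp3', 'flac', 'wav']

# With a non-capturing group (?:...), findall returns the whole match.
# (Excluding '"' from the character class is a small robustness tweak
# so the match stops at the closing quote of the href attribute.)
links = re.findall(r'http[s]?://[^\s"]+\.(?:mp3|flac|wav)', html_content)
print(links)
```

With the non-capturing version, the write loop can stay `file.write(link + '\n')` with no indexing.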