
Python link scraper regex works when only searching for 1 extension type, but fails when matching more than one extension type

This is the test link I am using for this project: https://www.dropbox.com/sh/4cgwf2b6gk4bex4/AADtM1GDYgPDdv8QP6JdSOkba?dl=0

Now, the code below works when matching for .mp3 only (line 8), and writes the plain links to a text file as intended.

import re
import requests

url = input('Enter URL: ')
html_content = requests.get(url).text

# Define the regular expression pattern to match links
pattern = re.compile(r'http[s]?://[^\s]+\.mp3')

# Find all matches in the HTML content
links = re.findall(pattern, html_content)

# Remove duplicates
links = list(set(links))

# Write the extracted links to a text file
with open('links.txt', 'w') as file:
    for link in links:
        file.write(link + '\n')

The issue is that the test link above contains not only .mp3 files but also .flac and .wav files.

When I change the pattern (line 8) to the following, to scrape and return all links with the extensions above (.mp3, .flac, .wav), it outputs a text file containing only "mp3", "flac" and "wav". No links.

import re
import requests

url = input('Enter URL: ')
html_content = requests.get(url).text

# Define the regular expression pattern to match links
pattern = re.compile(r'http[s]?://[^\s]+\.(mp3|flac|wav)')

# Find all matches in the HTML content
links = re.findall(pattern, html_content)

# Remove duplicates
links = list(set(links))

# Write the extracted links to a text file
with open('links.txt', 'w') as file:
    for link in links:
        file.write(link + '\n')

I've been trying to understand where the error is. Is it the regex, or something else? I can't figure it out.

Thank you.

That's because in your second snippet the parentheses capture only the extension of the URL/file, and when a pattern contains a capturing group, re.findall returns the captured group rather than the whole match. One way to fix that is to add another, outer capturing group, as below (read the comments):

import re
import requests

url = input('Enter URL: ')
html_content = requests.get(url).text

# Define the regular expression pattern to match links
pattern = re.compile(r'(http[s]?://[^\s]+\.(mp3|flac|wav))') # <- added outer parenthesis

# Find all matches in the HTML content
links = re.findall(pattern, html_content)

# Remove duplicates
links = list(set(links))

# Write the extracted links to a text file
with open('links.txt', 'w') as file:
    for link in links:
        file.write(link[0] + '\n') # <- link[0] (the full URL) instead of link

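To see the difference without hitting the network, here is a minimal sketch run on a made-up sample string. The example.com URLs are hypothetical stand-ins for the downloaded HTML, not links from the Dropbox page. It shows what re.findall returns with one capturing group versus an outer plus inner group:

```python
import re

# Hypothetical sample text standing in for the downloaded HTML.
sample = 'a https://example.com/a.mp3 b https://example.com/b.flac c'

# One capturing group: findall returns only what the group matched.
one_group = re.findall(r'https?://[^\s]+\.(mp3|flac|wav)', sample)

# Outer and inner group: findall returns tuples of (whole URL, extension).
two_groups = re.findall(r'(https?://[^\s]+\.(mp3|flac|wav))', sample)

print(one_group)   # ['mp3', 'flac']
print(two_groups)  # [('https://example.com/a.mp3', 'mp3'), ('https://example.com/b.flac', 'flac')]
```

With the outer group in place, each result is a tuple, which is why the write loop has to take element 0.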

Another way to solve this is to use a non-capturing group, written with ?:, so that findall returns the whole match:

pattern = re.compile(r'http[s]?://[^\s]+\.(?:mp3|flac|wav)')

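As a rough check of the non-capturing version, the sketch below again uses a hypothetical sample string (the example.com URLs are assumptions for illustration). It also shows a third option that was not part of the answer above: keeping the original capturing pattern and reading the whole match with finditer and group(0).

```python
import re

# Hypothetical sample text; the first URL appears twice to show de-duplication.
sample = ('x https://example.com/a.mp3 y https://example.com/b.wav '
          'z https://example.com/a.mp3')

# Non-capturing group: findall returns the full matched strings.
pattern = re.compile(r'https?://[^\s]+\.(?:mp3|flac|wav)')
links = sorted(set(pattern.findall(sample)))

# Alternative: keep the capturing group and take group(0), the whole match.
capturing = re.compile(r'https?://[^\s]+\.(mp3|flac|wav)')
links_via_finditer = sorted({m.group(0) for m in capturing.finditer(sample)})

print(links)
print(links_via_finditer)
```

Both lists contain the two distinct URLs as plain strings, so the original write loop (file.write(link + '\n')) works unchanged with either approach.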
