Python link scraper regex works when searching for only one extension type, but fails when matching more than one extension type
This is the test link I am using for this project: https://www.dropbox.com/sh/4cgwf2b6gk4bex4/AADtM1GDYgPDdv8QP6JdSOkba?dl=0
Now, the code below works when matching for .mp3 only (line 8), and outputs the plain links to a text file as asked.
import re
import requests
url = input('Enter URL: ')
html_content = requests.get(url).text
# Define the regular expression pattern to match links
pattern = re.compile(r'http[s]?://[^\s]+\.mp3')
# Find all matches in the HTML content
links = re.findall(pattern, html_content)
# Remove duplicates
links = list(set(links))
# Write the extracted links to a text file
with open('links.txt', 'w') as file:
    for link in links:
        file.write(link + '\n')
The issue is, the test link above contains not only .mp3 files but also .flac and .wav files.
When I change the code (line 8) to the following to scrape and return all links containing those extensions (.mp3, .flac, .wav), it outputs a text file containing just "mp3", "flac" and "wav". No links.
import re
import requests
url = input('Enter URL: ')
html_content = requests.get(url).text
# Define the regular expression pattern to match links
pattern = re.compile(r'http[s]?://[^\s]+\.(mp3|flac|wav)')
# Find all matches in the HTML content
links = re.findall(pattern, html_content)
# Remove duplicates
links = list(set(links))
# Write the extracted links to a text file
with open('links.txt', 'w') as file:
    for link in links:
        file.write(link + '\n')
I've been trying to understand where the error is. Is it in the regex, or something else? I can't figure it out.
Thank you.
That's because in your second code, the parentheses capture only the extension of the URL/file name: when a pattern contains capturing groups, re.findall returns the group contents rather than the whole match. So, one way to fix that is to add an outer capturing group, like below (read the comments):
import re
import requests
url = input('Enter URL: ')
html_content = requests.get(url).text
# Define the regular expression pattern to match links
pattern = re.compile(r'(http[s]?://[^\s]+\.(mp3|flac|wav))') # <- added outer parentheses
# Find all matches in the HTML content
links = re.findall(pattern, html_content)
# Remove duplicates
links = list(set(links))
# Write the extracted links to a text file
with open('links.txt', 'w') as file:
    for link in links:
        file.write(link[0] + '\n') # <- link[0] instead of link
Output:
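As an aside (not part of the original answer): another common fix is to make the inner group non-capturing with `(?:...)`, so `re.findall` returns the full matched URLs directly and no tuple indexing is needed. A minimal sketch, using a made-up inline HTML snippet in place of the live request:

```python
import re

# Made-up HTML snippet standing in for requests.get(url).text.
html_content = (
    '<a href="https://example.com/a/song.mp3">song</a> '
    '<a href="https://example.com/a/take.flac">take</a> '
    '<a href="https://example.com/a/voice.wav">voice</a>'
)

# With a capturing group, findall returns only the group contents:
exts = re.findall(r'http[s]?://[^\s"]+\.(mp3|flac|wav)', html_content)
print(exts)  # ['mp3', 'flac', 'wav']

# With a non-capturing group (?:...), findall returns the whole match.
# (Excluding '"' from the character class is a small robustness tweak
# so the match stops at the closing quote of the href attribute.)
links = re.findall(r'http[s]?://[^\s"]+\.(?:mp3|flac|wav)', html_content)
print(links)
```

With the non-capturing version, the write loop can stay `file.write(link + '\n')` with no indexing.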