How to extract all image URLs from a local text file using Beautiful Soup?
I'm new to Python and Beautiful Soup (BS). I have a text file in which every line has the format shown below. I want to extract the image URLs from these lines using BS. The file is plain text, not HTML.
something something <img src="https://example.com/img1.jpg" >
something else <img src="https://example.com/img2.jpg" >
The following code doesn't do anything and just hangs. How can I fix this?
from bs4 import BeautifulSoup

def readFile(fileName):
    with open(fileName, 'r') as fp:
        soup = BeautifulSoup(fp.read(), 'html.parser')
        images = soup.findAll('img')
        print("images: ", images)
        for image in images:
            print(image['src'])

readFile("./imagefile.txt")
Since your input data is not in HTML format, I don't think BeautifulSoup is the way to go, though I would be happy to be wrong about that. I would start with the re module as a first step.
import re
text = '''
something something <img src="https://example.com/img1.jpg" >
something else <img src="https://example.com/img2.jpg" >
'''
for url in re.findall(r"<img[^>]* src=\"([^\"]*)\"[^>]*>", text):
    print(url)
This should give you:
https://example.com/img1.jpg
https://example.com/img2.jpg
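Since the question is about a file on disk rather than an inline string, the same pattern can be applied line by line. The following is only a sketch: the function name extract_image_urls is made up for illustration, and the UTF-8 encoding is an assumption about the file.

```python
import re

# Same pattern as in the snippet above, compiled once and reused for every line.
IMG_SRC = re.compile(r'<img[^>]* src="([^"]*)"[^>]*>')


def extract_image_urls(path):
    """Return the src value of every <img ...> tag found in the file at `path`.

    `extract_image_urls` is a hypothetical helper name; the UTF-8 encoding
    is an assumption about how the text file was saved.
    """
    urls = []
    with open(path, "r", encoding="utf-8") as fp:
        for line in fp:  # read line by line instead of loading the whole file
            urls.extend(IMG_SRC.findall(line))
    return urls
```

Calling extract_image_urls("./imagefile.txt") on the file from the question would return the URLs as a list, which can then be printed or processed further.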
About the pattern: <img[^>]* src=\"([^\"]*)\"[^>]*>
<img     | matches the characters "<img" literally
[^>]*    | matches any character other than ">", zero or more times (allows other attributes before src)
 src=\"  | matches a space followed by the characters src=" literally
(        | starts the capture group
[^\"]*   | matches any character other than the closing quote, zero or more times
)        | ends the capture group
\"       | matches the closing quote
[^>]*    | matches any character other than ">", zero or more times (allows other attributes after src)
>        | matches the ">" that ends the tag
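The breakdown above can also be kept next to the pattern itself by compiling it with re.VERBOSE, which lets each piece carry its own comment. This is a sketch of an equivalent pattern, not a different technique; the name IMG_SRC is chosen here for illustration. One detail changes: VERBOSE ignores bare whitespace, so the required space before src is written as [ ].

```python
import re

# Hypothetical name; the pattern is the one explained above, written in verbose form.
IMG_SRC = re.compile(r'''
    <img        # the literal characters <img
    [^>]*       # any characters other than >, allowing attributes before src
    [ ]src="    # a space (bracketed, since VERBOSE skips bare whitespace), then src="
    ([^"]*)     # capture group: everything up to the closing quote
    "           # the closing quote
    [^>]*       # any characters other than >, allowing attributes after src
    >           # the > that ends the tag
''', re.VERBOSE)

line = 'something something <img src="https://example.com/img1.jpg" >'
print(IMG_SRC.findall(line))
```

Like the original, this pattern only recognises double-quoted src values that are preceded by a space, so tags written with single quotes or an uppercase SRC would not match.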