How to extract all image URLs from a local text file using Beautiful Soup?

I'm new to Python and Beautiful Soup (BS). I have a text file in which each line has the following format. I want to extract the image URLs from these lines using BS. This is just a text file, not an HTML document.

something something <img src="https://example.com/img1.jpg" >
something else <img src="https://example.com/img2.jpg" >

The following code doesn't do anything and just hangs. How can I fix this?

from bs4 import BeautifulSoup

def readFile(fileName):
    # Parse the whole file as HTML and collect every <img> tag
    with open(fileName, 'r') as fp:
        soup = BeautifulSoup(fp.read(), 'html.parser')
        images = soup.find_all('img')
        print("images: ", images)

        for image in images:
            print(image['src'])

readFile("./imagefile.txt")

Since your input data is not in HTML format, I don't think BeautifulSoup is the way to go, though I will be happy to be wrong about that. I would start with the re module as a first step.

import re

text = '''
something something <img src="https://example.com/img1.jpg" >
something else <img src="https://example.com/img2.jpg" >
'''

for url in re.findall(r"<img[^>]* src=\"([^\"]*)\"[^>]*>", text):
    print(url)

Should give you:

https://example.com/img1.jpg
https://example.com/img2.jpg
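
The question reads from a local file rather than an inline string, so the same pattern can be applied to the file's contents. The following is a minimal sketch, not a definitive implementation: the function name `extract_image_urls` is made up for illustration, and it assumes the file is UTF-8 text with each `<img>` tag contained on a single line, as in the sample data.

```python
import re

# Same pattern as in the answer above, compiled once so it can be reused per line.
IMG_SRC_PATTERN = re.compile(r"<img[^>]* src=\"([^\"]*)\"[^>]*>")


def extract_image_urls(file_name):
    """Return every img src URL found in the file, in order of appearance.

    Hypothetical helper for illustration; not part of the original answer.
    """
    urls = []
    # Assumption: the file is UTF-8 encoded text.
    with open(file_name, "r", encoding="utf-8") as fp:
        # Assumption: no <img> tag is split across two lines.
        for line in fp:
            urls.extend(IMG_SRC_PATTERN.findall(line))
    return urls
```

Calling `extract_image_urls("./imagefile.txt")` on the file from the question should return the two URLs as a list, which can then be printed or processed further. Reading line by line keeps memory use low for large files, at the cost of missing any tag that spans a line break.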

About the pattern: <img[^>]* src=\"([^\"]*)\"[^>]*>

<img    | matches the characters "<img" literally
[^>]*   | matches any character other than ">" (zero or more times), which allows other attributes before src
 src=\" | matches a space followed by the characters src=" literally
(       | starts the capture group
[^\"]*  | matches any character other than the closing quote (zero or more times); this is the URL
)       | ends the capture group
\"      | matches the closing quote
[^>]*   | matches any character other than ">" (zero or more times), which allows other attributes after src
>       | matches the ">" that closes the tag
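
The breakdown above can also be kept next to the pattern itself by using `re.VERBOSE`, which lets a regular expression carry inline comments. This is only a sketch of an equivalent way to write the same pattern; the names `IMG_SRC_VERBOSE` and `sample` are made up for illustration. In verbose mode unescaped whitespace is ignored, so the literal space before `src` is written as `[ ]`.

```python
import re

# Verbose form of the same pattern; each piece carries its own comment.
IMG_SRC_VERBOSE = re.compile(
    r"""
    <img        # the characters "<img" literally
    [^>]*       # any characters other than ">" (attributes before src)
    [ ]src="    # a space, then src=" literally
    (           # start of the capture group
    [^"]*       # any characters other than the closing quote (the URL)
    )           # end of the capture group
    "           # the closing quote
    [^>]*       # any characters other than ">" (attributes after src)
    >           # the ">" that closes the tag
    """,
    re.VERBOSE,
)

# Hypothetical sample line with an extra attribute before src.
sample = 'something <img alt="x" src="https://example.com/img1.jpg" >'
print(IMG_SRC_VERBOSE.findall(sample))  # prints ['https://example.com/img1.jpg']
```

Note that, like the original, this pattern only matches double-quoted `src` values that are preceded by a space.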


 