How to use Beautiful Soup to extract all image URLs from a local text file?

Question

I'm new to Python and Beautiful Soup (BS). I have a text file in which each line has the following format. I want to extract the image URLs from these lines using BS. This is just a text file, not an HTML document.

something something <img src="https://example.com/img1.jpg" >
something else <img src="https://example.com/img2.jpg" >

The following code doesn't do anything and just hangs. How can I fix this?

from bs4 import BeautifulSoup

def readFile(fileName):
    with open(fileName, 'r') as fp:
        soup = BeautifulSoup(fp.read(), 'html.parser')
        images = soup.find_all('img')
        print("images: ", images)

        for image in images:
            print(image['src'])

readFile("./imagefile.txt")

Answer 1

Since your input data is not in HTML format, I don't think BeautifulSoup is the way to go, though I would be happy to be wrong about that. I would start with the re module as a first step.

import re

text = '''
something something <img src="https://example.com/img1.jpg" >
something else <img src="https://example.com/img2.jpg" >
'''

for url in re.findall(r"<img[^>]* src=\"([^\"]*)\"[^>]*>", text):
    print(url)

This should give you:

https://example.com/img1.jpg
https://example.com/img2.jpg
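
The question's data lives in a file rather than in a string, so here is a minimal sketch of applying the same pattern line by line. The helper name extract_image_urls and the UTF-8 encoding are assumptions made for illustration; the demo writes the sample lines to a temporary file so that it runs on its own.

```python
import os
import re
import tempfile

# Same pattern as in the answer, compiled once for reuse.
IMG_SRC = re.compile(r"<img[^>]* src=\"([^\"]*)\"[^>]*>")


def extract_image_urls(file_name):
    """Return every img src value found in the file, in order (hypothetical helper)."""
    urls = []
    # Assumption: the file is UTF-8 encoded.
    with open(file_name, "r", encoding="utf-8") as fp:
        for line in fp:  # iterate lazily so a large file is not read in one go
            urls.extend(IMG_SRC.findall(line))
    return urls


if __name__ == "__main__":
    # Demo: write the sample lines from the question to a temporary file.
    sample = (
        'something something <img src="https://example.com/img1.jpg" >\n'
        'something else <img src="https://example.com/img2.jpg" >\n'
    )
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write(sample)
    try:
        for url in extract_image_urls(tmp.name):
            print(url)
    finally:
        os.remove(tmp.name)
```

Because the file is scanned one line at a time, a tag that is split across two lines would not match. With the real data, a call such as extract_image_urls("./imagefile.txt") would replace the demo.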

About the pattern: <img[^>]* src=\"([^\"]*)\"[^>]*>

<img    | matches the characters "<img" literally
[^>]*   | matches any character other than ">", zero or more times (allows other attributes before src)
 src=\" | matches a space followed by src=" literally
(       | starts the capture group
[^\"]*  | matches any character other than a double quote, zero or more times (the URL)
)       | ends the capture group
\"      | matches the closing quote
[^>]*   | matches any character other than ">", zero or more times (allows other attributes after src)
>       | matches the ">" that closes the tag
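
The breakdown above can also be kept inside the code itself. The sketch below rewrites the same pattern with re.VERBOSE so that each piece carries its own comment; the name IMG_SRC is just an illustrative choice. In verbose mode unescaped whitespace is ignored, so the literal space before src has to be written as "\ ".

```python
import re

# The same pattern as above, spread over several lines with re.VERBOSE.
IMG_SRC = re.compile(
    r"""
    <img        # the characters "<img"
    [^>]*       # anything except ">" (other attributes before src)
    \ src="     # an escaped space, then src="
    ([^"]*)     # capture group: everything up to the closing quote
    "           # the closing quote
    [^>]*       # anything except ">" (other attributes after src)
    >           # the ">" that ends the tag
    """,
    re.VERBOSE,
)

line = 'something something <img src="https://example.com/img1.jpg" >'
match = IMG_SRC.search(line)
if match:
    print(match.group(1))
```

Like the original, this pattern only recognises double-quoted src values that are preceded by a space.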

Solution 1
2 Accepted 2021-12-12 16:16:52
