
Extract text and the associated image from a webpage using BeautifulSoup

I seem to have hit a wall and I am looking for some help/guidance.

I am trying to extract data from an HTML page. I can extract the text or the image path on its own, but not the two together.

Within the HTML file there are multiple occurrences of a heading and its associated content:

Example:

<h2>Builder ind=BOB</h2>

<table border=0 cellpadding=0 cellspacing=0>
<tr>
<td align=left valign=top>
</td>
<td align=left valign=top><br>
<h3>TEST -- TXF 1234 -- 04/01/2020 6:21:42 PM</h3>
<img src="gfx/image117.png" width=997 height=601>

<h2>Builder ind=ROB</h2>

<table border=0 cellpadding=0 cellspacing=0>
<tr>
<td align=left valign=top>
</td>
<td align=left valign=top><br>
<h3>TEST -- EXF 1234 -- 04/01/2020 6:21:42 PM</h3>
<img src="gfx/image118.png" width=997 height=601>

In the example above, I am trying to extract the text contained within the h2 tags and the associated img src attribute, and export them to a CSV file.

The code I have for extracting the image paths:

from bs4 import BeautifulSoup
import re

fname = r'C:\TEMP\PAGE.htm'
with open(fname) as html:
    bs = BeautifulSoup(html, 'html.parser')
# Escape the dot so the pattern matches a literal ".png" at the end of src.
images = bs.find_all('img', {'src': re.compile(r'\.png$')})
for image in images:
    print(image['src'] + '\n')

How would I go about looping through the file, extracting both the heading text and the image path, and exporting them to a file?

In the final output I am trying to achieve the following in a csv file:

  1. ind=BOB,image117.png
  2. ind=ROB,image118.png

The output that I get currently is:

gfx/image117.png

gfx/image118.png

Try this approach:

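from bs4 import BeautifulSoup
import re

fname = r'C:\TEMP\PAGE.htm'
with open(fname) as html:
    bs = BeautifulSoup(html, 'html.parser')
# Escape the dot so the pattern matches a literal ".png" at the end of src.
images = bs.find_all('img', {'src': re.compile(r'\.png$')})
headings = bs.find_all('h2')
# Pair headings and images by position; this assumes one image per h2, in the same order.
for heading, image in zip(headings, images):
    print(heading.text.split(" ")[1] + ", " + image['src'])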

Output:

ind=BOB, gfx/image117.png
ind=ROB, gfx/image118.png
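
A note on the src filter: in a regular expression an unescaped dot matches any character, so a pattern such as '.png' is looser than it looks. The sketch below compares it with an escaped, anchored pattern using only the re module. The second sample path is made up for illustration and does not come from the question's HTML.

```python
import re

# Unescaped dot: "any character followed by png", anywhere in the string.
loose = re.compile('.png')
# Escaped dot plus end anchor: the string must end in a literal ".png".
strict = re.compile(r'\.png$')

# The first path is from the question; the second is a hypothetical counterexample.
samples = ['gfx/image117.png', 'gfx/xpng_preview.gif']

for src in samples:
    print(src, bool(loose.search(src)), bool(strict.search(src)))
```

Both patterns accept gfx/image117.png, but only the loose one also accepts the .gif path, because "xpng" satisfies "any character, then png".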

If you want to store the output in a CSV file, try this approach:

from bs4 import BeautifulSoup
import re
import csv

fname = 'PAGE.htm'
with open(fname) as html:
    bs = BeautifulSoup(html, 'html.parser')
# Escape the dot so the pattern matches a literal ".png" at the end of src.
images = bs.find_all('img', {'src': re.compile(r'\.png$')})
headings = bs.find_all('h2')
# newline='' lets the csv module control line endings (avoids blank rows on Windows).
with open('data.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    # Pair headings and images by position; this assumes one image per h2.
    for heading_tag, image_tag in zip(headings, images):
        heading = heading_tag.text.split(" ")[1]
        image = image_tag['src']
        print(heading, ",", image)
        writer.writerow([heading, image])

Alternatively, use find_next to pair each h2 with the first img that follows it:

from bs4 import BeautifulSoup
html = """
<h2>Builder ind=BOB</h2>

<table border=0 cellpadding=0 cellspacing=0>
<tr>
<td align=left valign=top>
</td>
<td align=left valign=top><br>
<h3>TEST -- TXF 1234 -- 04/01/2020 6:21:42 PM</h3>
<img src="gfx/image117.png" width=997 height=601>

<h2>Builder ind=ROB</h2>

<table border=0 cellpadding=0 cellspacing=0>
<tr>
<td align=left valign=top>
</td>
<td align=left valign=top><br>
<h3>TEST -- EXF 1234 -- 04/01/2020 6:21:42 PM</h3>
<img src="gfx/image118.png" width=997 height=601>
"""

soup = BeautifulSoup(html, 'html.parser')

# find_next returns the first matching tag after the heading in document order.
for item in soup.find_all("h2"):
    print("Text: {}, Image: {}".format(
        item.text, item.find_next("img").get("src")))

Output:

Text: Builder ind=BOB, Image: gfx/image117.png
Text: Builder ind=ROB, Image: gfx/image118.png
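
Neither output above quite matches the CSV rows the question asks for (ind=BOB,image117.png), since the heading still carries "Builder" or the path still carries "gfx/". As a minimal sketch, assuming each h2 is followed by its own img and that the wanted label is the last whitespace-separated word of the heading, the find_next pairing can be combined with csv.writer. The names SAMPLE_HTML and extract_rows are illustrative, and the HTML is a trimmed copy of the question's sample.

```python
import csv
import io
import posixpath

from bs4 import BeautifulSoup

# Trimmed copy of the question's sample markup (tables left unclosed, as in the original).
SAMPLE_HTML = """
<h2>Builder ind=BOB</h2>
<table border=0 cellpadding=0 cellspacing=0>
<tr>
<td align=left valign=top><br>
<h3>TEST -- TXF 1234 -- 04/01/2020 6:21:42 PM</h3>
<img src="gfx/image117.png" width=997 height=601>

<h2>Builder ind=ROB</h2>
<table border=0 cellpadding=0 cellspacing=0>
<tr>
<td align=left valign=top><br>
<h3>TEST -- EXF 1234 -- 04/01/2020 6:21:42 PM</h3>
<img src="gfx/image118.png" width=997 height=601>
"""


def extract_rows(html):
    """Return [label, image file name] pairs, one per h2 that has a following img."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for heading in soup.find_all("h2"):
        img = heading.find_next("img")
        if img is None or not img.get("src"):
            continue  # no image after this heading, so skip it
        # Assumption: the label is the last word of the heading, e.g. "ind=BOB".
        label = heading.get_text(strip=True).split()[-1]
        # Drop the directory part of the URL-style path ("gfx/").
        rows.append([label, posixpath.basename(img["src"])])
    return rows


rows = extract_rows(SAMPLE_HTML)

# Written to an in-memory buffer here; for a real file use
# open('data.csv', 'w', newline='') in place of io.StringIO().
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
print(buffer.getvalue(), end="")
```

One caveat: find_next searches forward through the rest of the document, so a heading with no image of its own would pick up the image belonging to the next heading. If that can happen in the real page, the pairing needs an extra check.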
