I seem to have hit a wall and I am looking for some help/guidance.
I am trying to extract data from an HTML page. I can extract the text or the image file name on its own, but not both together.
The HTML file contains multiple occurrences of a heading followed by its associated content:
Example:
<h2>Builder ind=BOB</h2>
<table border=0 cellpadding=0 cellspacing=0>
<tr>
<td align=left valign=top>
</td>
<td align=left valign=top><br>
<h3>TEST -- TXF 1234 -- 04/01/2020 6:21:42 PM</h3>
<img src="gfx/image117.png" width=997 height=601>
<h2>Builder ind=ROB</h2>
<table border=0 cellpadding=0 cellspacing=0>
<tr>
<td align=left valign=top>
</td>
<td align=left valign=top><br>
<h3>TEST -- EXF 1234 -- 04/01/2020 6:21:42 PM</h3>
<img src="gfx/image118.png" width=997 height=601>
In the example above, I am trying to extract the text contained in the h2 tags together with the src attribute of the associated img tag, and export them to a CSV file.
The code I have so far for extracting the image paths:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

fname = r'C:\TEMP\PAGE.htm'  # raw string so the backslashes are kept literally
with open(fname) as html:
    bs = BeautifulSoup(html, 'html.parser')
# Escape the dot and anchor at the end so only ".png" file names match
images = bs.find_all('img', {'src': re.compile(r'\.png$')})
for image in images:
    print(image['src'] + '\n')
How would I go about looping through the file, extracting both the heading text and the image path, and exporting them to a file?
In the final output I am trying to achieve the following in a csv file:
The output that I get currently is:
gfx/image117.png
gfx/image118.png
Try this approach:
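from bs4 import BeautifulSoup
import re

fname = r'C:\TEMP\PAGE.htm'  # raw string keeps the backslashes literal
with open(fname) as html:
    bs = BeautifulSoup(html, 'html.parser')
# Match src values ending in ".png" (the dot is escaped so it is literal)
images = bs.find_all('img', {'src': re.compile(r'\.png$')})
headings = bs.find_all('h2')
# Pair headings and images by position; assumes one image per heading
for heading, image in zip(headings, images):
    print(heading.text.split(" ")[1] + ", " + image['src'])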
Output:
ind=BOB, gfx/image117.png
ind=ROB, gfx/image118.png
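The split(" ")[1] call above returns the whole ind=BOB token. If only the value after ind= is wanted, a small helper could isolate it. This is a minimal sketch rather than part of the original answer: builder_id is a hypothetical name, and it assumes every heading has the form "Builder ind=NAME" as in the sample HTML.

```python
import re

# Assumed heading format, taken from the sample HTML: "Builder ind=NAME".
_IND_PATTERN = re.compile(r"\bind=(\S+)")


def builder_id(heading_text):
    """Return the value after 'ind=' in a heading, or None if it is absent."""
    match = _IND_PATTERN.search(heading_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    for text in ("Builder ind=BOB", "Builder ind=ROB", "No marker here"):
        print(text, "->", builder_id(text))
```

Returning None for a heading without the marker makes unexpected headings visible instead of raising an IndexError, which is what split(" ")[1] would do on a one-word heading.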
Or, if you want to store the output in a CSV file, try this approach:
from bs4 import BeautifulSoup
import re
import csv

fname = 'PAGE.htm'
with open(fname) as html:
    bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('img', {'src': re.compile(r'\.png$')})
headings = bs.find_all('h2')
# newline='' stops the csv module from writing blank lines between rows on Windows
with open('data.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    # Pair headings and images by position; assumes one image per heading
    for heading_tag, image_tag in zip(headings, images):
        heading = heading_tag.text.split(" ")[1]
        image = image_tag['src']
        print(heading, ",", image)
        writer.writerow([heading, image])
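The CSV-writing step above can also be separated from the parsing so that it is easy to check on its own. The following is a minimal sketch under stated assumptions: write_pairs and the column names "heading" and "image" are hypothetical (the question does not show the desired column layout), and the rows are hard-coded from the sample output.

```python
import csv
import io


def write_pairs(rows, out):
    """Write (heading, image) pairs to an open text stream as CSV."""
    writer = csv.writer(out)
    writer.writerow(["heading", "image"])  # hypothetical column names
    writer.writerows(rows)


if __name__ == "__main__":
    rows = [("ind=BOB", "gfx/image117.png"), ("ind=ROB", "gfx/image118.png")]
    # An in-memory stream is used here for demonstration. For a real file,
    # the csv module documentation recommends open(path, "w", newline="").
    buffer = io.StringIO()
    write_pairs(rows, buffer)
    print(buffer.getvalue())
```

Passing the stream in, rather than a file name, lets the same function write to a file, to standard output, or to a buffer in a test.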
Another option is to pair each h2 with the first img that follows it, using find_next:
from bs4 import BeautifulSoup
html = """
<h2>Builder ind=BOB</h2>
<table border=0 cellpadding=0 cellspacing=0>
<tr>
<td align=left valign=top>
</td>
<td align=left valign=top><br>
<h3>TEST -- TXF 1234 -- 04/01/2020 6:21:42 PM</h3>
<img src="gfx/image117.png" width=997 height=601>
<h2>Builder ind=ROB</h2>
<table border=0 cellpadding=0 cellspacing=0>
<tr>
<td align=left valign=top>
</td>
<td align=left valign=top><br>
<h3>TEST -- EXF 1234 -- 04/01/2020 6:21:42 PM</h3>
<img src="gfx/image118.png" width=997 height=601>
"""
soup = BeautifulSoup(html, 'html.parser')
# For each heading, look up the first <img> that appears after it in the document
for item in soup.find_all("h2"):
    print("Text: {}, Image: {}".format(
        item.text, item.find_next("img").get("src")))
Output:
Text: Builder ind=BOB, Image: gfx/image117.png
Text: Builder ind=ROB, Image: gfx/image118.png
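One caveat with pairing a heading with "the next image": if a heading has no image before the following heading, the lookup would pick up the next section's image. The sketch below shows one way to guard against that. Note that it swaps in the standard library's html.parser instead of BeautifulSoup, so it is an alternative rather than the method used above, and HeadingImagePairer is a hypothetical name.

```python
from html.parser import HTMLParser


class HeadingImagePairer(HTMLParser):
    """Pair each <h2> text with the first <img> src seen before the next <h2>."""

    def __init__(self):
        super().__init__()
        self.pairs = []       # list of [heading_text, image_src or None]
        self._in_h2 = False   # True while inside an <h2> element

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self.pairs.append(["", None])
        elif tag == "img" and self.pairs and self.pairs[-1][1] is None:
            # Only the first image after the most recent heading is recorded.
            self.pairs[-1][1] = dict(attrs).get("src")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2:
            self.pairs[-1][0] += data


if __name__ == "__main__":
    sample = """
    <h2>Builder ind=BOB</h2>
    <h3>TEST -- TXF 1234 -- 04/01/2020 6:21:42 PM</h3>
    <img src="gfx/image117.png" width=997 height=601>
    <h2>Builder ind=ROB</h2>
    <img src="gfx/image118.png" width=997 height=601>
    """
    parser = HeadingImagePairer()
    parser.feed(sample)
    parser.close()
    for heading, src in parser.pairs:
        print(heading, src)
```

A heading without an image keeps None as its second element, so missing images can be detected before the rows are written to a file.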