
Malformed HTML with Python and Beautiful Soup

I'm using Python 3.5 with Beautiful Soup 4.6, Selenium 3.6 and PhantomJS to scrape a site. The script runs on my server, which is located in the US, and the site I want to scrape is German. I have run into a problem: the HTML I download looks like this:

<div class="col-md-40 product-highlights-container"><div class="product-filters"><select class="colorfilter__select"><option value="{&quot;ebootisId&quot;:&quot;HW102581-1&quot;,&quot;color&quot;:&quot;Midnight Black&quot;,&quot;colorCode&quot;:&quot;000000&quot;,&quot;colorGroup&quot;:&quot;Schwarz&quot;,&quot;colorGroupCode&quot;:&quot;000000&quot;,&quot;deliveryTime&quot;:&quot;2-3 Werktage&quot;,&quot;default&quot;:true,&quot;images&quot;:[{&quot;small&quot;:&quot;/img/dist/HW102581-1_ZU102869_S_1.png&quot;,&quot;medium&quot;:&quot;/img/dist/HW102581-1_ZU102869_M_1.png&quot;,&quot;large&quot;:&quot;/img/dist/HW102581-1_ZU102869_L_1.png&quot;}],&quot;storage&quot;:&quot;64&quot;,&quot;tariffs&quot;:{&quot;TF102910&quot;:{&quot;ebootisId&quot;:&quot;TF102910&quot;,&quot;price&quot;:49,&quot;url&quot;:&quot;/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=midnight-black&amp;speicher=64&amp;carrier=vodafone&amp;tarif=comfort-allnet&quot;},&quot;TF101415&quot;:{&quot;ebootisId&quot;:&quot;TF101415&quot;,&quot;price&quot;:49,&quot;url&quot;:&quot;/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=midnight-black&amp;speicher=64&amp;carrier=telekom&amp;tarif=comfort-allnet&quot;}},&quot;stock&quot;:1086,&quot;url&quot;:&quot;/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=midnight-black&amp;speicher=64&amp;carrier=vodafone&amp;tarif=comfort-allnet&quot;,&quot;price&quot;:49,&quot;offer_id&quot;:&quot;5a8bf20d56b4537a4076868a&quot;,&quot;soldout&quot;:false}">Midnight Black</option><option value="{&quot;ebootisId&quot;:&quot;HW102581-2&quot;,&quot;color&quot;:&quot;Arctic Silver&quot;,&quot;colorCode&quot;:&quot;c7ccd0&quot;,&quot;colorGroup&quot;:&quot;Silber&quot;,&quot;colorGroupCode&quot;:&quot;c0c0c0&quot;,&quot;deliveryTime&quot;:&quot;2-3 
Werktage&quot;,&quot;default&quot;:false,&quot;images&quot;:[{&quot;small&quot;:&quot;/img/dist/HW102581-2_ZU102869_S_1.png&quot;,&quot;medium&quot;:&quot;/img/dist/HW102581-2_ZU102869_M_1.png&quot;,&quot;large&quot;:&quot;/img/dist/HW102581-2_ZU102869_L_1.png&quot;}],&quot;storage&quot;:&quot;64&quot;,&quot;tariffs&quot;:{&quot;TF102910&quot;:{&quot;ebootisId&quot;:&quot;TF102910&quot;,&quot;price&quot;:49,&quot;url&quot;:&quot;/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=arctic-silver&amp;speicher=64&amp;carrier=vodafone&amp;tarif=comfort-allnet&quot;},&quot;TF101415&quot;:{&quot;ebootisId&quot;:&quot;TF101415&quot;,&quot;price&quot;:49,&quot;url&quot;:&quot;/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=arctic-silver&amp;speicher=64&amp;carrier=telekom&amp;tarif=comfort-allnet&quot;}},&quot;stock&quot;:503,&quot;url&quot;:&quot;/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=arctic-silver&amp;speicher=64&amp;carrier=vodafone&amp;tarif=comfort-allnet&quot;,&quot;price&quot;:49,&quot;offer_id&quot;:&quot;5a8bf20d56b4537a4076868a&quot;,&quot;soldout&quot;:false}">Arctic Silver</option><option value="{&quot;ebootisId&quot;:&quot;HW102581-3&quot;,&quot;color&quot;:&quot;Orchid Grey&quot;,&quot;colorCode&quot;:&quot;9d9dad&quot;,&quot;colorGroup&quot;:&quot;Grau&quot;,&quot;colorGroupCode&quot;:&quot;dcdcdc&quot;,&quot;deliveryTime&quot;:&quot;2-3 
Werktage&quot;,&quot;default&quot;:false,&quot;images&quot;:[{&quot;small&quot;:&quot;/img/dist/HW102581-3_ZU102869_S_1.png&quot;,&quot;medium&quot;:&quot;/img/dist/HW102581-3_ZU102869_M_1.png&quot;,&quot;large&quot;:&quot;/img/dist/HW102581-3_ZU102869_L_1.png&quot;}],&quot;storage&quot;:&quot;64&quot;,&quot;tariffs&quot;:{&quot;TF102910&quot;:{&quot;ebootisId&quot;:&quot;TF102910&quot;,&quot;price&quot;:49,&quot;url&quot;:&quot;/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=orchid-grey&amp;speicher=64&amp;carrier=vodafone&amp;tarif=comfort-allnet&quot;},&quot;TF101415&quot;:{&quot;ebootisId&quot;:&quot;TF101415&quot;,&quot;price&quot;:49,&quot;url&quot;:&quot;/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=orchid-grey&amp;speicher=64&amp;carrier=telekom&amp;tarif=comfort-allnet&quot;}},&quot;stock&quot;:500,&quot;url&quot;:&quot;/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=orchid-grey&amp;speicher=64&amp;carrier=vodafone&amp;tarif=comfort-allnet&quot;,&quot;price&quot;:49,&quot;offer_id&quot

It is basically one long line of text, which makes it impossible for me to find the tags I am looking for.

If I run it through an online beautifier or split up the lines myself, it works fine, but that's not a viable solution.

I tried the prettify() function from bs4, but that didn't work either.

This is the relevant piece of code:

from bs4 import BeautifulSoup as soup
from selenium import webdriver

# link, filename and path_to_pjs are defined elsewhere in the script
driver = webdriver.PhantomJS(executable_path=path_to_pjs)
driver.get(link)
f = open(filename, "wb")
f.write(driver.page_source.encode('utf-8'))
f.close()
driver.close()
ecj_data = open(filename, 'r', encoding='utf-8').read()
page_soup = soup(ecj_data, "lxml")
page_soup = page_soup.prettify()

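Your code could be changed as follows. Note that prettify() returns a string, so assigning its result back to page_soup replaces the parsed document with plain text. The version below keeps page_soup as it is and writes the prettified HTML to an output file called pretty.html: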

from bs4 import BeautifulSoup
from selenium import webdriver

link = 'https://tarife.mediamarkt.de/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=midnight-black&speicher=64&carrier=vodafone&tarif=comfort-allnet'
filename = 'output.html'

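# Pass executable_path=path_to_pjs if PhantomJS is not on your PATH
driver = webdriver.PhantomJS()
driver.get(link)
html = driver.page_source  # read the rendered page once
driver.quit()              # shut down the PhantomJS process

# Save the raw page exactly as it was downloaded
with open(filename, 'wb') as f_output:
    f_output.write(html.encode('utf-8'))

page_soup = BeautifulSoup(html, 'lxml')

# prettify() returns a string, so write it out instead of reassigning page_soup
with open('pretty.html', 'w', encoding='utf-8') as f_output:
    f_output.write(page_soup.prettify())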

This gives you a <div> that starts:

<div class="col-md-40 product-highlights-container">
 <div class="product-filters">
  <select class="colorfilter__select">
   <option value='{"ebootisId":"HW102581-1","color":"Midnight Black","colorCode":"000000","colorGroup":"Schwarz","colorGroupCode":"000000","deliveryTime":"2-3 Werktage","default":true,"images":[{"small":"/img/dist/HW102581-1_ZU102869_S_1.png","medium":"/img/dist/HW102581-1_ZU102869_M_1.png","large":"/img/dist/HW102581-1_ZU102869_L_1.png"}],"storage":"64","tariffs":{"TF102910":{"ebootisId":"TF102910","price":49,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=midnight-black&amp;speicher=64&amp;carrier=vodafone&amp;tarif=comfort-allnet"},"TF101415":{"ebootisId":"TF101415","price":49,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=midnight-black&amp;speicher=64&amp;carrier=telekom&amp;tarif=comfort-allnet"}},"stock":1075,"url":"/smartphones/samsung/galaxy-s8-inkl-gear-sport?farbe=midnight-black&amp;speicher=64&amp;carrier=vodafone&amp;tarif=comfort-allnet","price":49,"offer_id":"5a8bf20d56b4537a4076868a","soldout":false}'>
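
As a side note, the parser does not care whether the HTML arrives as one long line, so prettifying should not be needed just to find tags. Below is a minimal sketch of pulling the colour options straight out of the parsed document. The html string is a shortened stand-in for driver.page_source, the helper name read_color_options is made up for this example, and "html.parser" is used only so the sketch runs without lxml installed. It assumes each option's value attribute holds valid JSON, as it does in the snippet from the question.

```python
import json

from bs4 import BeautifulSoup

# Shortened stand-in for driver.page_source: everything on a single line,
# with the JSON in each value attribute escaped as &quot; entities.
html = (
    '<div class="product-filters"><select class="colorfilter__select">'
    '<option value="{&quot;color&quot;:&quot;Midnight Black&quot;,'
    '&quot;stock&quot;:1086,&quot;soldout&quot;:false}">Midnight Black</option>'
    '<option value="{&quot;color&quot;:&quot;Arctic Silver&quot;,'
    '&quot;stock&quot;:503,&quot;soldout&quot;:false}">Arctic Silver</option>'
    '</select></div>'
)

# "html.parser" keeps the sketch free of extra dependencies; "lxml" works too.
page_soup = BeautifulSoup(html, "html.parser")


def read_color_options(soup):
    """Return the decoded JSON payload of every colour <option> (hypothetical helper)."""
    options = soup.select("select.colorfilter__select option")
    # The parser has already turned &quot; back into ", so each value is plain JSON.
    return [json.loads(option["value"]) for option in options]


colors = read_color_options(page_soup)
for entry in colors:
    print(entry["color"], entry["stock"], entry["soldout"])

# prettify() returns a str. Keep it under a separate name: if page_soup were
# overwritten, page_soup.find(...) would become str.find, a substring search.
pretty_html = page_soup.prettify()
```

Each entry in colors is an ordinary dict, so fields such as "stock" or "url" can be read directly without any string splitting.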


 