
Beautiful soup not returning expected result

I am using BeautifulSoup to collect information from the website "https://www.yugiohcardguide.com/archetype/abyss-actor.html". The card information is laid out relatively neatly. Below is a picture of the HTML that I am trying to parse.

[Image: screenshot of the page's HTML]

I am trying to get all of the tags that contain the information for a single card in each row.

Below is the code that I used:

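import requests
from bs4 import BeautifulSoup as bs

def get_card_info_from_link(self, link):
    # pre_url is defined elsewhere in my code
    new_link = pre_url + '/' + link  # link to the archetype page
    html = requests.get(new_link).content
    soup = bs(html, 'lxml')
    info_rows = soup.find('tbody').find_all('tr')

    found_cards = []

    # count = 0

    # print every row between separators to inspect what was parsed
    for i in info_rows:
        print('=' * 50)
        print(i)
        print('=' * 50)

        # count += 1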

Here is the link to the output that I am getting. https://drive.google.com/file/d/1J09nhhrfdje-ktxEG3KLcGwK1cR93ZOo/view?usp=sharing

The first few outputs between the equals-sign separators are exactly what I was looking for, but at some point the output stops following that format: a single item contains multiple tags instead of each tag being its own item.

I cannot wrap my head around what the problem is. Maybe I am just overlooking a key detail.

The HTML is broken: some rows have unclosed tags.

<tr class="row2" valign="top">
.....
</a> 
<!-- No </td></tr> -->
<tr class="row2" valign="top">
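
One way to confirm where the markup breaks is to scan it with the standard library's `html.parser` and note every `<tr>` that opens while the previous row is still open. This is only a rough diagnostic sketch: the `UnclosedRowFinder` class and the `sample` string are made up for illustration, and a page with legitimately nested tables would produce false positives.

```python
from html.parser import HTMLParser


class UnclosedRowFinder(HTMLParser):
    """Record the position of every <tr> that opens while another <tr> is still open."""

    def __init__(self):
        super().__init__()
        self.row_open = False
        self.unclosed_at = []  # (line, column) of each offending <tr>

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            if self.row_open:
                # the previous row never saw its </tr>
                self.unclosed_at.append(self.getpos())
            self.row_open = True

    def handle_endtag(self, tag):
        if tag == 'tr':
            self.row_open = False


# Hypothetical sample mimicking the page: the first row is never closed.
sample = (
    '<table><tbody>\n'
    '<tr class="row1" valign="top"><td><a href="a.html">Card A</a>\n'
    '<tr class="row2" valign="top"><td><a href="b.html">Card B</a></td></tr>\n'
    '</tbody></table>'
)

finder = UnclosedRowFinder()
finder.feed(sample)
finder.close()
print(finder.unclosed_at)  # positions of rows that started inside an open row
```

Feeding the real page text to `finder.feed(...)` instead of `sample` should list the lines where a row starts before the previous one was closed.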

There are multiple ways to fix this. First, get the response as text:

html = requests.get(new_link).text # instead of .content

Then fix it using a regex:

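import re

# match any <tr ...> that directly follows </a>, whatever its attributes
# (the rows shown above carry class="row2" before valign="top")
fixed_html = re.sub(r'</a>\s*<tr\b', '</a></td></tr><tr', html)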
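
To see what the regex repair does, here is a minimal sketch that runs on a small hand-written string rather than the live page. The `broken_html` sample and the `close_unterminated_rows` helper are hypothetical, and the pattern assumes that `</a>` followed directly by `<tr` only ever occurs at the broken spots.

```python
import re

# Hypothetical sample mimicking the broken markup: the first row is never closed.
broken_html = (
    '<table><tbody>'
    '<tr class="row1" valign="top"><td><a href="a.html">Card A</a>\n'
    '<tr class="row2" valign="top"><td><a href="b.html">Card B</a></td></tr>'
    '</tbody></table>'
)


def close_unterminated_rows(html: str) -> str:
    """Insert </td></tr> wherever </a> is followed directly by a new <tr>."""
    # the lookahead leaves the following <tr ...> tag untouched
    return re.sub(r'</a>\s*(?=<tr\b)', '</a></td></tr>', html)


fixed_html = close_unterminated_rows(broken_html)
print(fixed_html.count('<tr'), fixed_html.count('</tr>'))  # prints: 2 2
```

Rows that are already closed are left alone, because in those rows `</a>` is followed by `</td>` rather than by `<tr`.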

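Or let a lenient parser repair it, for example html5lib (lxml may also work, although it is the parser the question was already using):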

soup = BeautifulSoup(html,'html5lib') # or lxml
fixed_html = soup.prettify()

Or using tidy:

fixed_html = tidy.parseString(html, show_body_only=True)

Then parse the fixed HTML:

soup = BeautifulSoup(fixed_html,'lxml')
info_rows = soup.find('tbody').find_all('tr')

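Maybe this code will help you. Selenium returns the page source after the browser has parsed it, so the broken markup should already be repaired: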

from bs4 import BeautifulSoup
from selenium import webdriver

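# raw string so the backslashes in the Windows path are not treated as escapes;
# recent Selenium 4 releases expect the path via a Service object instead
driver = webdriver.Chrome(r'C:\chromedriver_win32\chromedriver.exe')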
driver.get('https://www.yugiohcardguide.com/archetype/abyss-actor.html')
html = driver.page_source

soup = BeautifulSoup(html, 'html.parser')
result = soup.find('tbody').find_all('tr')
print(result)
driver.close()
