Beautiful Soup is not returning the expected results

Question

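I am using BeautifulSoup to collect information from the page "https://www.yugiohcardguide.com/archetype/abyss-actor.html". The card information is laid out fairly neatly. Below is a picture of the HTML I am trying to parse.

I am trying to get every tag that holds one card's information, one table row per card.

Here is the code I am using: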

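import requests
from bs4 import BeautifulSoup as bs

def get_card_info_from_link(self, link):
    # pre_url is the site's base URL, defined elsewhere (not shown here)
    new_link = pre_url + '/' + link  # link to the archetype page
    html = requests.get(new_link).content
    soup = bs(html, 'lxml')
    info_rows = soup.find('tbody').find_all('tr')

    found_cards = []

    # Print each parsed row between separators to inspect it
    for row in info_rows:
        print('=' * 50)
        print(row)
        print('=' * 50)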

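Here is a link to the output I get: https://drive.google.com/file/d/1J09nhhrfdje-ktxEG3KLcGwK1cR93ZOo/view?usp=sharing

The first few outputs between the equals-sign separators are exactly what I want. At some point, though, the output stops following that format: a single item contains several row tags instead of each row being its own item.

I cannot work out what the problem is. Maybe I am overlooking a key detail.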

Answer 1

The HTML is broken: some tags are never closed.

<tr class="row2" valign="top">
.....
</a> 
<!-- No </td></tr> -->
<tr class="row2" valign="top">
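To locate where rows like this occur, one option is to scan the raw markup for any `<tr>` that opens while a previous `<tr>` is still open. The sketch below uses only the standard library's `html.parser`. The class name `UnclosedRowFinder`, the helper `find_unclosed_rows`, and the `sample` markup are all hypothetical and only imitate the structure shown above; they are not taken from the real page.

```python
from html.parser import HTMLParser


class UnclosedRowFinder(HTMLParser):
    """Record where a <tr> opens while another <tr> is still open."""

    def __init__(self):
        super().__init__()
        self.open_rows = 0   # number of <tr> tags currently open
        self.problems = []   # (line, column) of each offending <tr>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            if self.open_rows > 0:
                # The previous row was never closed with </tr>.
                self.problems.append(self.getpos())
            self.open_rows += 1

    def handle_endtag(self, tag):
        if tag == "tr" and self.open_rows > 0:
            self.open_rows -= 1


def find_unclosed_rows(html):
    """Return the positions of rows that start inside an unclosed row."""
    finder = UnclosedRowFinder()
    finder.feed(html)
    finder.close()
    return finder.problems


# Made-up sample: the second row is missing its </td></tr>.
sample = (
    '<table><tbody>\n'
    '<tr class="row1" valign="top"><td><a href="#">Card A</a></td></tr>\n'
    '<tr class="row2" valign="top"><td><a href="#">Card B</a>\n'
    '<tr class="row1" valign="top"><td><a href="#">Card C</a></td></tr>\n'
    '</tbody></table>'
)

for line, column in find_unclosed_rows(sample):
    print(f"<tr> opened inside an unclosed row at line {line}, column {column}")
```

Running the same function on the downloaded page text should point at the rows that need repair, assuming the page is broken in the way described here.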

There are several ways to fix it. First, fetch the response as text:

html = requests.get(new_link).text # instead of .content

Then either repair it with a regex:

import re

# Match any <tr ...> that directly follows </a>, with or without a class attribute
fixed_html = re.sub(r'</a>\s+<tr\b', '</a></td></tr><tr', html)
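As a quick check of the regex approach, here is a minimal sketch on a made-up three-row sample. The `broken` string, the `ROW_GAP` pattern name, and the `close_unclosed_rows` helper are hypothetical. The pattern matches any `<tr` that follows `</a>` and whitespace, so it covers rows with and without a `class` attribute; that is an assumption about how the real page is broken.

```python
import re

# Made-up sample: the second row never closes its <td> and <tr>.
broken = (
    '<table><tbody>\n'
    '<tr class="row1" valign="top"><td><a href="#">Card A</a></td></tr>\n'
    '<tr class="row2" valign="top"><td><a href="#">Card B</a>\n'
    '<tr valign="top"><td><a href="#">Card C</a></td></tr>\n'
    '</tbody></table>'
)

# </a>, then whitespace, then the start of the next row.
ROW_GAP = re.compile(r'</a>\s+(<tr\b)')


def close_unclosed_rows(html):
    """Insert the missing </td></tr> before a row that follows </a> directly."""
    return ROW_GAP.sub(r'</a></td></tr>\1', html)


fixed = close_unclosed_rows(broken)

# After the repair, every opening <tr> should have a matching </tr>.
print(fixed.count('<tr'), fixed.count('</tr>'))
```

Rows that are already closed are left alone, because `</a>` is followed there by `</td>` rather than by whitespace and `<tr`.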

or let lxml or html5lib repair it:

soup = BeautifulSoup(html,'html5lib') # or lxml
fixed_html = soup.prettify()

or use tidy:

fixed_html = tidy.parseString(html, show_body_only=True)

Then parse the repaired HTML:

soup = BeautifulSoup(fixed_html,'lxml')
info_rows = soup.find('tbody').find_all('tr')

Answer 2

Maybe this code will help you:

from bs4 import BeautifulSoup
from selenium import webdriver

# Raw string so the backslashes in the Windows path are not treated as escapes
driver = webdriver.Chrome(r'C:\chromedriver_win32\chromedriver.exe')
driver.get('https://www.yugiohcardguide.com/archetype/abyss-actor.html')
html = driver.page_source

soup = BeautifulSoup(html, 'html.parser')
result = soup.find('tbody').find_all('tr')
print(result)
driver.close()

2 answers

Answer 1
1 accepted 2021-01-31 09:04:54

Answer 2
0 2021-01-31 08:59:18
