I was using BeautifulSoup to collect card information from the website "https://www.yugiohcardguide.com/archetype/abyss-actor.html". The card information is laid out fairly neatly; below is a picture of the HTML that I was trying to parse.
For each row, I am trying to get all of the tags that contain the information for a single card.
Below is the code that I used:
import requests
from bs4 import BeautifulSoup as bs

def get_card_info_from_link(self, link):
    new_link = pre_url + '/' + link  # link to the archetype page
    html = requests.get(new_link).content
    soup = bs(html, 'lxml')
    info_rows = soup.find('tbody').find_all('tr')
    found_cards = []
    # count = 0
    for i in info_rows:
        print('=' * 50)
        print(i)
        print('=' * 50)
        # count += 1
Here is the link to the output that I am getting. https://drive.google.com/file/d/1J09nhhrfdje-ktxEG3KLcGwK1cR93ZOo/view?usp=sharing
The first couple of outputs between the equal-sign separators are exactly what I was looking for, but at some point the output no longer follows that format: instead of each tag being its own item, a single item contains the tags for multiple rows.
I cannot wrap my head around what the problem is. Maybe I am just overlooking a key detail.
The HTML on that page is broken: it has unclosed tags. For example:
<tr class="row2" valign="top">
.....
</a>
<!-- No </td></tr> -->
<tr class="row2" valign="top">
There are multiple ways to fix this. First fetch the page as text:
html = requests.get(new_link).text  # instead of .content
Then repair it with a regex. Note that the row tags on this page carry a class attribute (`<tr class="row2" valign="top">`), so match any `<tr ...>` rather than a literal `<tr valign="top"`:
import re
fixed_html = re.sub(r'</a>\s*(<tr[^>]*>)', r'</a></td></tr>\1', html)
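The regex repair can be sketched end to end on a made-up fragment that mirrors the broken structure quoted above (the card names and hrefs here are invented for illustration):

```python
import re

# Made-up fragment mirroring the broken markup: the first row is
# missing its closing </td></tr> before the next <tr> begins.
broken = (
    '<table><tbody>'
    '<tr class="row2" valign="top"><td><a href="card1.html">Card 1</a>\n'
    '<tr class="row2" valign="top"><td><a href="card2.html">Card 2</a></td></tr>'
    '</tbody></table>'
)

# Insert the missing </td></tr> wherever an </a> is followed
# directly by a new <tr ...> start tag.
fixed = re.sub(r'</a>\s*(<tr[^>]*>)', r'</a></td></tr>\1', broken)

print(fixed.count('</tr>'))  # 2 closing row tags now, instead of 1
```

The capture group keeps whatever attributes the `<tr>` happens to have, so the same substitution works whether or not a row carries `class="row2"`.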
Or hand the raw HTML to a more forgiving parser. html5lib repairs broken markup the same way a browser does (lxml also tolerates bad markup, but as you saw it merged the rows here):
soup = BeautifulSoup(html, 'html5lib')
fixed_html = soup.prettify()
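A minimal sketch of the html5lib approach, again on an invented fragment with the same defect (requires `pip install html5lib` alongside beautifulsoup4):

```python
from bs4 import BeautifulSoup

# Invented fragment with the same defect: no </td></tr> before the second row.
broken = (
    '<table>'
    '<tr valign="top"><td><a href="card1.html">Card 1</a>'
    '<tr valign="top"><td><a href="card2.html">Card 2</a></td></tr>'
    '</table>'
)

# html5lib rebuilds the tree the way a browser would: it closes the
# dangling cell and row, and wraps the rows in an implied <tbody>.
soup = BeautifulSoup(broken, 'html5lib')
rows = soup.find('tbody').find_all('tr')
print(len(rows))  # each card row is now a separate <tr>
```

Because html5lib already produces a repaired tree, you can work with this `soup` directly instead of prettifying and re-parsing.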
Or use HTML Tidy (for example via the uTidylib binding; note that parseString returns a document object, so convert it to a string before re-parsing):
import tidy
fixed_html = str(tidy.parseString(html, show_body_only=True))
Then parse the fixed HTML:
soup = BeautifulSoup(fixed_html,'lxml')
info_rows = soup.find('tbody').find_all('tr')
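Once the rows are separated, pulling per-card fields out of each `<tr>` is straightforward. A hypothetical sketch — the sample table, column layout, and field names below are assumptions for illustration, not taken from the real page:

```python
from bs4 import BeautifulSoup

# Well-formed sample standing in for the repaired page. Assumed layout:
# first cell holds the card-name link, later cells hold other attributes.
fixed_html = (
    '<table><tbody>'
    '<tr><td><a href="abyss-actor-superstar.html">Abyss Actor - Superstar</a></td>'
    '<td>DARK</td><td>Fiend</td></tr>'
    '<tr><td><a href="abyss-actor-leading-lady.html">Abyss Actor - Leading Lady</a></td>'
    '<td>DARK</td><td>Fiend</td></tr>'
    '</tbody></table>'
)

# html.parser suffices here because the sample is already well-formed.
soup = BeautifulSoup(fixed_html, 'html.parser')
found_cards = []
for row in soup.find('tbody').find_all('tr'):
    cells = row.find_all('td')
    link = cells[0].find('a')
    found_cards.append({
        'name': link.get_text(strip=True),
        'href': link['href'],
        'extra': [c.get_text(strip=True) for c in cells[1:]],
    })

print(found_cards[0]['name'])  # Abyss Actor - Superstar
```

Each dictionary in `found_cards` then corresponds to exactly one card row, which is the structure the question was after.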
Maybe this code will help you:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome(r'C:\chromedriver_win32\chromedriver.exe')  # raw string so the backslashes are not treated as escapes
driver.get('https://www.yugiohcardguide.com/archetype/abyss-actor.html')
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
result = soup.find('tbody').find_all('tr')
print(result)
driver.close()