
Unable to extract number & text separately from html

From the HTML below I want to get the number and the text separately. I can get the number, but extracting the text raises the error shown below. (Note: this runs inside a for loop. It works for the few links where split(b'.')[1] finds a match; when that index does not exist, it raises the error.)

Error:

Traceback (most recent call last):
  File "C:/Users/Computers Zone/Google Drive/Python/SANDWICHTRY.py", line 49, in <module>
    sandwich=soup.find('h1',{'class':'headline'}).encode_contents().strip().split(b'.')[1].decode("utf-8")
IndexError: list index out of range

HTML code:

<h1 class="headline ">1. Old Oak Tap BLT</h1>

My code:

soup=BeautifulSoup(pages,'lxml').find('div',{'id':'page'})
rank=soup.find('h1',{'class':'headline'}).encode_contents().strip().split(b'.')[0].decode("utf-8")
print (rank)
sandwich=soup.find('h1',{'class':'headline'}).encode_contents().strip().split(b'.')[1].decode("utf-8")
print(sandwich)

This error occurs when there is no . in your headline string, i.e. the split produces only one element and the second element does not exist.
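A minimal sketch of the failure, using plain bytes instead of a parsed page. The second headline (without a rank prefix) is a hypothetical example, not taken from the question:

```python
# A headline with a dot splits into two parts; one without a dot yields a single part.
with_dot = b"1. Old Oak Tap BLT".split(b".")
without_dot = b"Old Oak Tap BLT".split(b".")  # hypothetical headline with no "N." prefix

print(len(with_dot))     # 2
print(len(without_dot))  # 1

# Indexing [1] on the one-element list is what raises the IndexError.
try:
    without_dot[1]
    raised = False
except IndexError:
    raised = True

print(raised)
```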

To solve this, get the result and split the string, but do not assume that there are always two elements:

from bs4 import BeautifulSoup

pages = '<h1 class="headline">1. Old Oak Tap BLT</h1>'

soup = BeautifulSoup(pages, 'lxml')
titles = soup.find('h1', {'class': 'headline'}).encode_contents().split(b'.')

for text in titles:  # go through all existing list elements
    print(text.decode("utf-8").strip())

Or check that the list has exactly two elements before reading them, e.g.:

if len(titles) == 2:
    rank = titles[0].decode("utf-8").strip()
    sandwich = titles[1].decode("utf-8").strip()
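As an alternative sketch (not part of the original answer), the headline text could be split with str.partition, which never raises and only splits on the first dot. The helper name split_headline and the sample headlines other than "1. Old Oak Tap BLT" are hypothetical; the input string is assumed to come from something like soup.find('h1', {'class': 'headline'}).get_text().

```python
def split_headline(text):
    """Split '1. Old Oak Tap BLT' into ('1', 'Old Oak Tap BLT').

    If there is no numeric 'N.' prefix, return ('', text) instead of raising.
    """
    text = text.strip()
    # partition splits on the first dot only and always returns three parts
    head, sep, tail = text.partition(".")
    if sep and head.strip().isdigit():
        return head.strip(), tail.strip()
    return "", text


print(split_headline("1. Old Oak Tap BLT"))
print(split_headline("Old Oak Tap BLT"))     # hypothetical: no rank prefix
print(split_headline("12. St. Louis Club"))  # hypothetical: name contains a dot
```

Unlike the len(titles) == 2 check, this also handles a name that itself contains a dot, since only the first dot is treated as the separator.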

