
Python parse table from HTML using BeautifulSoup

I am trying to extract the tables from multiple HTML files. Ideally, I would end up with the rows and columns in a list so that I can process them further. I am new to BeautifulSoup and cannot get it working. I think the main problem occurs when the function returns None, so the result cannot be processed further. I tried if statements, but that did not help. My current code, followed by the error it raises:

from bs4 import BeautifulSoup
table_dict = {}
for filename, text in tqdm(lowercase_dict.items()):
    soup = BeautifulSoup(text, "lxml")
    table = soup.find('table')
    table_body = table.find('tbody')
    if table_body is not None:
        tables = table_body

    rows = tables.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele])

    table_dict[filename] = cols
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-304-14ade2e7b2ac> in <module>()
      7         tables = table_body
      8 
----> 9     rows = tables.find_all('tr')
     10     for row in rows:
     11         cols = row.find_all('td')

AttributeError: 'str' object has no attribute 'find_all'


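According to your error message, the problem is that the variable tables is a string. In your loop, tables is only assigned when a 'tbody' is found, so when there is none it presumably keeps a string value that was assigned somewhere outside the code shown. Try it without using 'tbody', and skip files that contain no table at all: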

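for filename, text in tqdm(lowercase_dict.items()):
    soup = BeautifulSoup(text, "lxml")
    table = soup.find('table')
    if table is None:  # this file contains no <table>, so skip it
        continue
    # find_all searches all descendants, so rows inside <tbody> are found too
    rows = table.find_all('tr')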
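
To round this off, here is a minimal sketch of how the rows could be collected per file once 'tbody' is out of the picture. The helper name extract_table_rows and the sample_files dictionary are made up for illustration (sample_files stands in for your lowercase_dict), and the sketch uses the built-in "html.parser" instead of "lxml" only so that it runs without an extra dependency.

```python
from bs4 import BeautifulSoup


def extract_table_rows(html):
    """Return the first table in `html` as a list of rows (lists of cell strings)."""
    # Assumption: "html.parser" is used here; the question uses "lxml".
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        # No <table> in this document: return an empty list rather than None.
        return []
    rows = []
    # find_all searches all descendants, so rows are found with or without <tbody>.
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        if cells:
            rows.append(cells)
    return rows


# Hypothetical stand-in for the question's lowercase_dict (filename -> HTML text).
sample_files = {
    "with_tbody.html": "<table><tbody><tr><td>a</td><td>b</td></tr></tbody></table>",
    "no_tbody.html": "<table><tr><th>x</th></tr><tr><td> 1 </td></tr></table>",
    "no_table.html": "<p>no table here</p>",
}

# Store every row of each file, not just the cells of the last row.
table_dict = {name: extract_table_rows(text) for name, text in sample_files.items()}
```

Unlike the loop in the question, this stores the full list of rows for each file instead of only the last row's cells, and it keeps empty cells so that columns stay aligned. Files without a table map to an empty list, so later processing does not have to check for None.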
