Parsing tables from HTML with BeautifulSoup in Python

Question

I am trying to extract the tables from multiple HTML files. Ideally, I would end up with the rows and columns in a list so I can process them further. I am new to BeautifulSoup and cannot get it working. I think the main problem occurs when find returns None, so the result cannot be processed further. I tried if statements, but this did not help. My code as it is right now:

from bs4 import BeautifulSoup
table_dict = {}
for filename, text in tqdm(lowercase_dict.items()):
    soup = BeautifulSoup(text, "lxml")
    table = soup.find('table')
    table_body = table.find('tbody')
    if table_body is not None:
        tables = table_body

    rows = tables.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele])

    table_dict[filename] = cols

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-304-14ade2e7b2ac> in <module>()
      7         tables = table_body
      8 
----> 9     rows = tables.find_all('tr')
     10     for row in rows:
     11         cols = row.find_all('td')

AttributeError: 'str' object has no attribute 'find_all'


Answer 1

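According to your error message, the problem is that the variable tables is a string when find_all is called on it. In your code, tables is only assigned when table_body is not None, so for a table without a tbody element the name still holds whatever it was set to earlier, which here appears to be a string. Try it without using 'tbody' and call find_all on the table directly: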

for filename, text in tqdm(lowercase_dict.items()):
    soup = BeautifulSoup(text, "lxml")
    table = soup.find('table')
    rows = table.find_all('tr')
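
The snippet above still fails if a file contains no table at all, because soup.find('table') then returns None. A minimal sketch of a more defensive version follows. It is not the original poster's code: the helper name extract_table_rows and the sample lowercase_dict contents are made up for illustration, the built-in "html.parser" is used instead of "lxml" so that no extra parser is needed, tqdm is left out, and th cells are collected alongside td cells.

```python
from bs4 import BeautifulSoup


def extract_table_rows(html, parser="html.parser"):
    """Return the first table in `html` as a list of rows (lists of cell strings).

    Hypothetical helper for illustration; returns [] when no table is found.
    """
    soup = BeautifulSoup(html, parser)
    table = soup.find("table")
    if table is None:
        # No <table> in this document: nothing to parse.
        return []

    rows = []
    # find_all searches all descendants, so rows are found with or without <tbody>.
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        cells = [c for c in cells if c]  # drop empty cells, as in the question
        if cells:
            rows.append(cells)
    return rows


# Made-up sample input standing in for the question's lowercase_dict.
lowercase_dict = {
    "with_table.html": "<table><tr><td>a</td><td>b</td></tr>"
                       "<tr><td>c</td><td> d </td></tr></table>",
    "no_table.html": "<p>no table here</p>",
}

# Store every row per file, not just the cells of the last row.
table_dict = {name: extract_table_rows(text) for name, text in lowercase_dict.items()}
```

Each value in table_dict is then a list of rows, and files without a table map to an empty list rather than raising an AttributeError.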

Solution 1
0 Accepted 2019-07-18 21:18:57
