Getting text from HTML using Python

Question

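I have HTML data, and I want to get all the text between the <p> tags and put it into a DataFrame for further processing.

But I only want the text in <p> tags that sit between these tags: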

            <div class="someclass" itemprop="text">
                    <p>some text</p>
            </div>

Using BeautifulSoup I can get the text between all the <p> tags easily enough. But as I said, I don't want it unless it is between these tags.

Answer 1

If you only want the text associated with a specific class, you can use BeautifulSoup's attrs argument to specify that class:

html = '''<div class="someclass" itemprop="text">
                    <p>some text</p>
            </div>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

tags = soup.find_all('div', attrs={'class': 'someclass'})

for tag in tags:
    print(tag.text.strip())

Output:

some text
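
Note that tag.text returns the text of everything inside the div, not only its <p> children. A minimal sketch, assuming the goal is only the <p> text as the question describes, is to search for p inside each matched div. The sample HTML and the paragraphs list are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one matching div with mixed children, plus an unrelated <p>
html = '''<div class="someclass" itemprop="text">
            <p>some text</p>
            <span>not this text</span>
          </div>
          <p>outside the div</p>'''

soup = BeautifulSoup(html, 'html.parser')

paragraphs = []
for div in soup.find_all('div', attrs={'class': 'someclass'}):
    # Only look at <p> tags nested inside the matched div
    for p in div.find_all('p'):
        paragraphs.append(p.get_text(strip=True))

print(paragraphs)
```

The span and the top-level <p> are skipped because the inner find_all('p') only searches within each matched div.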

Answer 2

If you need a table-specific solution, I would try something like this (if you don't, the other answer is a better fit):

from bs4 import BeautifulSoup

# browser is assumed to be an existing Selenium WebDriver instance
innerHTML = browser.execute_script("return document.body.innerHTML")
# Pass the string directly: in Python 3, str() on encoded bytes would yield "b'...'"
soup = BeautifulSoup(innerHTML, 'lxml')  # the 'lxml' parser requires the lxml package

# Identify the table that will contain your <div> tags by its class
table = soup.find('table', attrs={'class': 'class_name_of_table_here'})
table_body = table.find('tbody')
divs = table_body.find_all('div', attrs={'class': 'someclass'})

# Print the text of every matching div, not just the last one
for div in divs:
    print(div.text.strip())
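
The snippet above depends on a live browser object (presumably a Selenium WebDriver). A self-contained sketch of the same table-scoped lookup on a static string follows. The HTML, the results class name, and the texts_in_table helper are hypothetical, and it uses the built-in html.parser so that lxml is not needed. It also guards against find() returning None when the table is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one matching div inside the table, one outside it
html = '''<table class="results">
  <tbody>
    <tr><td><div class="someclass"><p>inside table</p></div></td></tr>
  </tbody>
</table>
<div class="someclass"><p>outside table</p></div>'''


def texts_in_table(markup, table_class, div_class):
    """Return the text of every div with div_class inside the table with table_class."""
    soup = BeautifulSoup(markup, 'html.parser')
    table = soup.find('table', attrs={'class': table_class})
    if table is None:
        # find() returns None when nothing matches, so stop early
        return []
    divs = table.find_all('div', attrs={'class': div_class})
    return [div.get_text(strip=True) for div in divs]


found = texts_in_table(html, 'results', 'someclass')
missing = texts_in_table(html, 'no_such_table', 'someclass')
print(found)
print(missing)
```

Searching from the table element rather than from the whole soup is what keeps the div outside the table out of the result.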

Answer 3

If you want to select the p whose parent is a div with class someclass, you can use a CSS selector:

html = '''<div class="someclass" itemprop="text">
            <p>some text</p>
            <span>not this text</span>   
          </div>
          <div class="someclass" itemprop="text">
            <div>not this text</div>   
          </div>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
p = soup.select_one('div.someclass p') # or select()
print(p.text)
# some text
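
Since the question asks for the text in a DataFrame, here is a sketch that combines select() with pandas. It assumes pandas is installed. The column name text and the extra sample paragraphs are made up for illustration. The > combinator restricts matches to <p> tags that are direct children of the div, and the [itemprop="text"] part additionally checks the attribute shown in the question.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample: two matching divs and one <p> outside any of them
html = '''<div class="someclass" itemprop="text">
            <p>some text</p>
            <span>not this text</span>
          </div>
          <div class="someclass" itemprop="text">
            <p>more text</p>
          </div>
          <p>not this text either</p>'''

soup = BeautifulSoup(html, 'html.parser')

# select() returns every match, unlike select_one(), which returns only the first
matches = soup.select('div.someclass[itemprop="text"] > p')
rows = [p.get_text(strip=True) for p in matches]

# One row per matching paragraph
df = pd.DataFrame({'text': rows})
print(df)
```

From here the DataFrame can be filtered or joined like any other, which is the further processing the question mentions.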
