Parsing w/ Case-Sensitive Text/Tags

Question

I'm parsing text with BeautifulSoup and want to return the tags under a parent tag. However, across three different documents, the capitalization of the 'desired data set' is inconsistent. See below:

<td class="pl "...-unimportant bits of script here-...;>Desired Data Set...</td>

and

<td class="pl "...-unimportant bits of script here-...;>Desired data set...</td>

and

<td class="pl "...-unimportant bits of script here-...;>desired data set...</td>

This is my code thus far:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(data.text, 'lxml')

filenames = ['Desired Data Set','desired data set','Desired data set']

for filename in filenames:
    for item in soup.select('filename:contains("' + filename + '")'):
        for td in item.find('td', text=filename).parent.find_all('td'):
            data = [td.text.strip()]
            print(data)

...and it works.

However, as I start to work with larger data sets, I'm sure there will be even more inconsistencies. Even though the above approach works, it's 'hacky' and neither efficient nor prudent. I would like to use a single filename for all desired data sets.

I've tried to lowercase the entire soup using lower(), but it throws a NoneType error.

Answer 1

You can use the string argument of the find_all() method:

from bs4 import BeautifulSoup

data = '''<table><tr><td class="pl ">Desired Data Set...</td>
<td class="pl ">Desired data set...</td>
<td class="pl ">desired data set...</td>
<td class="pl ">Something else</td>
</tr></table>
'''

soup = BeautifulSoup(data, 'lxml')

# Guard against None: a cell that contains nested tags has no single string.
for td in soup.find_all('td', string=lambda t: t is not None and 'desired data set' in t.lower()):
    print(td)

Prints:

<td class="pl">Desired Data Set...</td>
<td class="pl">Desired data set...</td>
<td class="pl">desired data set...</td>
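
A compiled regular expression is another way to express the same case-insensitive match, and it can be combined with the parent lookup from the question. The following is only a minimal sketch, assuming the matching cell sits in a <tr> whose other cells are the ones wanted. The sample table, the helper name cells_in_matching_rows, and the use of the built-in 'html.parser' (instead of 'lxml') are illustrative assumptions, not part of the original post:

```python
import re
from bs4 import BeautifulSoup

# Made-up sample markup: a label cell followed by value cells in the same row.
html = '''<table>
<tr><td class="pl ">Desired Data Set</td><td>1</td><td>2</td></tr>
<tr><td class="pl ">desired data set</td><td>3</td></tr>
<tr><td class="pl ">Something else</td><td>4</td></tr>
</table>'''

def cells_in_matching_rows(soup, label):
    """Return the stripped text of every <td> in each row whose label cell
    contains `label`, ignoring case. (Hypothetical helper for illustration.)"""
    pattern = re.compile(re.escape(label), re.IGNORECASE)
    rows = []
    for td in soup.find_all('td', string=pattern):
        row = td.find_parent('tr')
        rows.append([cell.get_text(strip=True) for cell in row.find_all('td')])
    return rows

soup = BeautifulSoup(html, 'html.parser')
rows = cells_in_matching_rows(soup, 'desired data set')
for row in rows:
    print(row)
```

With this approach a single label covers every capitalization, so the list of filename variants from the question is no longer needed.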

Answer 2

soup = BeautifulSoup(data.text.lower(), 'lxml') may be a 'hacky' way to solve the problem, but for my specific example, it works.
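
The soup.lower() call in the question most likely fails because BeautifulSoup treats an unknown attribute such as soup.lower as a search for a <lower> tag, which returns None. Lowercasing the raw markup before parsing avoids that. The trade-off is that the original capitalization of every cell is lost. A minimal sketch of that trade-off, using made-up sample markup and the built-in 'html.parser' rather than 'lxml' (both are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Made-up sample markup for illustration.
html = '<table><tr><td class="pl ">Desired Data Set</td><td>Value A</td></tr></table>'

# Answer 2's approach: lowercase the markup first, then match an exact lowercase string.
lowered = BeautifulSoup(html.lower(), 'html.parser')
hit = lowered.find('td', string='desired data set')
lowered_cells = [td.get_text(strip=True) for td in hit.find_parent('tr').find_all('td')]

# Case-insensitive matching on the untouched markup keeps the original casing.
original = BeautifulSoup(html, 'html.parser')
hit = original.find('td', string=lambda s: s is not None and s.lower() == 'desired data set')
original_cells = [td.get_text(strip=True) for td in hit.find_parent('tr').find_all('td')]

print(lowered_cells)
print(original_cells)
```

If the extracted values are themselves case-sensitive (names, codes, IDs), matching case-insensitively on the unmodified markup is probably the safer choice.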

Answer 1
0 2019-06-15 21:21:39

Answer 2
0 2019-06-16 16:21:56