How to scrape data from text with nested tags?

Question

I am scraping a dictionary website and want to get the English translation of a word. I am using soup.find_all() to find the second instance of a tag in the page source, but it returns a long object because the tag contains nested tags:

from bs4 import BeautifulSoup

# r is assumed to be the HTTP response for the dictionary page (fetched earlier, not shown)
soup = BeautifulSoup(r.text, 'html.parser')
soup.find_all('td', attrs={'class':'ToWrd'})[1]

It returns:

<td class="ToWrd">pupil <em class="tooltip POS2">n<span><i>noun</i>: Refers to person, place, thing, quality, etc.</span></em></td>

But I am only interested in "pupil", which is the meaning of the word I am looking up on that dictionary website. Can anyone help me extract just this word?

Please note that I don't want to use a numpy or pandas function, because the code does not have these dependencies and I don't want to add them. For example, I am not looking for this solution:

pd.DataFrame(soup.find_all('td', attrs={'class':'ToWrd'})[1])[0][0]

which returns:

'pupil '

Answer 1

How about using a regex:

import re

valid = re.compile(r'<td class="ToWrd">(\w+) <em class="tooltip POS2">')
print(valid.match('<td class="ToWrd">pupil <em class="tooltip POS2">n<span><i>noun</i>: Refers to person, place, thing, quality, etc.</span></em></td>').group(1))

returns

pupil

The example above works only if every cell has

<td class="ToWrd">

before and

<em class="tooltip POS2">

after the word you want, and only if that word is a single word followed by a space, since the pattern captures (\w+) and then expects a space. You can adjust the regex accordingly.
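As a rough sketch of one such adjustment (not the answerer's own code), the pattern below captures everything up to the next tag instead of a single \w+ word, so it also tolerates multi-word entries and a missing space before <em>. The sample markup is hypothetical, modeled on the cell from the question, and the pattern still assumes the class attribute is written exactly as class="ToWrd".

```python
import re

# Sample markup modeled on the cell from the question; the second cell is a
# hypothetical multi-word entry with no space before <em>.
html = (
    '<td class="ToWrd">pupil <em class="tooltip POS2">n</em></td>'
    '<td class="ToWrd">ice cream<em class="tooltip POS2">n</em></td>'
)

# Capture everything up to the next tag, trimming surrounding whitespace.
pattern = re.compile(r'<td class="ToWrd">\s*([^<]+?)\s*<')

words = pattern.findall(html)
print(words)  # ['pupil', 'ice cream']
```

A regex like this remains sensitive to attribute order and quoting in the page source, which is why the parser-based approaches are usually more robust.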

Answer 2

There are different approaches to reach your goal. The simplest was mentioned by @Tim Roberts, but be aware that it only works if the translation is a single word:

soup.find_all('td', attrs={'class':'ToWrd'})[1].text.split()[0]
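To make the single-word limitation concrete, here is a small sketch using hypothetical sample markup modeled on the cell from the question (the "ice cream" cell and the shortened <em> content are assumptions for illustration). It uses html.parser so no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modeled on the cell shown in the question.
html = (
    '<table><tr>'
    '<td class="ToWrd">pupil <em class="tooltip POS2">n</em></td>'
    '<td class="ToWrd">ice cream <em class="tooltip POS2">n</em></td>'
    '</tr></table>'
)

soup = BeautifulSoup(html, 'html.parser')
cells = soup.find_all('td', attrs={'class': 'ToWrd'})

# split() breaks on any whitespace, so only the first word survives.
first_words = [cell.text.split()[0] for cell in cells]
print(first_words)  # ['pupil', 'ice']
```

The second entry is truncated to 'ice', which is the case the alternatives are meant to handle.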

An alternative that works with single words, compound nouns, and multiple words is stripped_strings:

list(soup.find_all('td', attrs={'class':'ToWrd'})[1].stripped_strings)[0]

The same can be done by combining get_text() with parameters and split(), but I prefer stripped_strings:

soup.find_all('td', attrs={'class':'ToWrd'})[1].get_text('|',strip=True).split('|')[0]
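It may help to see the intermediate string that get_text() produces before the split. The sketch below assumes a hypothetical "ice cream" cell with the same nested markup as in the question and uses html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical cell with the same nested structure as in the question.
html = (
    '<td class="ToWrd">ice cream <em class="tooltip POS2">n<span><i>noun</i>'
    ': Refers to person, place, thing, quality, etc.</span></em></td>'
)

soup = BeautifulSoup(html, 'html.parser')
td = soup.find('td', attrs={'class': 'ToWrd'})

# Each text node is stripped, then the pieces are joined with '|'.
joined = td.get_text('|', strip=True)
print(joined)  # ice cream|n|noun|: Refers to person, place, thing, quality, etc.

# The first piece is the text that precedes the nested <em>.
word = joined.split('|')[0]
print(word)  # ice cream
```

One caveat with this variant: if the text itself could contain the separator character, the split would cut it short, so pick a separator that cannot occur in the data.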

Note: If there is only one <td> with that class, use find() instead of find_all().
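A minimal sketch of the find() variant, assuming the page contains a single matching cell (the markup is a shortened, hypothetical version of the one in the question). Unlike find_all(), find() returns None when nothing matches, so a guard avoids an AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical page with exactly one matching cell.
html = '<table><tr><td class="ToWrd">pupil <em class="tooltip POS2">n</em></td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')

# find() returns the first match, or None if there is none.
td = soup.find('td', attrs={'class': 'ToWrd'})

# stripped_strings is a generator; next() takes its first item without
# building the whole list, and the default covers a cell with no text.
word = next(td.stripped_strings, None) if td is not None else None
print(word)  # pupil

# A class that does not occur in the markup yields None.
missing = soup.find('td', attrs={'class': 'NoSuchClass'})
print(missing)  # None
```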

Example

This extracts single words as well as compound nouns / multiple words:

from bs4 import BeautifulSoup

html = '''
<td class="ToWrd">pupil<em class="tooltip POS2">n<span><i>noun</i>: Refers to person, place, thing, quality, etc.</span></em></td>
<td class="ToWrd">ice cream<em class="tooltip POS2">n<span><i>noun</i>: Refers to person, place, thing, quality, etc.</span></em></td>
'''

# the 'lxml' parser requires the lxml package to be installed
soup = BeautifulSoup(html, 'lxml')

# the first stripped string of each cell is the text before the nested <em>
[list(w.stripped_strings)[0] for w in soup.find_all('td', attrs={'class':'ToWrd'})]

Output

['pupil', 'ice cream']

Answer 1
0 2022-01-21 21:04:37

Answer 2
0 2022-01-21 21:14:31