How to scrape data from text that has nested tags?

I am scraping a dictionary website and want to get the English translation of a word. I am using soup.find_all() to find the second instance of a tag in the page source, but the function returns a long object because the tags are nested:

from bs4 import BeautifulSoup
# r: the HTTP response for the dictionary page (fetched earlier, not shown)
soup = BeautifulSoup(r.text, 'html.parser')
soup.find_all('td', attrs={'class':'ToWrd'})[1]

It returns:

<td class="ToWrd">pupil <em class="tooltip POS2">n<span><i>noun</i>: Refers to person, place, thing, quality, etc.</span></em></td>

But I am only interested in "pupil", which is the translation of the word I looked up on that dictionary website. Can anyone help me extract just this word?

Please note that I don't want to use a numpy or pandas function, because the code does not have these dependencies and I don't want to add them. For example, I am not looking for this solution:

pd.DataFrame(soup.find_all('td', attrs={'class':'ToWrd'})[1])[0][0]

which returns:

'pupil '

How about using a regex:

import re

valid = re.compile(r'<td class="ToWrd">(\w+) <em class="tooltip POS2">')
print(valid.match('<td class="ToWrd">pupil <em class="tooltip POS2">n<span><i>noun</i>: Refers to person, place, thing, quality, etc.</span></em></td>').group(1))

It returns:

pupil

The example above works only if every wanted word is preceded by

<td class="ToWrd">

and followed by a space and

<em class="tooltip POS2">

Because the pattern uses \w+, it also captures a single word only. You can adjust the regex to fit other cases.
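One possible adjustment, shown only as a sketch: capture everything between the opening <td class="ToWrd"> and the first nested tag, so that multi-word translations and a missing space before <em> are both tolerated. The pattern name TO_WRD and the shortened sample markup are made up for this illustration, and the pattern still assumes the class attribute is written exactly as shown.

```python
import re

# Hypothetical pattern (not from the original answer): grab the text that
# follows <td class="ToWrd"> up to the next "<", trimming whitespace.
TO_WRD = re.compile(r'<td class="ToWrd">\s*([^<]+?)\s*<')

# Shortened sample markup, made up for this sketch.
html = (
    '<td class="ToWrd">pupil <em class="tooltip POS2">n</em></td>'
    '<td class="ToWrd">ice cream<em class="tooltip POS2">n</em></td>'
)

# findall() returns the captured group for every match in the string.
words = TO_WRD.findall(html)
print(words)  # ['pupil', 'ice cream']
```

Regexes remain fragile against attribute reordering or extra classes, which is why the parser-based answers are usually the safer route.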

There are several ways to reach your goal. The simplest, mentioned by @Tim Roberts, takes the first whitespace-separated token of the cell text, so be aware that it only works if the translation is a single word:

soup.find_all('td', attrs={'class':'ToWrd'})[1].text.split()[0]
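The single-word limitation is easy to see with a compound noun. The snippet below is a small sketch using made-up markup (a shortened cell containing "ice cream"); it uses the built-in html.parser so no extra parser is required.

```python
from bs4 import BeautifulSoup

# Made-up sample cell with a two-word translation.
html = '<table><tr><td class="ToWrd">ice cream <em class="tooltip POS2">n</em></td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')

cell = soup.find('td', attrs={'class': 'ToWrd'})

# .text concatenates all nested strings: "ice cream n".
# split()[0] keeps only the first whitespace-separated token.
first_token = cell.text.split()[0]
print(first_token)  # ice
```

Only "ice" survives, so this approach loses part of any multi-word translation.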

An alternative that works with single words, compound nouns, and multiple words is stripped_strings:

list(soup.find_all('td', attrs={'class':'ToWrd'})[1].stripped_strings)[0]

The same job can be done by combining get_text() with parameters and split(), but I prefer stripped_strings:

soup.find_all('td', attrs={'class':'ToWrd'})[1].get_text('|',strip=True).split('|')[0]

Note: If there is only one <td> with that class, use find() instead of find_all().
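A minimal sketch of the find() variant, assuming made-up sample markup with a single matching cell. Unlike find_all(), find() returns the first match directly, or None when nothing matches, so a guard is worth adding before reading the text. The class name NoSuchClass is hypothetical and only there to show the None case.

```python
from bs4 import BeautifulSoup

# Made-up sample markup with exactly one matching cell.
html = '<table><tr><td class="ToWrd">pupil <em class="tooltip POS2">n</em></td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching tag, or None if there is no match.
cell = soup.find('td', attrs={'class': 'ToWrd'})

# stripped_strings is a generator, so next() yields the first stripped string.
word = next(cell.stripped_strings) if cell is not None else None
print(word)  # pupil

# Hypothetical class name that does not occur in the markup.
missing = soup.find('td', attrs={'class': 'NoSuchClass'})
print(missing)  # None
```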

Example

This extracts single words as well as compound nouns / multiple words:

html = '''
<td class="ToWrd">pupil<em class="tooltip POS2">n<span><i>noun</i>: Refers to person, place, thing, quality, etc.</span></em></td>
<td class="ToWrd">ice cream<em class="tooltip POS2">n<span><i>noun</i>: Refers to person, place, thing, quality, etc.</span></em></td>
'''

soup = BeautifulSoup(html, 'lxml')

[list(w.stripped_strings)[0] for w in soup.find_all('td', attrs={'class':'ToWrd'})]

Output

['pupil', 'ice cream']
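Since the question asks to avoid extra dependencies, the same idea can also be sketched with the built-in 'html.parser' instead of 'lxml', which is a separate package. The helper name translations is made up for this illustration; it also skips cells that contain no text at all.

```python
from bs4 import BeautifulSoup

html = '''
<td class="ToWrd">pupil<em class="tooltip POS2">n<span><i>noun</i>: Refers to person, place, thing, quality, etc.</span></em></td>
<td class="ToWrd">ice cream<em class="tooltip POS2">n<span><i>noun</i>: Refers to person, place, thing, quality, etc.</span></em></td>
'''

def translations(markup):
    """Return the first stripped string of every <td class="ToWrd"> cell."""
    # 'html.parser' ships with Python, so no additional install is needed.
    soup = BeautifulSoup(markup, 'html.parser')
    result = []
    for cell in soup.find_all('td', attrs={'class': 'ToWrd'}):
        # next(..., None) avoids StopIteration for cells without any text.
        first = next(cell.stripped_strings, None)
        if first is not None:
            result.append(first)
    return result

words = translations(html)
print(words)  # ['pupil', 'ice cream']
```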
