How to scrape data from text that has nested tags?
I am scraping a dictionary website and want to get the English translation of a word. I am using soup.find_all() to find the second instance of a tag in the page source, but the function returns a long object because the tags are nested:
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
soup.find_all('td', attrs={'class':'ToWrd'})[1]
It returns:
<td class="ToWrd">pupil <em class="tooltip POS2">n<span><i>noun</i>: Refers to person, place, thing, quality, etc.</span></em></td>
But I am only interested in "pupil", which is the meaning of the word I am looking up on that dictionary website. Can anyone help me extract just this word?
Please note that I don't want to use a numpy or pandas function, because the code does not have these dependencies and I don't want to add them. For example, I am not looking for this solution:
pd.DataFrame(soup.find_all('td', attrs={'class':'ToWrd'})[1])[0][0]
which returns:
'pupil '
How about using a regex:
import re
valid = re.compile(r'<td class="ToWrd">(\w+) <em class="tooltip POS2">')
print(valid.match('<td class="ToWrd">pupil <em class="tooltip POS2">n<span><i>noun</i>: Refers to person, place, thing, quality, etc.</span></em></td>').group(1))
returns
pupil
The example above only works if every match has <td class="ToWrd"> before and <em class="tooltip POS2"> after the wanted word, and if the translation is a single word, since \w+ stops at the first space. You can adjust the regex accordingly.
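One possible adjustment, as a rough sketch: capture everything up to the first nested tag instead of a single \w+ run, so multi-word translations survive. The second sample cell ("ice cream") is a made-up entry for illustration, and the pattern still assumes the exact <td class="ToWrd"> opening tag from the question.

```python
import re

# First cell is copied from the question; the second is a hypothetical
# two-word entry added only to show the difference from \w+.
cells = [
    '<td class="ToWrd">pupil <em class="tooltip POS2">n<span><i>noun</i>: '
    'Refers to person, place, thing, quality, etc.</span></em></td>',
    '<td class="ToWrd">ice cream <em class="tooltip POS2">n</em></td>',
]

# Capture all text between the opening <td> and the next "<",
# trimming surrounding whitespace.
pattern = re.compile(r'<td class="ToWrd">\s*([^<]+?)\s*<')

words = []
for cell in cells:
    match = pattern.search(cell)
    if match:  # skip cells that do not fit the assumed layout
        words.append(match.group(1))

print(words)
```

This no longer depends on the class of the <em> tag, but it is still tied to the literal markup, so a parser-based approach is usually less brittle.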
There are different approaches to reach your goal. The simplest was mentioned by @Tim Roberts, but be aware that it only works if the translation is a single word:
soup.find_all('td', attrs={'class':'ToWrd'})[1].text.split()[0]
An alternative that works with single words, compound nouns and multiple words is stripped_strings:
list(soup.find_all('td', attrs={'class':'ToWrd'})[1].stripped_strings)[0]
The same job can also be done by combining get_text() (with a separator and strip=True) with split(), but I prefer stripped_strings:
soup.find_all('td', attrs={'class':'ToWrd'})[1].get_text('|',strip=True).split('|')[0]
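To make the difference between the three variants concrete, here is a small side-by-side sketch on a two-word cell. The "ice cream" markup is an assumed sample modelled on the cell in the question, not taken from the real site.

```python
from bs4 import BeautifulSoup

# Assumed sample cell with a two-word translation.
html = (
    '<td class="ToWrd">ice cream <em class="tooltip POS2">n<span><i>noun</i>: '
    'Refers to person, place, thing, quality, etc.</span></em></td>'
)
soup = BeautifulSoup(html, 'html.parser')
cell = soup.find('td', attrs={'class': 'ToWrd'})

# 1) Whitespace split: keeps only the first token.
first_token = cell.text.split()[0]

# 2) stripped_strings: yields each text node, stripped; the first one
#    is the text that sits directly before the nested <em>.
first_string = next(cell.stripped_strings)

# 3) get_text() with a separator, then split on that separator.
via_get_text = cell.get_text('|', strip=True).split('|')[0]

print(first_token)   # ice
print(first_string)  # ice cream
print(via_get_text)  # ice cream
```

The get_text() variant would cut the result short if the translation itself contained the separator character, which is one reason to lean towards stripped_strings.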
Note: If there is only one <td> with that class, use find() instead of find_all().
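A minimal sketch of the find() variant follows. The helper name first_translation is hypothetical. It also guards against a missing cell, because find() returns None when nothing matches.

```python
from bs4 import BeautifulSoup


def first_translation(html):
    """Return the leading text of the first <td class="ToWrd">, or None.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, 'html.parser')
    cell = soup.find('td', attrs={'class': 'ToWrd'})
    if cell is None:  # no matching cell in the page
        return None
    # The default of None covers a cell that contains no text at all.
    return next(cell.stripped_strings, None)


print(first_translation('<td class="ToWrd">pupil <em class="tooltip POS2">n</em></td>'))
print(first_translation('<p>no table cell here</p>'))
```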
The following extracts single words as well as compound nouns / multiple words:
html = '''
<td class="ToWrd">pupil<em class="tooltip POS2">n<span><i>noun</i>: Refers to person, place, thing, quality, etc.</span></em></td>
<td class="ToWrd">ice cream<em class="tooltip POS2">n<span><i>noun</i>: Refers to person, place, thing, quality, etc.</span></em></td>
'''
soup = BeautifulSoup(html, 'lxml')  # needs the lxml package; 'html.parser' also works here
[list(w.stripped_strings)[0] for w in soup.find_all('td', attrs={'class':'ToWrd'})]
which returns:
['pupil', 'ice cream']