Extracting content from an HTML tag in Python
I tried extracting some text from inside an HTML tag, but I couldn't get it. I only want to extract
A di gàárì sílẹ̀ ewúrẹ́ ńyọjú; ẹrù ìran rẹ̀ ni
A fi ọ́ jọba ò ńṣàwúre o fẹ́ jẹ Ọlọ́run ni,
A fijó gba Awà;a fìjà gba Awà; bí a ò bá jó, bí a ò bá jà, bí a bá ti gba Awà, kò tán bí?
and then add them to a list.
from bs4 import BeautifulSoup

# `html` is the string shown below; it must be defined before this code runs.
soup = BeautifulSoup(html, 'html.parser')
per = {'data': []}
for br in soup.findAll('p'):
    text = br.text  # .split('\r\n')[0].replace('?', '')
html = """
[<p xmlns:ino="http://namespaces.softwareag.com/tamino/response2" xmlns:xq="http://namespaces.softwareag.com/tamino/XQuery/result" xmlns:xql="http://metalab.unc.edu/xql/">
A di gàárì sílẹ̀ ewúrẹ́ ńyọjú; ẹrù ìran rẹ̀ ni?<br/>
We prepare the saddle, and the goat presents itself; is it a burden for the lineage of goats?<br/>
(Goats that know their place do not offer their backs to be saddled.)<br/>
This is a variant of A gbé gàárì ọmọ ewúrẹ́ ńrojú . . .<br/>
</p>,
<p xmlns:ino="http://namespaces.softwareag.com/tamino/response2" xmlns:xq="http://namespaces.softwareag.com/tamino/XQuery/result" xmlns:xql="http://metalab.unc.edu/xql/">
A fi ọ́ jọba ò ńṣàwúre o fẹ́ jẹ Ọlọ́run ni?<br/>
You have been crowned a king, and yet you make good-luck charms; would you be crowned God?<br/>
(Being crowned a king is about the best fortune a mortal could hope for.)<br/>
</p>,
<p xmlns:ino="http://namespaces.softwareag.com/tamino/response2" xmlns:xq="http://namespaces.softwareag.com/tamino/XQuery/result" xmlns:xql="http://metalab.unc.edu/xql/">
A fijó gba Awà; a fìjà gba Awà; bí a ò bá jó, bí a ò bá jà, bí a bá ti gba Awà, kò tán bí?<br/>
By dancing we take possession of Awà; through fighting we take possession of Awà; if we neither dance nor fight, but take possession of Awà anyway, is the result not the same?<br/>
(Why make a huge production of a matter that is easily taken care of?)<br/>
</p>]
"""
Use strip() before you split the lines, because the text may start with empty lines.
I also had to use \n instead of \r\n.
for br in soup.findAll('p'):
    text = br.text.strip().split('\n')[0].replace('?', '')
    print(text)
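Since the question also asks for the lines to end up in a list, the same strip-then-split idea can be sketched as below. The `sample_html` string is a shortened, hypothetical stand-in for the markup in the question, and the question mark is kept here (add `.replace('?', '')` if it should be dropped).

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample in the same shape as the question's markup.
sample_html = """
<p>
First proverb?<br/>
Its translation.<br/>
(A comment.)<br/>
</p>
<p>
Second proverb?<br/>
Its translation.<br/>
</p>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
per = {'data': []}
for p in soup.find_all('p'):
    # strip() removes the leading newline, so the first line is the wanted text.
    first_line = p.text.strip().split('\n')[0]
    per['data'].append(first_line)

print(per)
```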
Alternatively, you can use get_text(strip=True), but it needs separator="\n" to keep \n inside the text.
for br in soup.findAll('p'):
    text = br.get_text(strip=True, separator='\n').split('\n')[0].replace('?', '')
    print(text)
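To see why the separator matters, here is a small comparison on a hypothetical one-paragraph sample (not the question's data). With strip=True alone, each piece of text is stripped and the pieces are joined with nothing between them, so there is no \n left to split on.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one <p> with three <br/>-terminated lines.
sample_html = "<p>\nFirst proverb?<br/>\nIts translation.<br/>\n(A comment.)<br/>\n</p>"
p = BeautifulSoup(sample_html, 'html.parser').p

# Without a separator the stripped pieces are glued together.
joined = p.get_text(strip=True)
# With separator='\n' each stripped piece lands on its own line.
separated = p.get_text(strip=True, separator='\n')

print(repr(joined))
print(repr(separated))

first_line = separated.split('\n')[0]
print(first_line)
```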
You can try:
for br in soup.select('p > br:nth-of-type(1)'):
    print(br.previous_sibling)
Output:
A di gàárì sílẹ̀ ewúrẹ́ ńyọjú; ẹrù ìran rẹ̀ ni?
A fi ọ́ jọba ò ńṣàwúre o fẹ́ jẹ Ọlọ́run ni?
A fijó gba Awà; a fìjà gba Awà; bí a ò bá jó, bí a ò bá jà, bí a bá ti gba Awà, kò tán bí?
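One thing to keep in mind is that previous_sibling returns the raw text node, including the newline that precedes it in the markup. A minimal sketch that strips it and collects the results into a list, assuming a shortened hypothetical sample and a hypothetical `lines` list:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample in the same shape as the question's markup.
sample_html = """
<p>
First proverb?<br/>
Its translation.<br/>
</p>
<p>
Second proverb?<br/>
Its translation.<br/>
</p>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
lines = []
for br in soup.select('p > br:nth-of-type(1)'):
    node = br.previous_sibling
    # Text nodes are str subclasses; skip the case where the <br/> comes
    # first in the paragraph (None) or follows another tag.
    if isinstance(node, str):
        lines.append(node.strip())

print(lines)
```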
Use contents[0], which retrieves the first child of the p tag; here that is its first text node.
html = """
[<p xmlns:ino="http://namespaces.softwareag.com/tamino/response2" xmlns:xq="http://namespaces.softwareag.com/tamino/XQuery/result" xmlns:xql="http://metalab.unc.edu/xql/">
A di gàárì sílẹ̀ ewúrẹ́ ńyọjú; ẹrù ìran rẹ̀ ni?<br/>
We prepare the saddle, and the goat presents itself; is it a burden for the lineage of goats?<br/>
(Goats that know their place do not offer their backs to be saddled.)<br/>
This is a variant of A gbé gàárì ọmọ ewúrẹ́ ńrojú . . .<br/>
</p>,
<p xmlns:ino="http://namespaces.softwareag.com/tamino/response2" xmlns:xq="http://namespaces.softwareag.com/tamino/XQuery/result" xmlns:xql="http://metalab.unc.edu/xql/">
A fi ọ́ jọba ò ńṣàwúre o fẹ́ jẹ Ọlọ́run ni?<br/>
You have been crowned a king, and yet you make good-luck charms; would you be crowned God?<br/>
(Being crowned a king is about the best fortune a mortal could hope for.)<br/>
</p>,
<p xmlns:ino="http://namespaces.softwareag.com/tamino/response2" xmlns:xq="http://namespaces.softwareag.com/tamino/XQuery/result" xmlns:xql="http://metalab.unc.edu/xql/">
A fijó gba Awà; a fìjà gba Awà; bí a ò bá jó, bí a ò bá jà, bí a bá ti gba Awà, kò tán bí?<br/>
By dancing we take possession of Awà; through fighting we take possession of Awà; if we neither dance nor fight, but take possession of Awà anyway, is the result not the same?<br/>
(Why make a huge production of a matter that is easily taken care of?)<br/>
</p>]
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
for ptag in soup.find_all('p'):
    print(ptag.contents[0])
Output:
A di gàárì sílẹ̀ ewúrẹ́ ńyọjú; ẹrù ìran rẹ̀ ni?
A fi ọ́ jọba ò ńṣàwúre o fẹ́ jẹ Ọlọ́run ni?
A fijó gba Awà; a fìjà gba Awà; bí a ò bá jó, bí a ò bá jà, bí a bá ti gba Awà, kò tán bí?
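Note that contents[0] raises an IndexError on an empty p tag, and it returns a tag rather than text if the paragraph starts with an element. A defensive variant might look like the following; `first_text` is a hypothetical helper and the sample markup is made up to exercise those cases.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

def first_text(tag):
    """Return the stripped first child of `tag` if it is text, else None."""
    if tag.contents and isinstance(tag.contents[0], NavigableString):
        return tag.contents[0].strip()
    return None

# Hypothetical sample: a normal paragraph, an empty one, and one that
# starts with a <br/> instead of text.
sample_html = (
    "<p>\nFirst proverb?<br/>\nIts translation.<br/>\n</p>"
    "<p></p>"
    "<p><br/>text</p>"
)
soup = BeautifulSoup(sample_html, 'html.parser')
results = [first_text(p) for p in soup.find_all('p')]
print(results)
```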