繁体   English   中英

从html提取一些文本

[英]extract some texts from html

我有html“页面”,如下所示:

<p class=MsoNormal><span lang=EN-US style='font-size:11.0pt;font-family:"Times New Roman","serif"'>&nbsp;</span></p>

<p class=MsoNormal><span style='font-size:11.0pt'>ヤブツバキクラス(常緑広葉樹林)</span><span
style='font-size:11.0pt;font-family:"Times New Roman","serif"'> <span
lang=EN-US>Camellietea japonicae</span><span lang=EN-US> Miyawaki <i>et</i>
Ohba 1963<br>
</span></span><span style='font-size:11.0pt'> リュウキュウガキ-クスノハガシワオーダー</span><span
style='font-size:11.0pt;font-family:"Times New Roman","serif"'> <span
lang=EN-US>Diospyro maritimae-Mallotetalia philippensis</span><span lang=EN-US>
Fujiwara 1981<br>
</span></span><span style='font-size:11.0pt'>  ナガミボチョウジ-リュウキュウガキ群団</span><span
style='font-size:11.0pt;font-family:"Times New Roman","serif"'> <span
lang=EN-US>Psychotrio manilensis-Diospyrion maritimae</span><span lang=EN-US>
Niiro <i>et al.</i> 1974<br>

我需要提取如下:

elliクラス(常绿広叶树林),山茶

リDi-クスノハガシワオーダー,Diospyro maritimae-Mallotetalia philippensis

ナガミboチョウジ-リュウキュウガキ群団,Psychorio manilensis-Diospyrion maritimae

我尝试为:

soup = BeautifulSoup(page, features="lxml")

rows = soup.find_all('span')
for row in rows:
        print (row.text.strip().split(' ')[0])

但是,它提取如下:

ヤブツバキクラス(常緑広葉樹林)
Camellietea
Camellietea
Miyawaki
リュウキュウガキ−クスノハガシワオーダー
Diospyro
Diospyro
Fujiwara
ナガミボチョウジ−リュウキュウガキ群団
Psychotrio
Psychotrio
Niiro

单步执行结果,并采用每四个跨度中的前两个:

for i in range(1, len(rows), 4):
    print(rows[i].string.strip(), 
          list(rows[i+1].children)[1].string.strip())

#ヤブツバキクラス(常緑広葉樹林)Camellietea japonicae
#リュウキュウガキ-クスノハガシワオーダー Diospyro maritimae-Mallotetalia philippensis
#ナガミボチョウジ-リュウキュウガキ群団 Psychotrio manilensis-Diospyrion maritimae

您还可以在bs4 4.7.1中使用:first-child和attribute = value选择器

from bs4 import BeautifulSoup as bs

html = '''
<p class=MsoNormal><span lang=EN-US style='font-size:11.0pt;font-family:"Times New Roman","serif"'>&nbsp;</span></p>
<p class=MsoNormal><span style='font-size:11.0pt'>ヤブツバキクラス(常緑広葉樹林)</span><span
style='font-size:11.0pt;font-family:"Times New Roman","serif"'> <span
lang=EN-US>Camellietea japonicae</span><span lang=EN-US> Miyawaki <i>et</i>
Ohba 1963<br>
</span></span><span style='font-size:11.0pt'> リュウキュウガキ-クスノハガシワオーダー</span><span
style='font-size:11.0pt;font-family:"Times New Roman","serif"'> <span
lang=EN-US>Diospyro maritimae-Mallotetalia philippensis</span><span lang=EN-US>
Fujiwara 1981<br>
</span></span><span style='font-size:11.0pt'>  ナガミボチョウジ-リュウキュウガキ群団</span><span
style='font-size:11.0pt;font-family:"Times New Roman","serif"'> <span
lang=EN-US>Psychotrio manilensis-Diospyrion maritimae</span><span lang=EN-US>
Niiro <i>et al.</i> 1974<br>
  '''

soup = bs(html, 'lxml')
print([i.text.strip() for i in item.select('span[style="font-size:11.0pt"], [lang=EN-US]:first-child')])

在此处输入图片说明

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM