Python BeautifulSoup extraction
I have used the following code to access the description that is posted below.
Here is the code:
import requests
from bs4 import BeautifulSoup
resp = requests.get('https://www.meteoclimatic.net/feed/rss/ESCYL2400000024153A')
soup = BeautifulSoup(resp.content, features='xml')
items = soup.find_all('item')
print(items[0].description)
I have obtained the following XML sample:
<description>
<ul>
<li><img src="http://meteoclimatic.net/img/sem_tpv.png" style="width: 12px; height: 12px; border: 0px;" alt="***" /> <a href="http://www.meteoclimatic.net/perfil/ESCYL2400000024153A">Sta Mar&#237;a del Condado</a></li>
<ul>
<li> Actualizado: 24-07-2018 08:20 UTC</li>
<li>Temperatura: <b>23,6</b> &#186;C (
M&#225;x.: <b style="color: red">23,6</b> /
M&#237;n.: <b style="color: blue">12,1</b> )</li>
<li>Humedad: <b>54,0</b> % (
M&#225;x.: <b style="color: red">91,0</b> /
M&#237;n.: <b style="color: blue">54,0</b> )</li>
<li>Bar&#243;metro: <b>1021,0</b> hPa (
M&#225;x.: <b style="color: red">1021,2</b> /
M&#237;n.: <b style="color: blue">1019,9</b> )</li>
<li>Viento: <b>1,0</b> km/h (
M&#225;x.: <b style="color: red">9,0</b> )</li>
<li>Direcci&#243;n del viento: <b>170</b> - S</li>
<li>Precip.: <b>0,0</b> mm</li>
</ul>
</ul>
<!--
[[<BEGIN:ESCYL2400000024153A:DATA>]]
[[<ESCYL2400000024153A;(23,6;23,6;12,1;sun);(54,0;91,0;54,0);(1021,0;1021,2;1019,9);(1,0;9,0;170);(0,0);Sta María del Condado>]]
[[<END:ESCYL2400000024153A:DATA>]]
-->
</description>
I want to extract the items contained between the labels [[<BEGIN:ESCYL2400000024153A:DATA>]] and [[<END:ESCYL2400000024153A:DATA>]]. How could I do that in a "pythonic" way without having to manually parse every item as a string?
Edit:
The data I want to extract may also be found in this part of the soup:
<ul>
<li><img src="http://meteoclimatic.net/img/sem_tpv.png" style="width: 12px; height: 12px; border: 0px;" alt="***" /> <a href="http://www.meteoclimatic.net/perfil/ESCYL2400000024153A">Sta Mar&#237;a del Condado</a></li>
<ul>
<li> Actualizado: 24-07-2018 08:50 UTC</li>
<li>Temperatura: <b>24,4</b> &#186;C (
M&#225;x.: <b style="color: red">24,5</b> /
M&#237;n.: <b style="color: blue">12,1</b> )</li>
<li>Humedad: <b>49,0</b> % (
M&#225;x.: <b style="color: red">91,0</b> /
M&#237;n.: <b style="color: blue">49,0</b> )</li>
<li>Bar&#243;metro: <b>1021,0</b> hPa (
M&#225;x.: <b style="color: red">1021,2</b> /
M&#237;n.: <b style="color: blue">1019,9</b> )</li>
<li>Viento: <b>5,0</b> km/h (
M&#225;x.: <b style="color: red">10,0</b> )</li>
<li>Direcci&#243;n del viento: <b>219</b> - SW</li>
<li>Precip.: <b>0,0</b> mm</li>
</ul>
</ul>
Use lxml to get the XML comment in the description element.
from lxml import etree
tree = etree.parse("so.xml")
comment = tree.xpath("/rss/channel/item/description/comment()")[0].text
print(comment.split("\n")[2])
Output:
[[<ESCYL2400000024153A;(24,4;24,5;12,1;sun);(49,0;91,0;49,0);(1021,0;1021,2;1019,9);(5,0;10,0;219);(0,0);Sta María del Condado>]]
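Once that line is isolated, its fields can be split apart without hand-parsing each value. The following is only a minimal sketch: the helper names `to_float` and `parse_record` are hypothetical, and the meaning of each group (temperature, humidity, pressure, wind, precipitation) is inferred from the HTML list in the same description, not from any published format.

```python
import re

def to_float(value):
    """Convert a decimal-comma number such as '24,4' to a float."""
    return float(value.replace(",", "."))

def parse_record(line):
    """Split one '[[<...>]]' data line into a dict (layout assumed from the sample)."""
    body = line.strip()
    if not (body.startswith("[[<") and body.endswith(">]]")):
        raise ValueError("not a data record: %r" % line)
    body = body[3:-3]                      # drop the '[[<' and '>]]' wrappers
    code, rest = body.split(";", 1)        # station code comes first
    # Each parenthesised group holds ';'-separated values, e.g. (24,4;24,5;12,1;sun)
    groups = [g.split(";") for g in re.findall(r"\(([^)]*)\)", rest)]
    temp, hum, press, wind, rain = groups  # assumed order, taken from the sample
    return {
        "station": code,
        "name": rest.rsplit(";", 1)[1],    # station name is the last field
        "temperature": [to_float(v) for v in temp[:3]],
        "sky": temp[3],
        "humidity": [to_float(v) for v in hum],
        "pressure": [to_float(v) for v in press],
        "wind": [to_float(v) for v in wind],
        "precipitation": to_float(rain[0]),
    }

line = ("[[<ESCYL2400000024153A;(24,4;24,5;12,1;sun);(49,0;91,0;49,0);"
        "(1021,0;1021,2;1019,9);(5,0;10,0;219);(0,0);Sta María del Condado>]]")
record = parse_record(line)
print(record["name"], record["temperature"])
```

In this sketch the three-value lists are read as current, maximum, and minimum, and the wind list as speed, maximum speed, and direction in degrees; those readings are assumptions based on the sample above.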
You can do it with BeautifulSoup, using the Comment object:
import requests
from bs4 import BeautifulSoup, Comment
resp = requests.get('https://www.meteoclimatic.net/feed/rss/ESCYL2400000024153A')
soup = BeautifulSoup(resp.content, 'xml')
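for item in soup.select('item'):
    # Collect every comment node inside this item's <description>
    comments = item.description.find_all(string=lambda text: isinstance(text, Comment))
    # Drop empty lines, then the BEGIN and END marker lines
    print([c for c in comments[0].split('\n') if c][1:-1])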
Prints:
['[[<ESCYL2400000024153A;(24,4;24,5;12,1;sun);(49,0;91,0;49,0);(1021,0;1021,2;1019,9);(5,0;10,0;219);(0,0);Sta María del Condado>]]']
Edit:
This code iterates through all <item> tags. In each <item> tag it looks inside <description> for every text node that is an instance of the Comment class (in other words, anything between <!-- and -->). It then splits the first comment on newlines, drops the empty lines, and prints every remaining line except the first and last, which are the BEGIN and END markers.
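Dropping the first and last lines relies on the markers always sitting in those positions. As a possible alternative, the comment text can be matched against the BEGIN and END markers themselves. This is a sketch using only the standard library `re` module; the names `BLOCK_RE` and `extract_records` are made up for illustration, and the sample string is copied from the question.

```python
import re

# Payload between the BEGIN and END markers that carry the same station code.
BLOCK_RE = re.compile(
    r"\[\[<BEGIN:(?P<code>[^:]+):DATA>\]\]\s*"
    r"(?P<body>.*?)\s*"
    r"\[\[<END:(?P=code):DATA>\]\]",
    re.DOTALL,
)

def extract_records(comment_text):
    """Return the non-empty lines found between the BEGIN/END markers."""
    match = BLOCK_RE.search(comment_text)
    if match is None:
        return []  # no marker pair in this comment
    lines = match.group("body").splitlines()
    return [ln.strip() for ln in lines if ln.strip()]

sample = """
[[<BEGIN:ESCYL2400000024153A:DATA>]]
[[<ESCYL2400000024153A;(23,6;23,6;12,1;sun);(54,0;91,0;54,0);(1021,0;1021,2;1019,9);(1,0;9,0;170);(0,0);Sta María del Condado>]]
[[<END:ESCYL2400000024153A:DATA>]]
"""
records = extract_records(sample)
print(records)
```

Inside the loop above, the same function could be called as `extract_records(str(comments[0]))`, which keeps working if extra blank lines or other text ever appear around the markers.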