Python BeautifulSoup extraction

Question

I have used the following code to access the description posted below.

Here is the code:

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.meteoclimatic.net/feed/rss/ESCYL2400000024153A')
soup = BeautifulSoup(resp.content, features='xml')
items = soup.findAll('item')
print(items[0].description)

I have obtained the following XML sample:

<description>

     &lt;ul&gt;
&lt;li&gt;&lt;img src="http://meteoclimatic.net/img/sem_tpv.png" style="width: 12px; height: 12px; border: 0px;" alt="***" /&gt; &lt;a href="http://www.meteoclimatic.net/perfil/ESCYL2400000024153A"&gt;Sta Mar&amp;#237;a del Condado&lt;/a&gt;&lt;/li&gt;
&lt;ul&gt;
&lt;li&gt; Actualizado: 24-07-2018 08:20 UTC&lt;/li&gt;
&lt;li&gt;Temperatura: &lt;b&gt;23,6&lt;/b&gt; &amp;#186;C (
M&amp;#225;x.: &lt;b style="color: red"&gt;23,6&lt;/b&gt; /
M&amp;#237;n.: &lt;b style="color: blue"&gt;12,1&lt;/b&gt; )&lt;/li&gt;
&lt;li&gt;Humedad: &lt;b&gt;54,0&lt;/b&gt; % (
M&amp;#225;x.: &lt;b style="color: red"&gt;91,0&lt;/b&gt; /
M&amp;#237;n.: &lt;b style="color: blue"&gt;54,0&lt;/b&gt; )&lt;/li&gt;
&lt;li&gt;Bar&amp;#243;metro: &lt;b&gt;1021,0&lt;/b&gt; hPa (
M&amp;#225;x.: &lt;b style="color: red"&gt;1021,2&lt;/b&gt; /
M&amp;#237;n.: &lt;b style="color: blue"&gt;1019,9&lt;/b&gt; )&lt;/li&gt;
&lt;li&gt;Viento: &lt;b&gt;1,0&lt;/b&gt; km/h (
M&amp;#225;x.: &lt;b style="color: red"&gt;9,0&lt;/b&gt; )&lt;/li&gt;
&lt;li&gt;Direcci&amp;#243;n del viento: &lt;b&gt;170&lt;/b&gt; - S&lt;/li&gt;
&lt;li&gt;Precip.: &lt;b&gt;0,0&lt;/b&gt; mm&lt;/li&gt;
&lt;/ul&gt;
     &lt;/ul&gt;

<!--
[[<BEGIN:ESCYL2400000024153A:DATA>]]
[[<ESCYL2400000024153A;(23,6;23,6;12,1;sun);(54,0;91,0;54,0);(1021,0;1021,2;1019,9);(1,0;9,0;170);(0,0);Sta Mar&#237;a del Condado>]]
[[<END:ESCYL2400000024153A:DATA>]]
-->
</description>

I want to extract the items contained between the markers [[<BEGIN:ESCYL2400000024153A:DATA>]] and [[<END:ESCYL2400000024153A:DATA>]]. How could I do that in a "pythonic" way without having to manually parse every item as a string?

Edit:

The data I want to extract can also be found in this part of the soup:

&lt;ul&gt;
&lt;li&gt;&lt;img src="http://meteoclimatic.net/img/sem_tpv.png" style="width: 12px; height: 12px; border: 0px;" alt="***" /&gt; &lt;a href="http://www.meteoclimatic.net/perfil/ESCYL2400000024153A"&gt;Sta Mar&amp;#237;a del Condado&lt;/a&gt;&lt;/li&gt;
&lt;ul&gt;
&lt;li&gt; Actualizado: 24-07-2018 08:50 UTC&lt;/li&gt;
&lt;li&gt;Temperatura: &lt;b&gt;24,4&lt;/b&gt; &amp;#186;C (
M&amp;#225;x.: &lt;b style="color: red"&gt;24,5&lt;/b&gt; /
M&amp;#237;n.: &lt;b style="color: blue"&gt;12,1&lt;/b&gt; )&lt;/li&gt;
&lt;li&gt;Humedad: &lt;b&gt;49,0&lt;/b&gt; % (
M&amp;#225;x.: &lt;b style="color: red"&gt;91,0&lt;/b&gt; /
M&amp;#237;n.: &lt;b style="color: blue"&gt;49,0&lt;/b&gt; )&lt;/li&gt;
&lt;li&gt;Bar&amp;#243;metro: &lt;b&gt;1021,0&lt;/b&gt; hPa (
M&amp;#225;x.: &lt;b style="color: red"&gt;1021,2&lt;/b&gt; /
M&amp;#237;n.: &lt;b style="color: blue"&gt;1019,9&lt;/b&gt; )&lt;/li&gt;
&lt;li&gt;Viento: &lt;b&gt;5,0&lt;/b&gt; km/h (
M&amp;#225;x.: &lt;b style="color: red"&gt;10,0&lt;/b&gt; )&lt;/li&gt;
&lt;li&gt;Direcci&amp;#243;n del viento: &lt;b&gt;219&lt;/b&gt; - SW&lt;/li&gt;
&lt;li&gt;Precip.: &lt;b&gt;0,0&lt;/b&gt; mm&lt;/li&gt;
&lt;/ul&gt;
     &lt;/ul&gt;

Answer 1

Use lxml to get the XML comment in the description element.

from lxml import etree

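# "so.xml" is assumed to be a local copy of the RSS feed.
tree = etree.parse("so.xml")

# comment() selects the XML comment node inside <description>.
comment = tree.xpath("/rss/channel/item/description/comment()")[0].text
# Line 0 is empty and line 1 is the BEGIN marker, so line 2 holds the data.
print(comment.split("\n")[2])

Output: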

[[<ESCYL2400000024153A;(24,4;24,5;12,1;sun);(49,0;91,0;49,0);(1021,0;1021,2;1019,9);(5,0;10,0;219);(0,0);Sta Mar&#237;a del Condado>]]
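The same comment lookup can also be tried without lxml or a saved so.xml file. The following is a minimal sketch using the standard library's xml.etree.ElementTree (Python 3.8 or later, where TreeBuilder accepts insert_comments). SAMPLE_FEED is a shortened stand-in for the real feed and data_lines is a hypothetical helper name; neither comes from the answer above.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the real feed (assumption: same rss/channel/item layout).
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <item>
      <description>
        &lt;ul&gt;&lt;li&gt;Precip.: &lt;b&gt;0,0&lt;/b&gt; mm&lt;/li&gt;&lt;/ul&gt;
<!--
[[<BEGIN:ESCYL2400000024153A:DATA>]]
[[<ESCYL2400000024153A;(23,6;23,6;12,1;sun);(54,0;91,0;54,0);(1021,0;1021,2;1019,9);(1,0;9,0;170);(0,0);Sta Mar&#237;a del Condado>]]
[[<END:ESCYL2400000024153A:DATA>]]
-->
      </description>
    </item>
  </channel>
</rss>"""


def data_lines(feed_xml):
    """Return the lines found between the BEGIN and END markers of each item."""
    # ElementTree drops comments by default; insert_comments=True keeps them.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(feed_xml, parser=parser)
    found = []
    for description in root.iter("description"):
        for node in description:
            # Comment nodes are elements whose tag is the ET.Comment factory.
            if node.tag is ET.Comment:
                lines = [ln.strip() for ln in node.text.splitlines() if ln.strip()]
                # Drop the BEGIN and END marker lines.
                found.extend(lines[1:-1])
    return found


print(data_lines(SAMPLE_FEED))
```

Character references such as &#237; are not expanded inside XML comments, so the station name comes back still escaped, as in the output above.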

Answer 2

You can do it with BeautifulSoup, using the Comment object:

import requests
from bs4 import BeautifulSoup, Comment

resp = requests.get('https://www.meteoclimatic.net/feed/rss/ESCYL2400000024153A')
soup = BeautifulSoup(resp.content, 'xml')
for item in soup.select('item'):
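    # Keep only the nodes that are XML comments (string= replaces the older text= argument).
    comments = item.description.find_all(string=lambda text: isinstance(text, Comment))
    # Drop empty lines, then strip the BEGIN and END marker lines.
    print([c for c in comments[0].split('\n') if c][1:-1])

Prints: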

['[[<ESCYL2400000024153A;(24,4;24,5;12,1;sun);(49,0;91,0;49,0);(1021,0;1021,2;1019,9);(5,0;10,0;219);(0,0);Sta Mar&#237;a del Condado>]]']

Edit:

This code iterates through all <item> tags. In each <item> tag it finds, within <description>, every text node that is an instance of the Comment object (in other words, anything between <!-- and -->). It then splits the first comment on newlines and prints every non-empty line except the first and last, which are the BEGIN and END markers.
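Once the data line has been isolated, its fields can be split without hand-parsing each value. The sketch below is one possible approach, not part of the original answer: parse_data_line and GROUP_NAMES are hypothetical names, and the meaning assigned to each parenthesised group is an assumption inferred by comparing the values with the HTML list in the question.

```python
import html
import re

# Assumed group order, inferred from the HTML list in the feed (not documented here).
GROUP_NAMES = ("temperature", "humidity", "pressure", "wind", "precipitation")


def to_number(token):
    """Convert '24,4' to 24.4; leave non-numeric tokens such as 'sun' unchanged."""
    try:
        return float(token.replace(",", "."))
    except ValueError:
        return token


def parse_data_line(line):
    """Split one '[[<...>]]' data line into station id, value groups and name."""
    body = line.strip()
    if not (body.startswith("[[<") and body.endswith(">]]")):
        raise ValueError("not a data line: %r" % line)
    body = body[3:-3]  # strip the '[[<' and '>]]' wrappers

    station, rest = body.split(";", 1)
    # Each parenthesised group holds semicolon-separated values.
    groups = re.findall(r"\(([^)]*)\)", rest)
    # The name follows the last group; split on ');' because '&#237;' contains ';'.
    name = html.unescape(rest.rpartition(");")[2])

    record = {"station": station, "name": name}
    for key, group in zip(GROUP_NAMES, groups):
        record[key] = [to_number(token) for token in group.split(";")]
    return record


line = ("[[<ESCYL2400000024153A;(24,4;24,5;12,1;sun);(49,0;91,0;49,0);"
        "(1021,0;1021,2;1019,9);(5,0;10,0;219);(0,0);Sta Mar&#237;a del Condado>]]")
print(parse_data_line(line))
```

The name is taken from after the last ");" rather than the last ";" because the escaped character reference in the station name contains a semicolon of its own.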

Answer 2: accepted, 2018-07-24 09:03:27