Python BeautifulSoup extraction

Question

I have used the following code to access the description that is posted bellow.

Here is the code:

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.meteoclimatic.net/feed/rss/ESCYL2400000024153A')
soup = BeautifulSoup(resp.content, features='xml')
items = soup.findAll('item')
print(items[0].description)

I have obtained the following XML sample:

<description>

     &lt;ul&gt;
&lt;li&gt;&lt;img src="http://meteoclimatic.net/img/sem_tpv.png" style="width: 12px; height: 12px; border: 0px;" alt="***" /&gt; &lt;a href="http://www.meteoclimatic.net/perfil/ESCYL2400000024153A"&gt;Sta Mar&amp;#237;a del Condado&lt;/a&gt;&lt;/li&gt;
&lt;ul&gt;
&lt;li&gt; Actualizado: 24-07-2018 08:20 UTC&lt;/li&gt;
&lt;li&gt;Temperatura: &lt;b&gt;23,6&lt;/b&gt; &amp;#186;C (
M&amp;#225;x.: &lt;b style="color: red"&gt;23,6&lt;/b&gt; /
M&amp;#237;n.: &lt;b style="color: blue"&gt;12,1&lt;/b&gt; )&lt;/li&gt;
&lt;li&gt;Humedad: &lt;b&gt;54,0&lt;/b&gt; % (
M&amp;#225;x.: &lt;b style="color: red"&gt;91,0&lt;/b&gt; /
M&amp;#237;n.: &lt;b style="color: blue"&gt;54,0&lt;/b&gt; )&lt;/li&gt;
&lt;li&gt;Bar&amp;#243;metro: &lt;b&gt;1021,0&lt;/b&gt; hPa (
M&amp;#225;x.: &lt;b style="color: red"&gt;1021,2&lt;/b&gt; /
M&amp;#237;n.: &lt;b style="color: blue"&gt;1019,9&lt;/b&gt; )&lt;/li&gt;
&lt;li&gt;Viento: &lt;b&gt;1,0&lt;/b&gt; km/h (
M&amp;#225;x.: &lt;b style="color: red"&gt;9,0&lt;/b&gt; )&lt;/li&gt;
&lt;li&gt;Direcci&amp;#243;n del viento: &lt;b&gt;170&lt;/b&gt; - S&lt;/li&gt;
&lt;li&gt;Precip.: &lt;b&gt;0,0&lt;/b&gt; mm&lt;/li&gt;
&lt;/ul&gt;
     &lt;/ul&gt;

<!--
[[<BEGIN:ESCYL2400000024153A:DATA>]]
[[<ESCYL2400000024153A;(23,6;23,6;12,1;sun);(54,0;91,0;54,0);(1021,0;1021,2;1019,9);(1,0;9,0;170);(0,0);Sta Mar&#237;a del Condado>]]
[[<END:ESCYL2400000024153A:DATA>]]
-->
</description>

I want to extract the items contained between the labels [[<BEGIN:ESCYL2400000024153A:DATA>]] and [[<END:ESCYL2400000024153A:DATA>]] . How could I do that in a "pythonic" way without having to manually parse every item as a string?

Edit:

The data I want to extract may also be found in this part of the soup:

&lt;ul&gt;
&lt;li&gt;&lt;img src="http://meteoclimatic.net/img/sem_tpv.png" style="width: 12px; height: 12px; border: 0px;" alt="***" /&gt; &lt;a href="http://www.meteoclimatic.net/perfil/ESCYL2400000024153A"&gt;Sta Mar&amp;#237;a del Condado&lt;/a&gt;&lt;/li&gt;
&lt;ul&gt;
&lt;li&gt; Actualizado: 24-07-2018 08:50 UTC&lt;/li&gt;
&lt;li&gt;Temperatura: &lt;b&gt;24,4&lt;/b&gt; &amp;#186;C (
M&amp;#225;x.: &lt;b style="color: red"&gt;24,5&lt;/b&gt; /
M&amp;#237;n.: &lt;b style="color: blue"&gt;12,1&lt;/b&gt; )&lt;/li&gt;
&lt;li&gt;Humedad: &lt;b&gt;49,0&lt;/b&gt; % (
M&amp;#225;x.: &lt;b style="color: red"&gt;91,0&lt;/b&gt; /
M&amp;#237;n.: &lt;b style="color: blue"&gt;49,0&lt;/b&gt; )&lt;/li&gt;
&lt;li&gt;Bar&amp;#243;metro: &lt;b&gt;1021,0&lt;/b&gt; hPa (
M&amp;#225;x.: &lt;b style="color: red"&gt;1021,2&lt;/b&gt; /
M&amp;#237;n.: &lt;b style="color: blue"&gt;1019,9&lt;/b&gt; )&lt;/li&gt;
&lt;li&gt;Viento: &lt;b&gt;5,0&lt;/b&gt; km/h (
M&amp;#225;x.: &lt;b style="color: red"&gt;10,0&lt;/b&gt; )&lt;/li&gt;
&lt;li&gt;Direcci&amp;#243;n del viento: &lt;b&gt;219&lt;/b&gt; - SW&lt;/li&gt;
&lt;li&gt;Precip.: &lt;b&gt;0,0&lt;/b&gt; mm&lt;/li&gt;
&lt;/ul&gt;
     &lt;/ul&gt;

Answer 1

Use lxml to get the XML comment in the description element.

from lxml import etree

tree = etree.parse("so.xml")

comment = tree.xpath("/rss/channel/item/description/comment()")[0].text
print(comment.split("\n")[2])

Output:

[[<ESCYL2400000024153A;(24,4;24,5;12,1;sun);(49,0;91,0;49,0);(1021,0;1021,2;1019,9);(5,0;10,0;219);(0,0);Sta Mar&#237;a del Condado>]]

Answer 2

You can do it with BeautifulSoup, using the Comment object:

import requests
from bs4 import BeautifulSoup, Comment

resp = requests.get('https://www.meteoclimatic.net/feed/rss/ESCYL2400000024153A')
soup = BeautifulSoup(resp.content, 'xml')
for item in soup.select('item'):
    comments = item.description.find_all(text=lambda text:isinstance(text, Comment))
    print([c for c in comments[0].split('\n') if c][1:-1])

Prints:

['[[<ESCYL2400000024153A;(24,4;24,5;12,1;sun);(49,0;91,0;49,0);(1021,0;1021,2;1019,9);(5,0;10,0;219);(0,0);Sta Mar&#237;a del Condado>]]']

Edit:

This code iterates through all <item> tags. In each <item> tag it will find in <description> all texts, that's instance of Comment object (in other words anything that is between  tags. Then it will split first comment according newlines and writes all lines but first and last.

Python BeautifulSoup extraction

Question

2 answers

solution1
0

solution2
0 ACCPTED 2018-07-24 09:03:27

Python BeautifulSoup extraction

Question

2 answers

solution1 0

solution2 0 ACCPTED 2018-07-24 09:03:27

solution1
0

solution2
0 ACCPTED 2018-07-24 09:03:27