
Extract unique elements from a list returned by BeautifulSoup

I have a list of countries scraped from this website, with values like:

(Note: this is the output of all_countries after iterating its elements)

<a  data-flexible="" SELECT data-id="AU" href="http://www.wotif.com/AU">Australia</a>
<a  data-flexible="" data-id="NZ" href="http://www.wotif.com/NZ">New Zealand</a>
<a  data-flexible="" data-id="ID" href="http://www.wotif.com/ID">Indonesia</a>
<a  data-flexible="" data-id="TH" href="http://www.wotif.com/TH">Thailand</a>
<a  data-flexible="" data-id="SG" href="http://www.wotif.com/SG">Singapore</a>
<a  data-flexible="" data-id="GB" href="http://www.wotif.com/GB">United Kingdom</a>
<a  data-flexible="" data-id="TH" href="http://www.wotif.com/TH">Thailand</a>
<a  data-flexible="" data-id="AU" href="http://www.wotif.com/AU">Australia</a>
<a  data-flexible="" data-id="AR" href="http://www.wotif.com/AR">Argentina</a>
<a  data-flexible="" data-id="NZ" href="http://www.wotif.com/NZ">New Zealand</a>

What I want is to get only the unique countries.

This is what I have tried.

all_countries = countries.select('div#country-box ul li a')

for index,value in enumerate(all_countries):
    print(value)
    all_countries[index] = value.text

all_countries = set(all_countries)
all_countries = list(all_countries)

for index,value in enumerate(all_countries):
    print(value)

Okay, I now have unique elements, but the order in which the countries appear in the site's multi-select list is lost. I also need the values of the data-id and href attributes, as well as the text of each a tag, for later use in my script.

If I do

all_countries = countries.select('div#country-box ul li a')
all_countries = set(all_countries)

all_countries = list(all_countries)

Would it be a good approach?

Use a set to keep track of the data-id values that have already been seen:

from bs4 import BeautifulSoup


def iter_uniq_link(all_countries):
    seen = set()
    for c in all_countries:
        data_id = c.get('data-id')
        if data_id not in seen:
            seen.add(data_id)
            yield c

Usage:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''
... <body>
...     <div id="country-box">
...         <ul>
...             <li>
...                 <a  data-flexible="" SELECT data-id="AU" href="http://www.wotif.com/AU">Australia</a>
...                 <a  data-flexible="" data-id="NZ" href="http://www.wotif.com/NZ">New Zealand</a>
...                 <a  data-flexible="" data-id="ID" href="http://www.wotif.com/ID">Indonesia</a>
...                 <a  data-flexible="" data-id="TH" href="http://www.wotif.com/TH">Thailand</a>
...                 <a  data-flexible="" data-id="SG" href="http://www.wotif.com/SG">Singapore</a>
...                 <a  data-flexible="" data-id="GB" href="http://www.wotif.com/GB">United Kingdom</a>
...                 <a  data-flexible="" data-id="TH" href="http://www.wotif.com/TH">Thailand</a>
...                 <a  data-flexible="" data-id="AU" href="http://www.wotif.com/AU">Australia</a>
...                 <a  data-flexible="" data-id="AR" href="http://www.wotif.com/AR">Argentina</a>
...                 <a  data-flexible="" data-id="NZ" href="http://www.wotif.com/NZ">New Zealand</a>
...             </li>
...         </ul>
...     </div>
... </body>
... ''')
>>> all_countries = soup.select('div#country-box ul li a')
>>> list(iter_uniq_link(all_countries))
[<a data-flexible="" data-id="AU" href="http://www.wotif.com/AU" select="">Australia</a>,
 <a data-flexible="" data-id="NZ" href="http://www.wotif.com/NZ">New Zealand</a>,
 <a data-flexible="" data-id="ID" href="http://www.wotif.com/ID">Indonesia</a>,
 <a data-flexible="" data-id="TH" href="http://www.wotif.com/TH">Thailand</a>,
 <a data-flexible="" data-id="SG" href="http://www.wotif.com/SG">Singapore</a>,
 <a data-flexible="" data-id="GB" href="http://www.wotif.com/GB">United Kingdom</a>,
 <a data-flexible="" data-id="AR" href="http://www.wotif.com/AR">Argentina</a>]
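
If only the country names were needed, as in the first attempt in the question, a shorter route may be enough. The following is a minimal sketch, assuming Python 3.7 or later (where plain dicts preserve insertion order); the names list is a hand-written stand-in for the text of the scraped tags, not real scraper output.

```python
# Stand-in for [tag.text for tag in all_countries]; assumed sample data.
names = [
    "Australia", "New Zealand", "Indonesia", "Thailand", "Singapore",
    "United Kingdom", "Thailand", "Australia", "Argentina", "New Zealand",
]

# dict.fromkeys keeps one key per distinct name, in first-seen order,
# so converting the keys back to a list drops repeats without reordering.
unique_names = list(dict.fromkeys(names))
print(unique_names)
```

This only keeps the text, so it does not replace the generator above when data-id and href are also needed.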

One possible way to maintain order and uniqueness is to use an OrderedDict. Add each unique value of data-id into the OrderedDict as a key.

https://docs.python.org/3.3/library/collections.html#collections.OrderedDict

Keys added to such a dictionary keep their insertion order when you iterate over it (with .keys(), for example).
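
The idea above could be sketched roughly as follows. This is an illustration under assumptions, not a definitive implementation: unique_by_data_id is a hypothetical helper name, and plain dicts stand in for the scraped a tags so that the snippet runs without BeautifulSoup. A bs4 tag also offers .get('data-id'), so the same function should work on the result of select().

```python
from collections import OrderedDict


def unique_by_data_id(links):
    """Return links deduplicated by 'data-id', keeping first-seen order.

    Hypothetical helper: `links` is any iterable of objects with a
    dict-style .get() method.
    """
    unique = OrderedDict()
    for link in links:
        # setdefault stores the link only if the key is not present yet,
        # so the first occurrence of each data-id wins.
        unique.setdefault(link.get("data-id"), link)
    return list(unique.values())


# Assumed sample data mirroring a few of the links from the question.
links = [
    {"data-id": "AU", "href": "http://www.wotif.com/AU", "text": "Australia"},
    {"data-id": "NZ", "href": "http://www.wotif.com/NZ", "text": "New Zealand"},
    {"data-id": "TH", "href": "http://www.wotif.com/TH", "text": "Thailand"},
    {"data-id": "AU", "href": "http://www.wotif.com/AU", "text": "Australia"},
    {"data-id": "NZ", "href": "http://www.wotif.com/NZ", "text": "New Zealand"},
]

unique_links = unique_by_data_id(links)
for link in unique_links:
    print(link["data-id"], link["href"], link["text"])
```

Storing the whole link as the value keeps data-id, href and the text available together for later use.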


 