I have a list of countries scraped from this website, with values like the following (note: this is the output of all_countries after iterating over its elements):
<a data-flexible="" SELECT data-id="AU" href="http://www.wotif.com/AU">Australia</a>
<a data-flexible="" data-id="NZ" href="http://www.wotif.com/NZ">New Zealand</a>
<a data-flexible="" data-id="ID" href="http://www.wotif.com/ID">Indonesia</a>
<a data-flexible="" data-id="TH" href="http://www.wotif.com/TH">Thailand</a>
<a data-flexible="" data-id="SG" href="http://www.wotif.com/SG">Singapore</a>
<a data-flexible="" data-id="GB" href="http://www.wotif.com/GB">United Kingdom</a>
<a data-flexible="" data-id="TH" href="http://www.wotif.com/TH">Thailand</a>
<a data-flexible="" data-id="AU" href="http://www.wotif.com/AU">Australia</a>
<a data-flexible="" data-id="AR" href="http://www.wotif.com/AR">Argentina</a>
<a data-flexible="" data-id="NZ" href="http://www.wotif.com/NZ">New Zealand</a>
What I want to do is get only the unique countries.
This is what I have tried:
all_countries = countries.select('div#country-box ul li a')
for index, value in enumerate(all_countries):
    print(value)
    all_countries[index] = value.text
all_countries = set(all_countries)
all_countries = list(all_countries)
for index, value in enumerate(all_countries):
    print(value)
Hmmm, okay, I now have unique elements, but this does not maintain the order in which the countries appear on the site in that multi-select list, and I also need the values of the data-id and href attributes, as well as the text of each a tag, for later use in my script.
If I do
all_countries = countries.select('div#country-box ul li a')
all_countries = set(all_countries)
all_countries = list(all_countries)
Would it be a good approach?
Use a set to store already-seen data-ids.
from bs4 import BeautifulSoup

def iter_uniq_link(all_countries):
    seen = set()
    for c in all_countries:
        data_id = c.get('data-id')
        if data_id not in seen:
            seen.add(data_id)
            yield c
Usage:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''
... <body>
... <div id="country-box">
... <ul>
... <li>
... <a data-flexible="" SELECT data-id="AU" href="http://www.wotif.com/AU">Australia</a>
... <a data-flexible="" data-id="NZ" href="http://www.wotif.com/NZ">New Zealand</a>
... <a data-flexible="" data-id="ID" href="http://www.wotif.com/ID">Indonesia</a>
... <a data-flexible="" data-id="TH" href="http://www.wotif.com/TH">Thailand</a>
... <a data-flexible="" data-id="SG" href="http://www.wotif.com/SG">Singapore</a>
... <a data-flexible="" data-id="GB" href="http://www.wotif.com/GB">United Kingdom</a>
... <a data-flexible="" data-id="TH" href="http://www.wotif.com/TH">Thailand</a>
... <a data-flexible="" data-id="AU" href="http://www.wotif.com/AU">Australia</a>
... <a data-flexible="" data-id="AR" href="http://www.wotif.com/AR">Argentina</a>
... <a data-flexible="" data-id="NZ" href="http://www.wotif.com/NZ">New Zealand</a>
... </li>
... </ul>
... </div>
... </body>
... ''', 'html.parser')
>>> all_countries = soup.select('div#country-box ul li a')
>>> list(iter_uniq_link(all_countries))
[<a data-flexible="" data-id="AU" href="http://www.wotif.com/AU" select="">Australia</a>,
<a data-flexible="" data-id="NZ" href="http://www.wotif.com/NZ">New Zealand</a>,
<a data-flexible="" data-id="ID" href="http://www.wotif.com/ID">Indonesia</a>,
<a data-flexible="" data-id="TH" href="http://www.wotif.com/TH">Thailand</a>,
<a data-flexible="" data-id="SG" href="http://www.wotif.com/SG">Singapore</a>,
<a data-flexible="" data-id="GB" href="http://www.wotif.com/GB">United Kingdom</a>,
<a data-flexible="" data-id="AR" href="http://www.wotif.com/AR">Argentina</a>]
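Each yielded tag still carries its attributes, so the data-id, href, and link text the question asks for can be read off the unique tags directly. A minimal sketch (the dictionary field names here are just illustrative, and the sample HTML is trimmed down):

```python
from bs4 import BeautifulSoup

html = '''<div id="country-box"><ul><li>
<a data-id="AU" href="http://www.wotif.com/AU">Australia</a>
<a data-id="NZ" href="http://www.wotif.com/NZ">New Zealand</a>
<a data-id="AU" href="http://www.wotif.com/AU">Australia</a>
</li></ul></div>'''

soup = BeautifulSoup(html, 'html.parser')

def iter_uniq_link(all_countries):
    # Yield each tag the first time its data-id is seen, preserving order.
    seen = set()
    for c in all_countries:
        data_id = c.get('data-id')
        if data_id not in seen:
            seen.add(data_id)
            yield c

countries = [
    {'data_id': a['data-id'], 'href': a['href'], 'name': a.get_text()}
    for a in iter_uniq_link(soup.select('div#country-box ul li a'))
]
print(countries)
# [{'data_id': 'AU', 'href': 'http://www.wotif.com/AU', 'name': 'Australia'},
#  {'data_id': 'NZ', 'href': 'http://www.wotif.com/NZ', 'name': 'New Zealand'}]
```

The duplicate Australia entry is dropped while the page order of the remaining tags is kept.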
One possible way to maintain both order and uniqueness is to use an OrderedDict: add each unique value of data-id into the OrderedDict as a key.
https://docs.python.org/3.3/library/collections.html#collections.OrderedDict
Keys added to such a dictionary preserve their insertion order when you iterate through it (with .keys(), for example).
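A sketch of that idea using plain (data-id, name) tuples so it runs without bs4; the sample data is abbreviated from the question. Note that on Python 3.7+ a regular dict preserves insertion order as well, so the same pattern works without OrderedDict:

```python
from collections import OrderedDict

links = [('AU', 'Australia'), ('NZ', 'New Zealand'),
         ('TH', 'Thailand'), ('AU', 'Australia'), ('NZ', 'New Zealand')]

# Key by data-id; setdefault keeps the first occurrence and
# iteration order follows insertion order.
uniq = OrderedDict()
for data_id, name in links:
    uniq.setdefault(data_id, name)

print(list(uniq.items()))
# [('AU', 'Australia'), ('NZ', 'New Zealand'), ('TH', 'Thailand')]
```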