bs4 soup.select() vs. soup.find()

I am trying to scrape the text of some elements in a table using requests and BeautifulSoup, specifically the country names and the 2-letter country codes from this website.

Here is my code, which I have progressively walked back:

import requests
import bs4

res = requests.get('https://country-code.cl/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text)

for i in range(3):
    row = soup.find(f'#row{i} td')
    print(row) # printing to check progress for now

I had hoped to go row by row and walk the tags to get the strings like so (over range(249)). However, soup.find() doesn't appear to work; it just prints None for each row. soup.select(), however, works fine:

for i in range(3):
    row = soup.select(f'#row{i} td')
    print(row)

Why does soup.find() not work as expected here?

find() expects its first argument to be the name of the tag you're searching for (optionally followed by attribute filters); it does not understand CSS selectors.

So you'll need:

row = soup.find('tr', { 'id': f"row{i}" })

To get the tr with the desired ID.
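To make the difference concrete, here is a minimal sketch run against a small inline table. The HTML below is a made-up miniature, not the real page's markup; only the row ids mirror the question.

```python
import bs4

# Hypothetical miniature of the table; the real page's markup may differ.
html = """
<table id="countriesTable">
  <tbody>
    <tr id="row0"><td>1</td><td>Afghanistan</td><td>AF</td></tr>
    <tr id="row1"><td>2</td><td>Albania</td><td>AL</td></tr>
  </tbody>
</table>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# find() treats the string as a tag name, so it looks for a tag
# literally named "#row0 td" and finds nothing.
by_find = soup.find("#row0 td")

# select() parses the same string as a CSS selector.
by_select = soup.select("#row0 td")

# find() with a tag name plus an attribute filter.
by_attrs = soup.find("tr", {"id": "row0"})

print(by_find)
print([td.text for td in by_select])
print(by_attrs["id"])
```

The first print shows None, which is what find() returns when nothing matches.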


Then, to get the 2-letter country code, find the first a whose title is ISO 3166-1 alpha-2 code and read its .text:

iso = row.find('a', { 'title': 'ISO 3166-1 alpha-2 code' }).text

To get the full name, there is no class name to search for, so I'd take the third td (index 2) and then pick out the span inside it that contains the country name:

name = row.find_all('td')[2].find_all('span')[2].text

Putting it all together gives:

import requests
import bs4

res = requests.get('https://country-code.cl/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')

for i in range(3):
    row = soup.find('tr', { 'id': f"row{i}" })

    iso = row.find('a', { 'title': 'ISO 3166-1 alpha-2 code' }).text
    name = row.find_all('td')[2].find_all('span')[2].text

    print(name, iso)

Which outputs:

Afghanistan  AF
Åland Islands  AX
Albania  AL
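One caveat with chaining .text directly onto find(): when nothing matches, find() returns None and the chain raises AttributeError. The sketch below shows one way to guard against that. The helper name iso_code and the inline HTML are illustrative assumptions, not part of the original answer or the real page.

```python
import bs4

# Hypothetical markup: row0 has the link, row1 lacks it, row2 does not exist.
html = """
<table>
  <tr id="row0"><td><a title="ISO 3166-1 alpha-2 code">AF</a></td></tr>
  <tr id="row1"><td>no link here</td></tr>
</table>
"""
soup = bs4.BeautifulSoup(html, "html.parser")


def iso_code(row):
    """Return the alpha-2 code found in a row, or None if the row or link is missing."""
    if row is None:
        return None
    link = row.find("a", {"title": "ISO 3166-1 alpha-2 code"})
    return link.get_text(strip=True) if link is not None else None


codes = [iso_code(soup.find("tr", {"id": f"row{i}"})) for i in range(3)]
print(codes)
```

Whether to skip, log, or fail on a missing row depends on how much you trust the page layout; returning None just keeps the loop running.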

find_all() and select() return a list of matching elements, whereas find() and select_one() return only a single element.
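As a quick illustration of those return values, here is a small sketch on a throwaway HTML snippet (the markup is made up for the example). It also shows what each family returns when nothing matches.

```python
import bs4

soup = bs4.BeautifulSoup("<ul><li>a</li><li>b</li></ul>", "html.parser")

# List-like results: every match, in document order.
many_a = soup.find_all("li")
many_b = soup.select("li")

# Single results: the first match only.
one_a = soup.find("li")
one_b = soup.select_one("li")

# No match: the list-returning calls give an empty result,
# the single-element calls give None.
missing_many = soup.find_all("p")
missing_one = soup.find("p")

print(len(many_a), len(many_b))
print(one_a.text, one_b.text)
print(len(missing_many), missing_one)
```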

import requests
import bs4
import pandas as pd

res = requests.get('https://country-code.cl/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'lxml')

data = []
for row in soup.select('.tablesorter.mark > tbody tr'):
    name = row.find('span', class_='sortkey').text
    country_code = row.select_one('td:nth-child(4)').text.replace('\n', '').strip()

    data.append({
        'name': name,
        'country_code': country_code,
    })

df = pd.DataFrame(data)
print(df)

Output:

                 name    country_code
0          afghanistan           AF
1        aland-islands           AX
2              albania           AL
3              algeria           DZ
4       american-samoa           AS
..                 ...          ...
244  wallis-and-futuna           WF
245     western-sahara           EH
246              yemen           YE
247             zambia           ZM
248           zimbabwe           ZW

[249 rows x 2 columns]

While .find() deals only with the first occurrence of an element, .select() / .find_all() will give you a ResultSet you can iterate over.

There are a lot of ways to reach your goal, but the basic pattern is mostly the same: select the rows of the table and iterate over them.

In this first case I selected the table by its id and, close to your initial approach, the <tr> elements by their id as well, using the CSS attribute selector [id^="row"], which matches any id attribute whose value starts with row. In addition, I used .stripped_strings to extract the text from each row, stored it in a list, and picked the values by index:

for row in soup.select('#countriesTable tr[id^="row"]'):
    row = list(row.stripped_strings)
    print(row[2], row[3])

or, more precisely, select all <tr> in the <tbody> of the element with id countriesTable:

for row in soup.select('#countriesTable tbody tr'):
    row = list(row.stripped_strings)
    print(row[2], row[3])

...
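To see why indices 2 and 3 are the right ones, it helps to look at what .stripped_strings yields for a row. This is a minimal sketch on assumed markup (a number cell, a flag cell, then name and code); the real page may have more or fewer text nodes per row, in which case the indices would shift.

```python
import bs4

# Hypothetical markup: the column order is an assumption for illustration.
html = """
<table id="countriesTable">
  <thead><tr><th>#</th><th>Flag</th><th>Name</th><th>ISO 2</th></tr></thead>
  <tbody>
    <tr id="row0"><td> 1 </td><td><span>flag</span></td><td> Afghanistan </td><td> AF </td></tr>
    <tr id="row1"><td> 2 </td><td><span>flag</span></td><td> Albania </td><td> AL </td></tr>
  </tbody>
</table>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

pairs = []
# [id^="row"] matches only rows whose id starts with "row",
# so the header row (which has no id) is skipped.
for row in soup.select('#countriesTable tr[id^="row"]'):
    # stripped_strings yields each non-empty text node with whitespace trimmed.
    cells = list(row.stripped_strings)
    pairs.append((cells[2], cells[3]))

print(pairs)
```

Index-based picking is compact but brittle: an empty cell contributes no string at all, so every later index moves down by one.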


An alternative, and in my opinion the best way to scrape tables, is pandas.read_html(), which relies on an HTML parser such as lxml or BeautifulSoup under the hood and does most of the work for you:

import pandas as pd
pd.read_html('https://country-code.cl/', attrs={'id':'countriesTable'})[0].dropna(axis=1, how='all').iloc[:-1,:]

or to get only the two specific columns:

pd.read_html('https://country-code.cl/', attrs={'id':'countriesTable'})[0].dropna(axis=1, how='all').iloc[:-1,[1,2]]

Output:

             Name  ISO 2
0     Afghanistan     AF
1   Åland Islands     AX
2         Albania     AL
3         Algeria     DZ
4  American Samoa     AS
5         Andorra     AD

...
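For experimenting offline, the attrs filter can be tried on an inline HTML string. This is a sketch under two assumptions: the markup is a made-up two-table document rather than the real page, and a parser backend for read_html (lxml, or BeautifulSoup with html5lib) is installed.

```python
import io

import pandas as pd

# Hypothetical document with two tables; only one has the id we want.
html = """
<table id="other"><tr><th>x</th></tr><tr><td>1</td></tr></table>
<table id="countriesTable">
  <thead><tr><th>Name</th><th>ISO 2</th></tr></thead>
  <tbody>
    <tr><td>Afghanistan</td><td>AF</td></tr>
    <tr><td>Albania</td><td>AL</td></tr>
  </tbody>
</table>
"""

# read_html returns a list of DataFrames, one per matching table;
# attrs narrows the match to tables with the given attributes.
tables = pd.read_html(io.StringIO(html), attrs={"id": "countriesTable"})
df = tables[0]
print(df)
```

One thing to watch for with country codes in particular: by default pandas treats the string NA as a missing value, so Namibia's code may come back as NaN unless keep_default_na=False is passed.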
