
BeautifulSoup - Converting string values into int inside of nested for loop then sort

I'm trying to figure out how to convert a string value into an int inside a for loop over scraped data, so that I can sort by that int ('views' in the script below).

Below is a condensed view of the problem, including a working script that returns the string, my failed attempt to fix the issue, and the desired output.

Working script that returns the string:

import requests  
from bs4 import BeautifulSoup  
import pprint

res = requests.get('https://www.searchenginejournal.com/category/news/')
soup = BeautifulSoup(res.text, 'html.parser')
links = soup.find_all('h2', class_='sej-ptitle')
subtext = soup.find_all('ul', class_='sej-meta-cells')


def sort_stories_by_views(sejlist):
    return sorted(sejlist, key=lambda k: k['views'], reverse=True)


def create_custom_sej(links, subtext):
    sej = []

    for idx, item in enumerate(links):
        title = links[idx].getText()
        href = links[idx].a.get('href', None)
        views = subtext[idx].find_all(
            'li')[2].text.strip().replace(' Reads', '')
        sej.append({'title': title, 'link': href, 'views': views})
    return sort_stories_by_views(sej)


create_custom_sej(links, subtext)
pprint.pprint(create_custom_sej(links, subtext))

The output of the script above contains dictionaries like this:

 {
'link': 'https://www.searchenginejournal.com/google-answers-if-site-section-can-impact-ranking-scores-of-
'title': 'Google Answers If Site Section Can Impact Ranking Score of Entire ''Site                ',
'views': '4.5K'
}

The desired output would be:

 {
'link': 'https://www.searchenginejournal.com/google-answers-if-site-section-can-impact-ranking-scores-of-
'title': 'Google Answers If Site Section Can Impact Ranking Score of Entire ''Site                ',
'views': '4500'
}

My failed attempt to fix the problem is below. The script returns a single value rather than the list of all applicable values, and I'm honestly not certain whether I'm going about this the right way.

import requests
from bs4 import BeautifulSoup
import pprint

res = requests.get('https://www.searchenginejournal.com/category/news/')
soup = BeautifulSoup(res.text, 'html.parser')
links = soup.find_all('h2', class_='sej-ptitle')
subtext = soup.find_all('ul', class_='sej-meta-cells')


def sort_stories_by_views(sejlist):
    return sorted(sejlist, key=lambda k: k['views'], reverse=True)


def create_custom_sej(links, subtext):
    sej = []

    for idx, item in enumerate(links):
        title = links[idx].getText()
        href = links[idx].a.get('href', None)
        views = subtext[idx].find_all(
            'li')[2].text.strip().replace(' Reads', '').replace(' min  read', '')
        # below is my unsuccessful attempt to change the strings to int
        for item in views:
            if views:
                multiplier = 1
                if views.endswith('K'):
                    multiplier = 1000
                    views = views[0:len(views)-1]
                return int(float(views) * multiplier)
            else:
                return views
        sej.append({'title': title, 'link': href, 'views': views})
    return sort_stories_by_views(sej)


create_custom_sej(links, subtext)
pprint.pprint(create_custom_sej(links, subtext))

Any help would be appreciated!

Thanks.
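
A likely reason the attempt returns a single value: the `return` statements sit inside the `for` loop, so `create_custom_sej` exits on the first story, before `sej.append` ever runs. The inner `for item in views` also walks over the characters of one string rather than over the stories. Below is a minimal sketch of the same conversion moved into a helper. The name `parse_views` and the sample values are assumptions for illustration, not taken from the site.

```python
def parse_views(views):
    """Convert a string such as '4.5K' or '870' to an int."""
    views = views.strip()
    if not views:
        return 0  # assumption: treat a missing count as zero
    multiplier = 1
    if views.endswith('K'):
        multiplier = 1000
        views = views[:-1]
    return int(float(views) * multiplier)


# Hypothetical sample data standing in for the scraped strings.
stories = [{'title': 'a', 'views': '4.5K'},
           {'title': 'b', 'views': '870'},
           {'title': 'c', 'views': '11K'}]

for story in stories:
    # Assign the converted value instead of returning from inside the loop.
    story['views'] = parse_views(story['views'])

# Sorting now compares ints, not strings.
stories.sort(key=lambda s: s['views'], reverse=True)
print([s['views'] for s in stories])  # [11000, 4500, 870]
```

Keeping the conversion in its own function means a `return` there ends only the conversion, not the loop that builds the list.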

You can try this code to convert the views to integers:

import requests  
from bs4 import BeautifulSoup  
import pprint

res = requests.get('https://www.searchenginejournal.com/category/news/')
soup = BeautifulSoup(res.text, 'html.parser')
links = soup.find_all('h2', class_='sej-ptitle')
subtext = soup.find_all('ul', class_='sej-meta-cells')


def convert(views):
    if 'K' in views:
        return int( float( views.split('K')[0] ) * 1000 )
    else:
        return int(views)

def sort_stories_by_views(sejlist):
    return sorted(sejlist, key=lambda k: k['views'], reverse=True)


def create_custom_sej(links, subtext):
    sej = []

    for idx, item in enumerate(links):
        title = links[idx].getText()
        href = links[idx].a.get('href', None)
        views = item.parent.find('i', class_='sej-meta-icon fa fa-eye')
        views = views.find_next(text=True).split()[0] if views else '0'
        sej.append({'title': title, 'link': href, 'views': convert(views)})
    return sort_stories_by_views(sej)


pprint.pprint(create_custom_sej(links, subtext))

Prints:

[{'link': 'https://www.searchenginejournal.com/microsoft-clarity-analytics/385867/',
  'title': 'Microsoft Announces Clarity – Free Website '
           'Analytics                ',
  'views': 11000},
 {'link': 'https://www.searchenginejournal.com/wordpress-5-6-feature-removed-for-subpar-experience/385414/',
  'title': 'WordPress 5.6 Feature Removed For Subpar '
           'Experience                ',
  'views': 7000},
 {'link': 'https://www.searchenginejournal.com/whatsapp-shopping-payment-customer-service/385362/',
  'title': 'WhatsApp Announces Shopping and Payment Tools for '
           'Businesses                ',
  'views': 6800},
 {'link': 'https://www.searchenginejournal.com/google-noindex-meta-tag-proper-use/385538/',
  'title': 'Google Shares How Noindex Meta Tag Can Cause '
           'Issues                ',
  'views': 6500},

...and so on.
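
The `convert` function above assumes each count is either a plain integer or ends in 'K'. If other forms ever appear, such as '1.2M' or '1,234' (an assumption, none are shown in the output above), a slightly more general helper could look like the sketch below. The name `to_int` and the `SUFFIXES` table are hypothetical.

```python
# Hypothetical suffix table; extend it if other abbreviations appear.
SUFFIXES = {'K': 1_000, 'M': 1_000_000}


def to_int(views):
    """Convert strings like '4.5K', '1.2M', '1,234' or '870' to an int."""
    text = views.strip().replace(',', '').upper()
    if not text:
        return 0  # assumption: an empty count means no views
    multiplier = SUFFIXES.get(text[-1], 1)
    if multiplier != 1:
        text = text[:-1]  # drop the suffix before parsing the number
    # round() guards against truncation caused by floating-point error.
    return round(float(text) * multiplier)


print(to_int('4.5K'), to_int('1.2M'), to_int('1,234'), to_int('870'))
```

`round` is used instead of `int` because `int` truncates toward zero, so a product that lands a hair below a whole number would lose one view.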
