BeautifulSoup - Converting string values into int inside of nested for loop then sort

I'm trying to figure out how to convert a string value into an int inside a for loop over scraped data, so that I can sort by that int ('views' in the script below).

Below is a condensed view of the problem, including a working script that returns the string, my failed attempt to fix the issue, and the desired output.

Working script that returns the string:

import requests  
from bs4 import BeautifulSoup  
import pprint

res = requests.get('https://www.searchenginejournal.com/category/news/')
soup = BeautifulSoup(res.text, 'html.parser')
links = soup.find_all('h2', class_='sej-ptitle')
subtext = soup.find_all('ul', class_='sej-meta-cells')


def sort_stories_by_views(sejlist):
    return sorted(sejlist, key=lambda k: k['views'], reverse=True)


def create_custom_sej(links, subtext):
    sej = []

    for idx, item in enumerate(links):
        title = links[idx].getText()
        href = links[idx].a.get('href', None)
        views = subtext[idx].find_all(
            'li')[2].text.strip().replace(' Reads', '')
        sej.append({'title': title, 'link': href, 'views': views})
    return sort_stories_by_views(sej)


create_custom_sej(links, subtext)
pprint.pprint(create_custom_sej(links, subtext))

The output of the above contains dictionaries that look like this:

 {
'link': 'https://www.searchenginejournal.com/google-answers-if-site-section-can-impact-ranking-scores-of-
'title': 'Google Answers If Site Section Can Impact Ranking Score of Entire ''Site                ',
'views': '4.5K'
}

The desired output would be:

 {
'link': 'https://www.searchenginejournal.com/google-answers-if-site-section-can-impact-ranking-scores-of-
'title': 'Google Answers If Site Section Can Impact Ranking Score of Entire ''Site                ',
'views': '4500'
}

My failed attempt to fix the problem is below. The script returns a single value rather than the list of all applicable values, but I'm honestly not certain I'm going about this the correct way.

import requests
from bs4 import BeautifulSoup
import pprint

res = requests.get('https://www.searchenginejournal.com/category/news/')
soup = BeautifulSoup(res.text, 'html.parser')
links = soup.find_all('h2', class_='sej-ptitle')
subtext = soup.find_all('ul', class_='sej-meta-cells')


def sort_stories_by_views(sejlist):
    return sorted(sejlist, key=lambda k: k['views'], reverse=True)


def create_custom_sej(links, subtext):
    sej = []

    for idx, item in enumerate(links):
        title = links[idx].getText()
        href = links[idx].a.get('href', None)
        views = subtext[idx].find_all(
            'li')[2].text.strip().replace(' Reads', '').replace(' min  read', '')
# below is my unsuccessful attempt to change the strings to int
        for item in views:
            if views:
                multiplier = 1
                if views.endswith('K'):
                    multiplier = 1000
                    views = views[0:len(views)-1]
                return int(float(views) * multiplier)
            else:
                return views
        sej.append({'title': title, 'link': href, 'views': views})
    return sort_stories_by_views(sej)


create_custom_sej(links, subtext)
pprint.pprint(create_custom_sej(links, subtext))

Any help would be appreciated!

Thanks.
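
The attempt above returns a single value because its `return` statements sit inside the loop body, so `create_custom_sej` exits on the first story. In addition, `for item in views` iterates over the characters of one string, not over the stories. A minimal sketch of one way around this, using hypothetical sample strings in place of the scraped page and a helper named `to_int` that is not part of the original script:

```python
def to_int(views):
    """Convert a string such as '4.5K' or '870' to an int."""
    multiplier = 1
    if views.endswith('K'):
        multiplier = 1000
        views = views[:-1]  # drop the trailing 'K'
    return int(float(views) * multiplier)


# Hypothetical sample values standing in for the scraped 'views' strings.
raw_views = ['4.5K', '870', '11K']

stories = []
for views in raw_views:
    # The return now lives inside to_int, so this loop visits every item.
    stories.append({'views': to_int(views)})

# Sorting on ints gives numeric order; sorting the raw strings would not.
stories.sort(key=lambda k: k['views'], reverse=True)
print(stories)
```

Keeping the conversion in its own function means the loop only appends, and the single `return` of the sorted list stays at the end of `create_custom_sej`.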

You can try this code to convert the views to integers:

import requests  
from bs4 import BeautifulSoup  
import pprint

res = requests.get('https://www.searchenginejournal.com/category/news/')
soup = BeautifulSoup(res.text, 'html.parser')
links = soup.find_all('h2', class_='sej-ptitle')
subtext = soup.find_all('ul', class_='sej-meta-cells')


def convert(views):
    if 'K' in views:
        return int( float( views.split('K')[0] ) * 1000 )
    else:
        return int(views)

def sort_stories_by_views(sejlist):
    return sorted(sejlist, key=lambda k: k['views'], reverse=True)


def create_custom_sej(links, subtext):
    sej = []

    for idx, item in enumerate(links):
        title = links[idx].getText()
        href = links[idx].a.get('href', None)
        views = item.parent.find('i', class_='sej-meta-icon fa fa-eye')
        views = views.find_next(text=True).split()[0] if views else '0'
        sej.append({'title': title, 'link': href, 'views': convert(views)})
    return sort_stories_by_views(sej)


create_custom_sej(links, subtext)
pprint.pprint(create_custom_sej(links, subtext))

Prints:

[{'link': 'https://www.searchenginejournal.com/microsoft-clarity-analytics/385867/',
  'title': 'Microsoft Announces Clarity – Free Website '
           'Analytics                ',
  'views': 11000},
 {'link': 'https://www.searchenginejournal.com/wordpress-5-6-feature-removed-for-subpar-experience/385414/',
  'title': 'WordPress 5.6 Feature Removed For Subpar '
           'Experience                ',
  'views': 7000},
 {'link': 'https://www.searchenginejournal.com/whatsapp-shopping-payment-customer-service/385362/',
  'title': 'WhatsApp Announces Shopping and Payment Tools for '
           'Businesses                ',
  'views': 6800},
 {'link': 'https://www.searchenginejournal.com/google-noindex-meta-tag-proper-use/385538/',
  'title': 'Google Shares How Noindex Meta Tag Can Cause '
           'Issues                ',
  'views': 6500},

...and so on.
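
If some cells hold something other than a read count (the question's attempt strips ' min  read', which suggests this can happen), `int(views)` in `convert` would raise a `ValueError`. Below is a hedged sketch of a more tolerant parser. The name `parse_count`, the 'M' suffix, and the comma handling are assumptions for illustration; only 'K' appears in the original post.

```python
import re

# Assumed suffixes; only 'K' is confirmed by the original post.
_SUFFIXES = {'K': 1_000, 'M': 1_000_000}


def parse_count(text, default=0):
    """Parse '4.5K', '1,204' or '2M' into an int; return default otherwise."""
    match = re.fullmatch(r'\s*([\d.,]+)\s*([KkMm]?)\s*', text)
    if match is None:
        # Not a count at all, e.g. '5 min read' or an empty string.
        return default
    number, suffix = match.groups()
    try:
        value = float(number.replace(',', ''))
    except ValueError:
        # Digits and separators in an unparseable arrangement, e.g. '1.2.3'.
        return default
    # round() avoids truncating a float result that lands just below
    # the intended whole number.
    return round(value * _SUFFIXES.get(suffix.upper(), 1))


print(parse_count('4.5K'))        # 4500
print(parse_count('5 min read'))  # 0
```

In the loop above, `parse_count(views)` could stand in for `convert(views)`. Stories without a usable count then sort to the end rather than stopping the script.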
