BeautifulSoup - Converting string values into int inside of nested for loop then sort
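I'm trying to figure out how to convert string values to int inside a for loop over scraped results, so that I can sort by that int ('views' in the script below).
Below is a brief overview of the problem: a working script that returns strings, my failed attempt at solving the problem, and the desired output.
Working script that returns strings: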
import requests
from bs4 import BeautifulSoup
import pprint

res = requests.get('https://www.searchenginejournal.com/category/news/')
soup = BeautifulSoup(res.text, 'html.parser')
links = soup.find_all('h2', class_='sej-ptitle')
subtext = soup.find_all('ul', class_='sej-meta-cells')


def sort_stories_by_views(sejlist):
    return sorted(sejlist, key=lambda k: k['views'], reverse=True)


def create_custom_sej(links, subtext):
    sej = []
    for idx, item in enumerate(links):
        title = links[idx].getText()
        href = links[idx].a.get('href', None)
        views = subtext[idx].find_all(
            'li')[2].text.strip().replace(' Reads', '')
        sej.append({'title': title, 'link': href, 'views': views})
    return sort_stories_by_views(sej)


create_custom_sej(links, subtext)
pprint.pprint(create_custom_sej(links, subtext))
The output of the script above contains dictionaries that look like this:
{
'link': 'https://www.searchenginejournal.com/google-answers-if-site-section-can-impact-ranking-scores-of-
'title': 'Google Answers If Site Section Can Impact Ranking Score of Entire ''Site ',
'views': '4.5K'
}
The desired output is:
{
'link': 'https://www.searchenginejournal.com/google-answers-if-site-section-can-impact-ranking-scores-of-
'title': 'Google Answers If Site Section Can Impact Ranking Score of Entire ''Site ',
'views': '4500'
}
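My failed attempt at solving the problem is below. The script returns a single value rather than a list of all applicable values, but honestly I'm not sure whether I'm going about this the right way.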
import requests
from bs4 import BeautifulSoup
import pprint

res = requests.get('https://www.searchenginejournal.com/category/news/')
soup = BeautifulSoup(res.text, 'html.parser')
links = soup.find_all('h2', class_='sej-ptitle')
subtext = soup.find_all('ul', class_='sej-meta-cells')


def sort_stories_by_views(sejlist):
    return sorted(sejlist, key=lambda k: k['views'], reverse=True)


def create_custom_sej(links, subtext):
    sej = []
    for idx, item in enumerate(links):
        title = links[idx].getText()
        href = links[idx].a.get('href', None)
        views = subtext[idx].find_all(
            'li')[2].text.strip().replace(' Reads', '').replace(' min read', '')
        # below is my unsuccessful attempt to change the strings to int
        for item in views:
            if views:
                multiplier = 1
                if views.endswith('K'):
                    multiplier = 1000
                    views = views[0:len(views)-1]
                return int(float(views) * multiplier)
            else:
                return views
        sej.append({'title': title, 'link': href, 'views': views})
    return sort_stories_by_views(sej)


create_custom_sej(links, subtext)
pprint.pprint(create_custom_sej(links, subtext))
Any help would be greatly appreciated!
Thanks.
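A likely reason the attempt returns a single value: the `return` statements sit inside the loop in `create_custom_sej`, so the whole function exits while handling the first story and never reaches `sej.append(...)`. In addition, `for item in views` iterates over the characters of a string, which is probably not what was intended. The sketch below reproduces that control-flow problem on a plain list of strings, with no scraping involved. The function names `convert_all_broken` and `convert_all` are hypothetical and exist only for this illustration.

```python
def convert_all_broken(values):
    # Mirrors the structure of the attempt: `return` is inside the loop,
    # so the function exits after processing the first item.
    for v in values:
        multiplier = 1
        if v.endswith('K'):
            multiplier = 1000
            v = v[:-1]
        return int(float(v) * multiplier)


def convert_all(values):
    # Collect every converted value and return once, after the loop ends.
    result = []
    for v in values:
        multiplier = 1
        if v.endswith('K'):
            multiplier = 1000
            v = v[:-1]
        result.append(int(float(v) * multiplier))
    return result


sample = ['4.5K', '900', '11K']
print(convert_all_broken(sample))  # 4500
print(convert_all(sample))         # [4500, 900, 11000]
```

Moving the conversion into its own small function that returns one converted value, and calling it once per story, avoids returning from the outer function early.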
You can try this code to convert the views to integers:
import requests
from bs4 import BeautifulSoup
import pprint

res = requests.get('https://www.searchenginejournal.com/category/news/')
soup = BeautifulSoup(res.text, 'html.parser')
links = soup.find_all('h2', class_='sej-ptitle')
subtext = soup.find_all('ul', class_='sej-meta-cells')


def convert(views):
    if 'K' in views:
        return int( float( views.split('K')[0] ) * 1000 )
    else:
        return int(views)


def sort_stories_by_views(sejlist):
    return sorted(sejlist, key=lambda k: k['views'], reverse=True)


def create_custom_sej(links, subtext):
    sej = []
    for idx, item in enumerate(links):
        title = links[idx].getText()
        href = links[idx].a.get('href', None)
        views = item.parent.find('i', class_='sej-meta-icon fa fa-eye')
        views = views.find_next(text=True).split()[0] if views else '0'
        sej.append({'title': title, 'link': href, 'views': convert(views)})
    return sort_stories_by_views(sej)


create_custom_sej(links, subtext)
pprint.pprint(create_custom_sej(links, subtext))
Prints:
[{'link': 'https://www.searchenginejournal.com/microsoft-clarity-analytics/385867/',
'title': 'Microsoft Announces Clarity – Free Website '
'Analytics ',
'views': 11000},
{'link': 'https://www.searchenginejournal.com/wordpress-5-6-feature-removed-for-subpar-experience/385414/',
'title': 'WordPress 5.6 Feature Removed For Subpar '
'Experience ',
'views': 7000},
{'link': 'https://www.searchenginejournal.com/whatsapp-shopping-payment-customer-service/385362/',
'title': 'WhatsApp Announces Shopping and Payment Tools for '
'Businesses ',
'views': 6800},
{'link': 'https://www.searchenginejournal.com/google-noindex-meta-tag-proper-use/385538/',
'title': 'Google Shares How Noindex Meta Tag Can Cause '
'Issues ',
'views': 6500},
...and so on.
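If the page ever shows counts in other shapes, the conversion step could be made a little more tolerant. The following is only a sketch under assumptions: the helper name `parse_count`, the 'M' suffix, and the thousands-separator handling are not taken from the site or from the code above, and the sample data is made up so the example runs without a network request.

```python
import re

# Hypothetical helper; the suffix table and regex are assumptions.
_SUFFIXES = {'K': 1_000, 'M': 1_000_000}
# A number, optionally followed directly by K or M, and not by more letters.
_COUNT_RE = re.compile(r'^\s*([\d.,]+)([KkMm]?)(?![A-Za-z])')


def parse_count(text, default=0):
    """Turn strings such as '4.5K' or '1,250' into an int, else `default`."""
    match = _COUNT_RE.match(text or '')
    if not match:
        return default
    number, suffix = match.groups()
    try:
        value = float(number.replace(',', ''))
    except ValueError:
        return default
    # Round instead of truncating, in case the float product lands
    # slightly below the intended integer.
    return int(round(value * _SUFFIXES.get(suffix.upper(), 1)))


# Made-up sample records standing in for the scraped stories.
stories = [
    {'title': 'A', 'views': '4.5K'},
    {'title': 'B', 'views': '11K Reads'},
    {'title': 'C', 'views': '870'},
    {'title': 'D', 'views': '1,250'},
]

ranked = sorted(stories, key=lambda s: parse_count(s['views']), reverse=True)
print([s['title'] for s in ranked])  # ['B', 'A', 'D', 'C']
```

Passing the parser as the sort key leaves the original strings in place for display. Storing the parsed integer in the dictionary, as the code above does, works equally well when the number itself is the desired output.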