
Scrape data from a link in a webpage using Beautiful Soup (Python)

I am trying to scrape data (insta id, average likes, average comments) from a URL inside this webpage: https://starngage.com/app/global/influencer/ranking/india

The element id of the URL is: @priyankachopra

Similarly, I want to scrape data from all 1000 profiles in the same table.

Can someone tell me how to do this?

import requests
from bs4 import BeautifulSoup
from prettytable import PrettyTable

tb = PrettyTable(['Name', 'Insta_ID', 'Followers'])
url = 'https://starngage.com/app/global/influencer/ranking/india'
resp = requests.get(url)

soup = BeautifulSoup(resp.text, 'html.parser')

# The ranking is rendered as an HTML table; skip the header row.
table = soup.find('table', class_='table-responsive-sm')
rows = table.findAll('tr')

for row in rows[1:]:
    # The third column holds "Name @insta_id", the sixth the follower count.
    temp = row.select_one("td:nth-of-type(3)").text
    name, insta_id = temp.split('@')
    followers = row.select_one("td:nth-of-type(6)").text
    tb.add_row([name.strip(), insta_id.strip(), followers.strip()])

print(tb)

You can do this. I haven't tested the complete code end-to-end because it takes a long time (it may take up to 10 minutes), but I have tested each part and it works perfectly fine for me. If it doesn't work, ask me in a comment. Here's the code:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

ids = []   # Instagram handles scraped from the ranking table
avgc = []  # average comments per profile
avgl = []  # average likes per profile
for i in range(1, 101):  # walk through ranking pages 1-100 (1000 profiles total)
    url = f'https://starngage.com/app/global/influencer/ranking/india?page={i}'
    print(url)
    resp = requests.get(url)
    
    soup = BeautifulSoup(resp.text, 'lxml')
    
    table = soup.find('table', class_='table-responsive-sm')
    trs = table.findAll('tr')
    
    # Skip the header row; the third column holds "Name @insta_id".
    for tr in trs[1:]:
        temp = tr.select_one("td:nth-of-type(3)").text
        _, insta_id = temp.split('@')
        ids.append(insta_id.strip())

for insta_id in ids:
    page = requests.get("https://starngage.com/app/global/influencers/" + insta_id)
    soup = BeautifulSoup(page.content, 'lxml')

    # The profile page embeds the average likes and average comments as
    # "is <number>" phrases inside a <blockquote><p> element.
    x = soup.find("blockquote").find("p").text.strip()
    # You can change this regex; I am not very familiar with re, so if you
    # find a better approach, please comment.
    x = re.findall(r"is \d+", x)
    avl, avc = list(map(lambda y: y.replace("is ", ""), x))
    avgl.append(avl)
    avgc.append(avc)

df = pd.DataFrame({"Insta Id": ids, "Average Likes": avgl, "Average Comments": avgc})

print(df)

df.to_csv("test.csv")
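
One possible refinement, not part of the original answer: the per-profile loop above raises an AttributeError whenever a profile page lacks the expected blockquote, and firing roughly 1000 back-to-back requests may get throttled. Below is a minimal sketch that reuses the imports above plus time, with a hypothetical scrape_profile helper that skips broken pages and pauses between requests; it assumes the same page structure the answer relies on.

import time

def scrape_profile(insta_id):
    # Return (average_likes, average_comments) for one handle, or None on failure.
    page = requests.get("https://starngage.com/app/global/influencers/" + insta_id)
    soup = BeautifulSoup(page.content, 'lxml')
    quote = soup.find("blockquote")
    if quote is None or quote.find("p") is None:
        return None  # layout differs or the profile page is missing
    numbers = re.findall(r"is (\d+)", quote.find("p").text)
    if len(numbers) < 2:
        return None
    return numbers[0], numbers[1]

rows = []
for handle in ids:
    result = scrape_profile(handle)
    if result is not None:
        rows.append({"Insta Id": handle,
                     "Average Likes": result[0],
                     "Average Comments": result[1]})
    time.sleep(1)  # be polite to the server between requests

df = pd.DataFrame(rows)
df.to_csv("test.csv", index=False)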
