Transform data that I scraped with BeautifulSoup
This is my first post on Stack Overflow.
I am trying to scrape a website, but I don't know how to turn the result I get into a dataframe so that it is readable.
from selenium import webdriver
from selenium_stealth import stealth
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
import requests

# Raw string, so the backslashes in the Windows path are not read as escapes
driver_path = r'C:\Program Files\chromedriver.exe'
options = webdriver.ChromeOptions()
# Selenium 4 style: the driver path goes through a Service object
driver = webdriver.Chrome(service=Service(driver_path), options=options)
stealth(driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
        )
url = 'https://data.similarweb.com/api/v1/data?domain=manomano.fr'
dic = {"test":[]}
# Load the page in the browser and parse the rendered source
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
The result I get looks like this:
<html><head><meta content="light dark" name="color-scheme"/></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{"SiteName":"manomano.fr","Description":"manomano : tous vos produits de bricolage, rénovation et jardinage au meilleur prix","TopCountryShares":[{"Value":0.9582353790247653,"Country":250},{"Value":0.01428431578726121,"Country":56},{"Value":0.00360626497244031,"Country":756},{"Value":0.001907518367836589,"Country":124},{"Value":0.0016671906973079764,"Country":638}],"Title":"manomano : achat en ligne bricolage, rénovation et jardinage","Engagments":{"BounceRate":"0.39632307436566677","Month":"07","Year":"2022","PagePerVisit":"5.013184701586454","Visits":"1.710036669747373E7","TimeOnSite":"282.97140337977"},"EstimatedMonthlyVisits":{"2022-02-01":18289643,"2022-03-01":20571776,"2022-04-01":19341861,"2022-05-01":21415927,"2022-06-01":18153351,"2022-07-01":17100366},"GlobalRank":{"Rank":2656},"CountryRank":{"Country":250,"Rank":91},"IsSmall":false,"TrafficSources":{"Social":0.00617010722152418,"Paid Referrals":0.03439823545252397,"Mail":0.014748024044393673,"Referrals":0.026006210925393708,"Search":0.6444821136549667,"Direct":0.27419530870119785},"Category":"Home_and_Garden/Home_and_Garden","CategoryRank":{"Rank":"15","Category":"Home_and_Garden/Home_and_Garden"},"LargeScreenshot":"https://site-images.similarcdn.com/image?url=manomano.fr&t=1&h=2a480fb8d6d2298ffef39594ad8d71d65f5dbf8cba53179589d0c69e6aa3fd67"}</pre></body></html>
Any idea how to convert this data into something readable (for example, a dataframe)?
Thanks for your help, Sankar
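For reference, the JSON in the output above sits as plain text inside the single <pre> element, so it can be pulled out of the parsed page and decoded. The following is only a minimal sketch under that assumption: `page_source` is a shortened stand-in for `driver.page_source` (only a few of the fields shown above are kept), and `raw_json` and `data` are names chosen here for illustration.

```python
import json

import pandas as pd
from bs4 import BeautifulSoup

# Stand-in for driver.page_source, shortened from the output shown above
page_source = (
    '<html><head><meta content="light dark" name="color-scheme"/></head>'
    '<body><pre style="word-wrap: break-word; white-space: pre-wrap;">'
    '{"SiteName":"manomano.fr","GlobalRank":{"Rank":2656},'
    '"CountryRank":{"Country":250,"Rank":91},"IsSmall":false}'
    '</pre></body></html>'
)

soup = BeautifulSoup(page_source, "html.parser")

# The browser wraps the raw API response in <pre>; its text is the JSON document
raw_json = soup.find("pre").get_text()
data = json.loads(raw_json)

# Flatten nested objects into dotted column names such as "GlobalRank.Rank"
df = pd.json_normalize(data)
print(df.T)
```

If the page ever came back without a <pre> element (an error page, for instance), `soup.find("pre")` would return None, so a real script would want to check for that before calling `get_text()`.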
You don't necessarily need Selenium for this; it can be done with requests and pandas:
import requests
import pandas as pd
header = {
    'Content-Type': 'application/json',
    'Accept': 'application/json',
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'
}
r = requests.get('https://data.similarweb.com/api/v1/data?domain=manomano.fr', headers=header)
# print(r.json())
df = pd.json_normalize(r.json())
print(df)
This returns the following dataframe (if you want, you can also extract other parts of the JSON response and convert them into dataframes of their own):
SiteName Description TopCountryShares Title IsSmall Category LargeScreenshot Engagments.BounceRate Engagments.Month Engagments.Year ... CountryRank.Country CountryRank.Rank TrafficSources.Social TrafficSources.Paid Referrals TrafficSources.Mail TrafficSources.Referrals TrafficSources.Search TrafficSources.Direct CategoryRank.Rank CategoryRank.Category
0 manomano.fr manomano : tous vos produits de bricolage, rén... [{'Value': 0.9582353790247653, 'Country': 250}... manomano : achat en ligne bricolage, rénovatio... False Home_and_Garden/Home_and_Garden https://site-images.similarcdn.com/image?url=m... 0.39632307436566677 07 2022 ... 250 91 0.00617 0.034398 0.014748 0.026006 0.644482 0.274195 15 Home_and_Garden/Home_and_Garden
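As the output shows, TopCountryShares stays a list of dicts inside a single cell. Pulling such nested parts into their own dataframes could look roughly like the sketch below. To keep it runnable offline, `data` is a trimmed copy of the JSON from the question standing in for `r.json()`; the names `countries` and `visits` and the column labels "Month" and "Visits" are illustrative choices, not part of the API.

```python
import pandas as pd

# Trimmed copy of the JSON from the question; in practice this would be r.json()
data = {
    "SiteName": "manomano.fr",
    "TopCountryShares": [
        {"Value": 0.9582353790247653, "Country": 250},
        {"Value": 0.01428431578726121, "Country": 56},
        {"Value": 0.00360626497244031, "Country": 756},
    ],
    "EstimatedMonthlyVisits": {
        "2022-05-01": 21415927,
        "2022-06-01": 18153351,
        "2022-07-01": 17100366,
    },
}

# One row per entry of the nested list, with SiteName repeated on every row
countries = pd.json_normalize(data, record_path="TopCountryShares", meta=["SiteName"])

# The date -> visits mapping becomes a two-column frame
visits = (
    pd.Series(data["EstimatedMonthlyVisits"], name="Visits")
    .rename_axis("Month")
    .reset_index()
)
visits["Month"] = pd.to_datetime(visits["Month"])

print(countries)
print(visits)
```

Keeping these as separate frames avoids repeating the site-level columns for every country or month; they can still be joined back on SiteName if a single table is needed.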