Web scraping with Python

Question

I'm trying to get data for a list of companies (currently testing with only one) from a website. I can't work out how to get the score I want, because I can only find the formatting part of the page rather than the actual data. Could someone please help?

from selenium import webdriver
import time
from selenium.webdriver.support.select import Select

driver = webdriver.Chrome(executable_path=r'C:\webdrivers\chromedriver.exe')

driver.get('https://www.refinitiv.com/en/sustainable-finance/esg-scores')

driver.maximize_window()
time.sleep(1)

# Dismiss the cookie banner if it is present
try:
    cookie = driver.find_element("xpath", '//button[@id="onetrust-accept-btn-handler"]')
    cookie.click()
except Exception:
    pass

company_name=driver.find_element("id",'searchInput-1')
company_name.click()
company_name.send_keys('Jumbo SA')
time.sleep(1)

search=driver.find_element("xpath", '//button[@class="SearchInput-searchButton"]')
search.click()
time.sleep(2)

company_score = driver.find_elements("xpath",'//div[@class="fiscal-year"]')

print(company_score)

That's what I have so far. I want the number 42 to come back in my results, but instead I get the following:

[<selenium.webdriver.remote.webelement.WebElement (session="bffa2fe80dd3785618b5c52d7087096d", element="62eaf2a8-d1a2-4741-8374-c0f970dfcbfe")>]

My issue is that the locator is not working.

//div[@class="fiscal-year"] is the part I think is wrong, but I am not sure what I need to pick from the website.

[Website screenshot]

Answer 1

Use the requests library instead. Look at this example:

import requests

url = "https://www.refinitiv.com/bin/esg/esgsearchsuggestions"

# Fetch the list of companies and their RIC codes
response = requests.get(url)

print(response.text)

This returns something like this:

[
    {
        "companyName": "GEK TERNA Holdings Real Estate Construction SA",
        "ricCode": "HRMr.AT"
    },
    {
        "companyName": "Mytilineos SA",
        "ricCode": "MYTr.AT"
    },
    {
        "companyName": "Hellenic Telecommunications Organization SA",
        "ricCode": "OTEr.AT"
    },
    {
        "companyName": "Jumbo SA",
        "ricCode": "BABr.AT"
    },
    {
        "companyName": "Folli Follie Commercial Manufacturing and Technical SA",
        "ricCode": "HDFr.AT"
    },
    ...
]
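
As a minimal sketch of the lookup step, assuming the suggestions endpoint returns a list shaped like the sample above, the RIC code for a given company name could be found like this. The helper name `find_ric_code` and the case-insensitive matching are illustrative choices, not part of the site's API.

```python
import json

# Sample shaped like the response above (shortened); in practice this
# would come from response.json().
sample = json.loads("""
[
    {"companyName": "Mytilineos SA", "ricCode": "MYTr.AT"},
    {"companyName": "Jumbo SA", "ricCode": "BABr.AT"}
]
""")


def find_ric_code(companies, name):
    """Return the ricCode of the first entry whose companyName matches
    `name` (case-insensitive), or None if there is no match."""
    wanted = name.strip().lower()
    for entry in companies:
        if entry.get("companyName", "").strip().lower() == wanted:
            return entry.get("ricCode")
    return None


print(find_ric_code(sample, "Jumbo SA"))  # BABr.AT
```

Returning None for an unknown name lets the caller decide whether to skip the company or report it.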

Here we can see each company name and the code behind it, so for Jumbo SA it is BABr.AT. Now, with this information, let's get the data:

import requests

url = "https://www.refinitiv.com/bin/esg/esgsearchresult"

querystring = {"ricCode": "BABr.AT"}  # supply the company's RIC code

headers = {"cookie": "encaddr=NeVecfNa7%2FR1rLeYOqY57g%3D%3D"}

response = requests.get(url, headers=headers, params=querystring)

print(response.text)

Now we see the response is in JSON:

{
    "industryComparison": {
        "industryType": "Specialty Retailers",
        "scoreYear": "2020",
        "rank": "162",
        "totalIndustries": "281"
    },
    "esgScore": {
        "TR.TRESGCommunity": {
            "score": 24,
            "weight": 0.13
        },
        "TR.TRESGInnovation": {
            "score": 9,
            "weight": 0.05
        },
        "TR.TRESGHumanRights": {
            "score": 31,
            "weight": 0.08
        },
        "TR.TRESGShareholders": {
            "score": 98,
            "weight": 0.08
        },
        "TR.SocialPillar": {
            "score": 43,
            "weight": 0.42999998
        },
        "TR.TRESGEmissions": {
            "score": 19,
            "weight": 0.08
        },
        "TR.TRESGManagement": {
            "score": 47,
            "weight": 0.26
        },
        "TR.GovernancePillar": {
            "score": 53,
            "weight": 0.38999998569488525
        },
        "TR.TRESG": {
            "score": 42,
            "weight": 1
        },
        "TR.TRESGWorkforce": {
            "score": 52,
            "weight": 0.1
        },
        "TR.EnvironmentPillar": {
            "score": 20,
            "weight": 0.19
        },
        "TR.TRESGResourceUse": {
            "score": 30,
            "weight": 0.06
        },
        "TR.TRESGProductResponsibility": {
            "score": 62,
            "weight": 0.12
        },
        "TR.TRESGCSRStrategy": {
            "score": 17,
            "weight": 0.05
        }
    }
}
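
The overall score the question asks for (42) sits under the `TR.TRESG` key. Here is a minimal sketch of pulling it out, assuming the response keeps the shape shown above; `overall_score` is a hypothetical helper name.

```python
import json

# Shortened copy of the response above; in practice use response.json().
data = json.loads("""
{
    "industryComparison": {"industryType": "Specialty Retailers", "scoreYear": "2020"},
    "esgScore": {
        "TR.TRESG": {"score": 42, "weight": 1},
        "TR.EnvironmentPillar": {"score": 20, "weight": 0.19}
    }
}
""")


def overall_score(payload):
    """Return the overall ESG score, or None if the key is missing."""
    return payload.get("esgScore", {}).get("TR.TRESG", {}).get("score")


print(overall_score(data))  # 42
```

The chained `.get()` calls avoid a KeyError if a company has no score in the response.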

Now you can get the data you want without using Selenium. This way it's faster, easier and better.
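
Since the question mentions a list of companies, the two requests in this answer could be combined in a loop. This is only a sketch under assumptions: `fetch_json` is a hypothetical callable that takes a URL and query parameters and returns parsed JSON, injected so the logic can be tried without network access. The fake fetcher below is a stand-in that replays the sample data from this answer.

```python
SUGGESTIONS_URL = "https://www.refinitiv.com/bin/esg/esgsearchsuggestions"
RESULT_URL = "https://www.refinitiv.com/bin/esg/esgsearchresult"


def scores_for(names, fetch_json):
    """Map each company name to its overall ESG score (None if not found)."""
    companies = fetch_json(SUGGESTIONS_URL, {})
    ric_by_name = {c["companyName"]: c["ricCode"] for c in companies}
    results = {}
    for name in names:
        ric = ric_by_name.get(name)
        if ric is None:
            # Company is not in the suggestions list
            results[name] = None
            continue
        payload = fetch_json(RESULT_URL, {"ricCode": ric})
        results[name] = payload.get("esgScore", {}).get("TR.TRESG", {}).get("score")
    return results


def fake_fetch_json(url, params):
    """Offline stand-in (hypothetical) that replays the sample data."""
    if url == SUGGESTIONS_URL:
        return [{"companyName": "Jumbo SA", "ricCode": "BABr.AT"}]
    if params.get("ricCode") == "BABr.AT":
        return {"esgScore": {"TR.TRESG": {"score": 42, "weight": 1}}}
    return {}


print(scores_for(["Jumbo SA", "No Such Co"], fake_fetch_json))
```

With requests installed, a real fetcher could be something like `lambda url, params: requests.get(url, params=params).json()`. Whether the cookie header from the example above is still needed is something to check against the live site.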

Please accept this as an answer.

Answer 1
0 Accepted 2022-12-19 08:39:33
