
Can't find div class while webscraping on Python using selenium

I'll preface by saying that I've seen similar questions, but none of the solutions worked for me.

So I'm looking for a specific class in my HTML page, but I always get a None value returned. I've seen a few posts on here describing the same problem, but none of the solutions have worked for me. Here are my attempts - I'm looking for the player tags with their names, i.e. 'Chase Young'.

from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import requests

url = ("https://www.nfl.com/draft/tracker/prospects/allPositions?"
       "college=allColleges&page=1&status=ALL&year=2020")

page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')
match = soup.find('div', class_='css-gu7inl')
print(match)
# Prints None

I tried another method to find the match; it still returned None:

match = soup.find("div", {"class": "css-gu7inl"})  # match is None

It appears that the HTML file does not contain all of the webpage, so I tried using Selenium, as I've seen recommended on similar posts, and still got nothing:

driver = webdriver.Chrome("chromedriver")
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')
items=soup.select(".css-gu7inl")
print(items) # Empty list

What am I doing wrong here?

The data is rendered by JavaScript, so induce WebDriverWait() and wait for the elements to be visible using visibility_of_all_elements_located():

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

url='https://www.nfl.com/draft/tracker/prospects/allPositions?college=allColleges&page=1&status=ALL&year=2020'
driver = webdriver.Chrome()
driver.get(url)
WebDriverWait(driver,20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,'.css-gu7inl')))
soup = BeautifulSoup(driver.page_source, 'lxml')
items=soup.select(".css-gu7inl")
Players=[item.select_one('a.css-1fwlqa').text for item in items]
print(Players) 

Output:

['chase young', 'jeff okudah', 'derrick brown', 'isaiah simmons', 'joe burrow', "k'lavon chaisson", 'jedrick wills', 'tua tagovailoa', 'ceedee lamb', 'jerry jeudy', "d'andre swift", 'c.j. henderson', 'mekhi becton', 'mekhi becton', 'patrick queen', 'henry ruggs iii', 'henry ruggs iii', 'javon kinlaw', 'laviska shenault jr.', 'yetur gross-matos']
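Since the question already imports pandas, the scraped names can be loaded into a DataFrame for further analysis. A minimal sketch using a short sample of the names above (the sample list and column name are illustrative assumptions):

```python
import pandas as pd

# A short sample of the player names scraped above
players = ['chase young', 'jeff okudah', 'derrick brown']

# Build a one-column DataFrame and title-case the names for readability
df = pd.DataFrame({'player': players})
df['player'] = df['player'].str.title()
print(df)
```

This keeps the scraping step (Selenium + BeautifulSoup) separate from the analysis step, which makes each one easier to debug.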

Code number one helps you see the response from the server. This response contains the HTML code sent by the server. Analyze the response (the HTML code from the server) of this code with another piece of code and separate out the class you want.

==================================================

import requests  # CODE1
from requests_toolbelt.utils import dump

resp = requests.get('http://kanoon.ir/')
data = dump.dump_all(resp)
print(data.decode('utf-8')) 

==================================================

The output of the code (HTTP request and HTML code):

< GET / HTTP/1.1

< Host: kanoon.ir

< User-Agent: python-requests/2.23.0

< Accept-Encoding: gzip, deflate

< Accept: */*

< Connection: keep-alive

< 
     ...

==================================================

The code you write for the second part (analysis and HTML-code separation) depends on your creativity.
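As one illustration of that second step, once you have the raw HTML from CODE1 you can feed it to BeautifulSoup and separate out the class you want. A minimal sketch on a hypothetical HTML fragment (the `player-card` class and the markup are assumptions for illustration, not the real NFL page structure):

```python
from bs4 import BeautifulSoup

# A hypothetical HTML fragment standing in for the server response from CODE1
html = """
<div class="player-card"><a href="/p/1">Chase Young</a></div>
<div class="player-card"><a href="/p/2">Jeff Okudah</a></div>
"""

soup = BeautifulSoup(html, 'html.parser')
# Separate the target class and pull out the link text inside each div
names = [div.a.text for div in soup.find_all('div', class_='player-card')]
print(names)
```

Note that this only works when the class actually appears in the server's response; for JavaScript-rendered pages like the one in the question, you still need Selenium (or the site's underlying API) to obtain the rendered HTML first.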
