Web scraping using Selenium in Python - trouble retrieving all data
I am trying to web-scrape coinmarketcap.com using Selenium, retrieving data such as the coin name, market cap, price, and circulating supply. However, I have not been successful: I am only able to retrieve 11 alt coins and no more. I have also looked into several different methods for rendering JavaScript (which I presume coinmarketcap is built with). Here is the start of my code:
from selenium import webdriver

driver = webdriver.Chrome(r'C:\Users\Ejer\PycharmProjects\pythonProject\chromedriver')
driver.get('https://coinmarketcap.com/')
Crypto = driver.find_elements_by_xpath("//div[contains(concat(' ', normalize-space(@class), ' '), 'sc-16r8icm-0 sc-1teo54s-1 lgwUsc')]")
#price = driver.find_elements_by_xpath('//td[@class="cmc-link"]')
#coincap = driver.find_elements_by_xpath('//td[@class="DAY"]')
CMC_list = []
for c in range(len(Crypto)):
    CMC_list.append(Crypto[c].text)
print(CMC_list)
driver.close()
My goal is to store the names, market cap, price, and circulating supply in a DataFrame so I can apply machine learning methods and analyze the data. I am open to any suggestions. Thanks in advance.
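Once the scraping works, assembling the values into a DataFrame is the easy part. A minimal sketch, assuming each scraped row has already been reduced to a dict (the sample values below are illustrative, not real scraped output):

```python
import pandas as pd

# Illustrative records; in practice each dict is built from one table row
# scraped with Selenium.
records = [
    {"name": "Bitcoin", "market_cap": "$1.1T", "price": "$57,000", "supply": "18.9M BTC"},
    {"name": "Ethereum", "market_cap": "$440B", "price": "$3,700", "supply": "119M ETH"},
]

df = pd.DataFrame(records)
print(df.columns.tolist())
```

For machine learning you would still need to strip the currency symbols and unit suffixes and convert the columns to numeric dtypes (e.g. with `pd.to_numeric` after cleaning).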
To retrieve the list of coin names you need to close the cookies bar, close the popup, and induce WebDriverWait for visibility_of_all_elements_located(). You can use either of the following locator strategies:

Using CSS_SELECTOR and get_attribute("innerHTML"):
driver.get("https://coinmarketcap.com/")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "div.cmc-cookie-policy-banner__close"))).click()
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//button/b[text()='No, thanks']"))).click()
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.cmc-table tbody tr td > a p[color='text']")))])
Using XPATH and the text attribute:
driver.get("https://coinmarketcap.com/")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "div.cmc-cookie-policy-banner__close"))).click()
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//button/b[text()='No, thanks']"))).click()
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[contains(@class, 'cmc-table')]//tbody//tr//td/a//p[@color='text']")))])
driver.quit()
Console Output:
['Bitcoin', 'Ethereum', 'XRP', 'Tether', 'Litecoin', 'Bitcoin Cash', 'Chainlink', 'Cardano', 'Polkadot', 'Binance Coin', 'Stellar', 'USD Coin', 'Bitcoin SV']
Note: you have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
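The snippets above only collect the coin names, while the question also asks for market cap, price, and supply. One option is to wait for the whole table rows instead of the name cells and split each row's visible text (Selenium's .text joins the visible cells with newlines). Which index holds which field depends on coinmarketcap's current layout, so the slots used in this sketch are assumptions to verify against the live page:

```python
def parse_coin_row(row_text):
    """Split one table row's .text into labeled fields.

    The index-to-field mapping (rank, name, symbol, price) is an
    assumption about the table layout and may need adjusting.
    """
    cells = [c.strip() for c in row_text.split("\n") if c.strip()]
    return {
        "rank": cells[0],
        "name": cells[1],
        "symbol": cells[2],
        "price": cells[3],
    }

# With Selenium, the rows would come from something like:
# rows = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located(
#     (By.CSS_SELECTOR, "table.cmc-table tbody tr")))
# records = [parse_coin_row(r.text) for r in rows]
```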
Facing the same problem, I added page scrolling before Crypto = driver.find_elements_by_xpath..., like this:
import time

SCROLL_PAUSE_TIME = 1  # seconds to wait after each scroll step

i = 0
while i < 15:
    driver.execute_script("window.scrollBy(0, window.innerHeight)")
    time.sleep(SCROLL_PAUSE_TIME)
    i += 1
Crypto = driver.find_elements_by_xpath('//div[@class="sc-16r8icm-0 sc-1teo54s-0 dBKWCw"]')
On my laptop, scrolling down the page 13 times is enough to get all 100 coins refreshed; I put 15 just to be sure. The next step is to get the refreshed content. Perhaps I will have to repeat the scrolling every 1 or 2 minutes to keep it up to date. This is my first post here, and it was hard enough to insert the code. I hope it's useful.
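The fixed 15-iteration loop above can be made adaptive by scrolling until the row count stops growing, which avoids guessing a magic number per machine. A sketch of that stopping rule with the Selenium calls stubbed out (the helper name and wiring are illustrative, not from the original answer):

```python
def scroll_until_stable(get_count, do_scroll, max_rounds=30):
    """Scroll repeatedly until get_count() stops increasing.

    get_count: returns the number of currently loaded rows.
    do_scroll: performs one scroll step (plus any pause).
    Returns the final row count.
    """
    last = get_count()
    for _ in range(max_rounds):
        do_scroll()
        now = get_count()
        if now == last:  # no new rows appeared; assume the page is loaded
            return now
        last = now
    return last

# With Selenium this would be wired up roughly as:
# result = scroll_until_stable(
#     get_count=lambda: len(driver.find_elements_by_css_selector("table.cmc-table tbody tr")),
#     do_scroll=lambda: (driver.execute_script("window.scrollBy(0, window.innerHeight)"),
#                        time.sleep(SCROLL_PAUSE_TIME)),
# )
```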