How to scrape links from a hidden span class in HTML?
I'm learning web scraping by scraping real-world data from real websites, but I've never run into this kind of issue until now. Normally, one can find the HTML source for any part of a page by right-clicking it and choosing the Inspect option. I'll jump straight to an example to explain the problem.

In the picture above, the span class marked in red is not in the page originally; it only appears when I place (without even clicking) my cursor on a user's name, at which point a small box for that user pops up along with that span class. What I ultimately want to scrape is the link to the user's profile, which is embedded inside that span class. I'm not certain, but if I can parse that span class, I think I can try to scrape the link address; however, I keep failing to parse that hidden span class.

As I half expected, my code returned an empty list, because the span class is not present when my cursor is not on the user's name. Still, here is what I've done so far:
import time

from bs4 import BeautifulSoup
from selenium import webdriver

# Incognito mode
option = webdriver.ChromeOptions()
option.add_argument("--incognito")

# Open Chrome
driver = webdriver.Chrome(executable_path="C:/Users/chromedriver.exe", options=option)
driver.get("https://www.tripadvisor.com/VacationRentalReview-g60742-d7951369-or20-Groove_Stone_Getaway-Asheville_North_Carolina.html")
time.sleep(3)

# Parse the HTML
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
hidden = soup.find_all("span", class_="ui_overlay ui_popover arrow_left")
print(hidden)
Is there any simple, intuitive way to parse that hidden span class using Selenium? If I can parse it, I may be able to use the find function to extract the link address for one user and then loop over all the users to collect every link address. Thank you.
======================= Updated the question by adding the below =======================
To explain in more detail what I want to retrieve: I want to get the link indicated by the red arrow in the picture below. Thank you for pointing out that I needed to explain more.
========================== Updated code so far ==========================
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.support import expected_conditions as EC

# Incognito mode
option = webdriver.ChromeOptions()
option.add_argument("--incognito")

# Open Chrome
driver = webdriver.Chrome(executable_path="C:/Users/chromedriver.exe", options=option)
driver.get("https://www.tripadvisor.com/VacationRentalReview-g60742-d7951369-or20-Groove_Stone_Getaway-Asheville_North_Carolina.html")
time.sleep(3)
profile = driver.find_element_by_xpath("//div[@class='mainContent']")
profile_pic = profile.find_element_by_xpath("//div[@class='ui_avatar large']")
ActionChains(driver).move_to_element(profile_pic).perform()
ActionChains(driver).move_to_element(profile_pic).click().perform()
# So far I can successfully hover over the first user; a few issues occur after this line.
# The error message says "type object 'By' has no attribute 'xpath'". I thought this would
# work, since I searched the internet for how to enable this function.
waiting = wait(driver, 5).until(EC.element_to_be_clickable((By.xpath, ('//span//a[contains(@href,"/Profile/")]'))))

# This also gives me an error message saying "unable to locate the element".
# Some of the Python and Java examples I found differ, so I searched for how to get the value
# of the xpath that contains "/Profile/", but it gives me an error.
profile_box = driver.find_element_by_xpath('//span//a[contains(@href,"/Profile/")]').get_attribute("href")
print(profile_box)
Also, is there any way to iterate through the xpath in this case?
I think you can use the requests library instead of Selenium. When you hover over a username, the browser sends a request to a URL like the one below.
import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.tripadvisor.com/VacationRentalReview-g60742-d7951369-or20-Groove_Stone_Getaway-Asheville_North_Carolina.html')
print(html.status_code)

soup = BeautifulSoup(html.content, 'html.parser')

# Find the UID of every username:
# split a string like "UID_D37FB22A0982ED20FA4D7345A60B8826-SRC_511863293" into UID and SRC,
# then recombine them into the request URL.
name = soup.find_all('div', class_="memberOverlayLink")
for i in name:
    print(i.get('id'))

# Use the URL to get the profile link
response = requests.get('https://www.tripadvisor.com/MemberOverlay?Mode=owa&uid=805E0639C29797AEDE019E6F7DA9FF4E&c=&src=507403702&fus=false&partner=false&LsoId=&metaReferer=')
soup = BeautifulSoup(response.content, 'html.parser')
result = soup.find('a')
print(result.get('href'))
This is the output:
200
UID_D37FB22A0982ED20FA4D7345A60B8826-SRC_511863293
UID_D37FB22A0982ED20FA4D7345A60B8826-SRC_511863293
UID_D37FB22A0982ED20FA4D7345A60B8826-SRC_511863293
UID_805E0639C29797AEDE019E6F7DA9FF4E-SRC_507403702
UID_805E0639C29797AEDE019E6F7DA9FF4E-SRC_507403702
UID_805E0639C29797AEDE019E6F7DA9FF4E-SRC_507403702
UID_6A86C50AB327BA06D3B8B6F674200EDD-SRC_506453752
UID_6A86C50AB327BA06D3B8B6F674200EDD-SRC_506453752
UID_6A86C50AB327BA06D3B8B6F674200EDD-SRC_506453752
UID_97307AA9DD045AE5484EEEECCF0CA767-SRC_500684401
UID_97307AA9DD045AE5484EEEECCF0CA767-SRC_500684401
UID_97307AA9DD045AE5484EEEECCF0CA767-SRC_500684401
UID_E629D379A14B8F90E01214A5FA52C73B-SRC_496284746
UID_E629D379A14B8F90E01214A5FA52C73B-SRC_496284746
UID_E629D379A14B8F90E01214A5FA52C73B-SRC_496284746
/Profile/JLERPercy
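Each id appears three times in the output above, and the UID/SRC recombination step is only described in the code comments. A small sketch may make it concrete; the `overlay_url` helper is hypothetical, but the URL template and parameter names match the example request in the code above:

```python
# Hypothetical helper: turn a memberOverlayLink id such as
# "UID_805E0639C29797AEDE019E6F7DA9FF4E-SRC_507403702" into the
# MemberOverlay request URL used above.
def overlay_url(overlay_id):
    uid_part, src_part = overlay_id.split('-')   # "UID_...", "SRC_..."
    uid = uid_part[len('UID_'):]
    src = src_part[len('SRC_'):]
    return ('https://www.tripadvisor.com/MemberOverlay?Mode=owa'
            f'&uid={uid}&c=&src={src}&fus=false&partner=false'
            '&LsoId=&metaReferer=')

# The ids printed by the loop above repeat, so dedupe while keeping order
ids = [
    'UID_D37FB22A0982ED20FA4D7345A60B8826-SRC_511863293',
    'UID_D37FB22A0982ED20FA4D7345A60B8826-SRC_511863293',
    'UID_805E0639C29797AEDE019E6F7DA9FF4E-SRC_507403702',
]
unique_ids = list(dict.fromkeys(ids))
for overlay_id in unique_ids:
    print(overlay_url(overlay_id))
```

Each resulting URL can then be fetched with `requests.get` and parsed for the first `<a>` tag, as in the answer's code.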
If you want to use Selenium to get the popup box, you can use ActionChains to hover over the element, but I think it's less efficient than using requests.
from selenium.webdriver.common.action_chains import ActionChains
ActionChains(driver).move_to_element(element).perform()
The code below will extract the href value. Try it and let me know how it goes.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Chrome('/usr/local/bin/chromedriver')  # Optional argument; if not specified, PATH is searched.
driver.implicitly_wait(15)
driver.get("https://www.tripadvisor.com/VacationRentalReview-g60742-d7951369-or20-Groove_Stone_Getaway-Asheville_North_Carolina.html")

# Find all the commenters' profile pics
profile_pic = driver.find_elements(By.XPATH, "//div[@class='prw_rup prw_reviews_member_info_hsx']//div[@class='ui_avatar large']")
for i in profile_pic:
    # Hover over and click each profile pic one by one
    ActionChains(driver).move_to_element(i).perform()
    ActionChains(driver).move_to_element(i).click().perform()
    # Print the href (link) value
    profile_box = driver.find_element_by_xpath('//span//a[contains(@href,"/Profile/")]').get_attribute("href")
    print(profile_box)
driver.quit()
import java.util.List;
import java.util.concurrent.TimeUnit;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.interactions.Actions;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class Selenium {

    public static void main(String[] args) {
        System.setProperty("webdriver.chrome.driver", "./lib/chromedriver");
        WebDriver driver = new ChromeDriver();
        driver.manage().window().maximize();
        driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
        driver.get("https://www.tripadvisor.com/VacationRentalReview-g60742-d7951369-or20-Groove_Stone_Getaway-Asheville_North_Carolina.html");

        // Find all the commenters' profile photos
        List<WebElement> profile = driver.findElements(By.xpath("//div[@class='prw_rup prw_reviews_member_info_hsx']//div[@class='ui_avatar large']"));
        for (int i = 0; i < profile.size(); i++) {
            // Hover over the user's profile photo
            Actions builder = new Actions(driver);
            builder.moveToElement(profile.get(i)).perform();
            builder.moveToElement(profile.get(i)).click().perform();

            // Wait for the user-details pop-up
            WebDriverWait wait = new WebDriverWait(driver, 10);
            wait.until(ExpectedConditions.visibilityOfElementLocated(By.xpath("//span//a[contains(@href,'/Profile/')]")));

            // Extract and print the href value
            String hrefvalue = driver.findElement(By.xpath("//span//a[contains(@href,'/Profile/')]")).getAttribute("href");
            System.out.println(hrefvalue);
        }
        // Close the browser
        driver.quit();
    }
}
Output:

https://www.tripadvisor.com/Profile/861kellyd
https://www.tripadvisor.com/Profile/JLERPercy
https://www.tripadvisor.com/Profile/rayn817
https://www.tripadvisor.com/Profile/grossla
https://www.tripadvisor.com/Profile/kapmem
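If only the screen names are wanted from these profile URLs, a small standard-library helper (hypothetical, not part of either answer) can strip them out:

```python
from urllib.parse import urlparse

def profile_name(url):
    # The screen name is the last segment of the /Profile/<name> path
    return urlparse(url).path.rsplit('/', 1)[-1]

urls = [
    'https://www.tripadvisor.com/Profile/861kellyd',
    'https://www.tripadvisor.com/Profile/JLERPercy',
]
print([profile_name(u) for u in urls])  # ['861kellyd', 'JLERPercy']
```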