
How to scrape links from a hidden span class in HTML?

I'm learning web scraping by scraping real-world data from real websites, yet I've never run into this type of issue until now. Usually you can find the HTML source you want by right-clicking a part of the page and choosing the Inspect option. I'll jump right to an example to explain the issue.

[screenshot: the span class, marked in red, that only appears on hover]

In the picture above, the span class marked in red is not there originally, but when I put (without even clicking) my cursor on a user's name, a small box for that user pops up and that span class appears. What I ultimately want to scrape is the link address for the user's profile, which is embedded inside that span class. I'm not sure, but if I can parse that span class, I guess I can try to scrape the link address; however, I keep failing to parse that hidden span class.

I didn't expect much, and my code of course gave me an empty list, because that span class doesn't show up when the cursor is not on the user's name. Still, here is my code to show what I've done.

import time

from bs4 import BeautifulSoup
from selenium import webdriver

#Incognito Mode
option=webdriver.ChromeOptions()
option.add_argument("--incognito")

#Open Chrome
driver=webdriver.Chrome(executable_path="C:/Users/chromedriver.exe",options=option)

driver.get("https://www.tripadvisor.com/VacationRentalReview-g60742-d7951369-or20-Groove_Stone_Getaway-Asheville_North_Carolina.html")
time.sleep(3)

#parse html
html =driver.page_source
soup=BeautifulSoup(html,"html.parser")

hidden=soup.find_all("span", class_="ui_overlay ui_popover arrow_left")
print (hidden)

Are there any simple and intuitive ways to parse that hidden span class using selenium? If I can parse it, I may be able to use the 'find' function to parse the link address for one user and then loop over all the users to get all the link addresses. Thank you.

======================= updated the question by adding below =======================
To explain in more detail what I want to retrieve: I want to get the link pointed to by the red arrow in the picture below. Thank you for pointing out that I needed to add more explanation.

[screenshot: the profile link indicated by a red arrow]

========================== updated code so far ==========================

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.support import expected_conditions as EC

#Incognito Mode
option=webdriver.ChromeOptions()
option.add_argument("--incognito")

#Open Chrome
driver=webdriver.Chrome(executable_path="C:/Users/chromedriver.exe",options=option)

driver.get("https://www.tripadvisor.com/VacationRentalReview-g60742-d7951369-or20-Groove_Stone_Getaway-Asheville_North_Carolina.html")
time.sleep(3)

profile=driver.find_element_by_xpath("//div[@class='mainContent']")
profile_pic=profile.find_element_by_xpath("//div[@class='ui_avatar large']")

ActionChains(driver).move_to_element(profile_pic).perform()
ActionChains(driver).move_to_element(profile_pic).click().perform()

#So far I could successfully hover over the first user. A few issues occur after this line.
#The error message said "type object 'By' has no attribute 'xpath'": the locator attribute is uppercase, By.XPATH.
waiting=wait(driver, 5).until(EC.element_to_be_clickable((By.XPATH, '//span//a[contains(@href,"/Profile/")]')))

#This also gives me an error message saying "unable to locate the element".
#Some of the ways to write this differ between Python and Java, so I searched how to get the value of the xpath which contains "/Profile/", but it still gives me an error.
profile_box=driver.find_element_by_xpath('//span//a[contains(@href,"/Profile/")]').get_attribute("href")
print (profile_box)


Also, is there any way to iterate through the xpath matches in this case?

I think you can use the requests library instead of selenium.

When you hover on a username, you will see a request URL like the one below.

First,

import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.tripadvisor.com/VacationRentalReview-g60742-d7951369-or20-Groove_Stone_Getaway-Asheville_North_Carolina.html')
print(html.status_code)

soup = BeautifulSoup(html.content, 'html.parser')

# Find all UID of username
# Split the string "UID_D37FB22A0982ED20FA4D7345A60B8826-SRC_511863293" into UID, SRC
# And recombine to Request URL
name = soup.find_all('div', class_="memberOverlayLink")
for i in name:
    print(i.get('id'))

# Use url to get profile link
response = requests.get('https://www.tripadvisor.com/MemberOverlay?Mode=owa&uid=805E0639C29797AEDE019E6F7DA9FF4E&c=&src=507403702&fus=false&partner=false&LsoId=&metaReferer=')
soup = BeautifulSoup(response.content, 'html.parser')
result = soup.find('a')
print(result.get('href'))

This is the output:

200
UID_D37FB22A0982ED20FA4D7345A60B8826-SRC_511863293
UID_D37FB22A0982ED20FA4D7345A60B8826-SRC_511863293
UID_D37FB22A0982ED20FA4D7345A60B8826-SRC_511863293
UID_805E0639C29797AEDE019E6F7DA9FF4E-SRC_507403702
UID_805E0639C29797AEDE019E6F7DA9FF4E-SRC_507403702
UID_805E0639C29797AEDE019E6F7DA9FF4E-SRC_507403702
UID_6A86C50AB327BA06D3B8B6F674200EDD-SRC_506453752
UID_6A86C50AB327BA06D3B8B6F674200EDD-SRC_506453752
UID_6A86C50AB327BA06D3B8B6F674200EDD-SRC_506453752
UID_97307AA9DD045AE5484EEEECCF0CA767-SRC_500684401
UID_97307AA9DD045AE5484EEEECCF0CA767-SRC_500684401
UID_97307AA9DD045AE5484EEEECCF0CA767-SRC_500684401
UID_E629D379A14B8F90E01214A5FA52C73B-SRC_496284746
UID_E629D379A14B8F90E01214A5FA52C73B-SRC_496284746
UID_E629D379A14B8F90E01214A5FA52C73B-SRC_496284746
/Profile/JLERPercy
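The comments in the code above mention splitting an id like `UID_…-SRC_…` into its UID and SRC parts and recombining them into the request URL, but the snippet stops at printing the ids. As a minimal sketch (assuming the query-string layout shown in the hard-coded `MemberOverlay` URL above, and a hypothetical helper name `build_overlay_url`), the recombination step could look like:

```python
def build_overlay_url(overlay_id):
    """Rebuild the MemberOverlay request URL from an id of the form
    'UID_<uid>-SRC_<src>', as printed in the output above."""
    uid_part, src_part = overlay_id.split('-', 1)
    uid = uid_part[len('UID_'):]   # strip the 'UID_' prefix
    src = src_part[len('SRC_'):]   # strip the 'SRC_' prefix
    return ('https://www.tripadvisor.com/MemberOverlay?Mode=owa'
            f'&uid={uid}&c=&src={src}&fus=false&partner=false'
            '&LsoId=&metaReferer=')

print(build_overlay_url('UID_805E0639C29797AEDE019E6F7DA9FF4E-SRC_507403702'))
```

Each rebuilt URL can then be fetched with `requests.get` and parsed with `soup.find('a')` exactly as in the hard-coded example above.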

If you want to use selenium to get the popup box,

you can use ActionChains to perform the hover action.

But I think it's less efficient than using requests.

from selenium.webdriver.common.action_chains import ActionChains
ActionChains(driver).move_to_element(element).perform()

Python

The code below will extract the href value. Try it and let me know how it goes.

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
driver = webdriver.Chrome('/usr/local/bin/chromedriver')  # Optional argument, if not specified will search path.
driver.implicitly_wait(15)

driver.get("https://www.tripadvisor.com/VacationRentalReview-g60742-d7951369-or20-Groove_Stone_Getaway-Asheville_North_Carolina.html")

#finds all the comments or profile pics
profile_pic= driver.find_elements(By.XPATH,"//div[@class='prw_rup prw_reviews_member_info_hsx']//div[@class='ui_avatar large']")

for i in profile_pic:
    # hover over and click each profile pic one by one
    ActionChains(driver).move_to_element(i).perform()
    ActionChains(driver).move_to_element(i).click().perform()
    # print the href (link) value
    profile_box = driver.find_element_by_xpath('//span//a[contains(@href,"/Profile/")]').get_attribute("href")
    print(profile_box)

driver.quit()

Java example:

import java.util.List;
import java.util.concurrent.TimeUnit;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.interactions.Actions;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class Selenium {

    public static void main(String[] args) {
        System.setProperty("webdriver.chrome.driver", "./lib/chromedriver");
        WebDriver driver = new ChromeDriver();
        driver.manage().window().maximize();
        driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
        driver.get("https://www.tripadvisor.com/VacationRentalReview-g60742-d7951369-or20-Groove_Stone_Getaway-Asheville_North_Carolina.html");

        //finds all the comments or profiles
        List<WebElement> profile= driver.findElements(By.xpath("//div[@class='prw_rup prw_reviews_member_info_hsx']//div[@class='ui_avatar large']"));

        for(int i=0;i<profile.size();i++)
        {
            //Hover on user profile photo
            Actions builder = new Actions(driver);
            builder.moveToElement(profile.get(i)).perform();
            builder.moveToElement(profile.get(i)).click().perform();
            //Wait for user details pop-up
            WebDriverWait wait = new WebDriverWait(driver, 10);
            wait.until(ExpectedConditions.visibilityOfElementLocated(By.xpath("//span//a[contains(@href,'/Profile/')]")));
            //Extract the href value
            String hrefvalue=driver.findElement(By.xpath("//span//a[contains(@href,'/Profile/')]")).getAttribute("href");
            //Print the extracted value
            System.out.println(hrefvalue);
        }
        //close the browser
        driver.quit(); 

    }

}

Output:

https://www.tripadvisor.com/Profile/861kellyd
https://www.tripadvisor.com/Profile/JLERPercy
https://www.tripadvisor.com/Profile/rayn817
https://www.tripadvisor.com/Profile/grossla
https://www.tripadvisor.com/Profile/kapmem


