
Scraping URLs from web pages using Selenium Python (NSFW)

I'm learning Python by trying to write a script to scrape xHamster. If anyone's familiar with the site, I'm specifically trying to write all of the URLs of a given user's videos to a .txt file.

Currently, I've managed to scrape the URLs off a specific page, but there are multiple pages and I'm struggling to loop through them.

In my attempt below I've commented where I'm trying to read the URL of the next page, but it currently prints None. Any ideas why, and how to resolve this?
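
For reference, the general shape of what I'm after is below: keep collecting the video links and following the "next page" anchor until there isn't one. This is only a sketch, and the a.next selector is a guess on my part rather than the site's actual markup:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

username = "ANY_USERNAME"  # placeholder, as above
driver = webdriver.Chrome()
driver.get("https://xhams***.com/user/video/" + username + "/new-1.html")

all_links = []
while True:
    # collect the video links on the current page
    all_links += [a.get_attribute('href')
                  for a in driver.find_elements_by_class_name('hRotator')]
    try:
        # 'a.next' is a guessed selector for the next-page link
        next_link = driver.find_element_by_css_selector('a.next')
    except NoSuchElementException:
        break
    driver.get(next_link.get_attribute('href'))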

Current script:

#!/usr/bin/env python

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--incognito")

driver = webdriver.Chrome(chrome_options=chrome_options)

username = **ANY_USERNAME**
##page = 1
url = "https://xhams***.com/user/video/" + username + "/new-1.html"

driver.implicitly_wait(10)
driver.get(url)

links = driver.find_elements_by_class_name('hRotator')
#nextPage = driver.find_elements_by_class_name('last')

noOfLinks = len(links)
count = 0

file = open('x--' + username + '.txt','w')
while count < noOfLinks:
    #print links[count].get_attribute('href')
    file.write(links[count].get_attribute('href') + '\n')
    count += 1

file.close()
driver.close()

My attempt at looping through the pages:

#!/usr/bin/env python

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--incognito")

driver = webdriver.Chrome(chrome_options=chrome_options)

username = **ANY_USERNAME**
##page = 1
url = "https://xhams***.com/user/video/" + username + "/new-1.html"

driver.implicitly_wait(10)
driver.get(url)

links = driver.find_elements_by_class_name('hRotator')
#nextPage = driver.find_elements_by_class_name('colR')

## TRYING TO READ THE NEXT PAGE HERE
print(driver.find_element_by_class_name('last').get_attribute('href'))

noOfLinks = len(links)
count = 0

file = open('x--' + username + '.txt','w')
while count < noOfLinks:
    #print links[count].get_attribute('href')
    file.write(links[count].get_attribute('href') + '\n')
    count += 1

file.close()
driver.close()

UPDATE:

I've used Philippe Oger's answer below, but modified the following two methods to work for single-page results:

def find_max_pagination(self):
    start_url = 'https://www.xhamster.com/user/video/{}/new-1.html'.format(self.user)
    r = requests.get(start_url)
    tree = html.fromstring(r.content)
    # query the pager once and reuse the result instead of repeating the XPath
    page_links = tree.xpath('//div[@class="pager"]/table/tr/td/div/a')
    if page_links:
        self.max_page = max(
            [int(x.text) for x in page_links if x.text not in [None, '...']]
        )
    else:
        self.max_page = 1

    return self.max_page

def generate_listing_urls(self):
    if self.max_page == 1:
        pages = [self.paginated_listing_page(str(page)) for page in range(0, 1)]
    else:
        pages = [self.paginated_listing_page(str(page)) for page in range(0, self.max_page)]

    return pages

On a user page we can actually find out how far the pagination goes, so instead of looping through the pagination we can generate every listing URL for the user with a list comprehension and then scrape those one by one.
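
In other words, the idea boils down to something like the following stripped-down sketch, where max_page stands for the highest page number read from the pager and the user name is just the sample used further down:

user = 'wearehairy'  # the sample user from the script below
max_page = 7         # e.g. the highest number found among the pager links
page_urls = ['https://www.xhamster.com/user/video/{}/new-{}.html'.format(user, page)
             for page in range(1, max_page + 1)]
# -> ['.../new-1.html', '.../new-2.html', ..., '.../new-7.html']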

Here are my two cents using LXML. If you simply copy/paste this code, it will write every video URL to a TXT file. You only need to change the user name.

from lxml import html
import requests


class XXXVideosScraper(object):

    def __init__(self, user):
        self.user = user
        self.max_page = None
        self.video_urls = list()

    def run(self):
        self.find_max_pagination()
        pages_to_crawl = self.generate_listing_urls()
        for page in pages_to_crawl:
            self.capture_video_urls(page)
        with open('results.txt', 'w') as f:
            for video in self.video_urls:
                f.write(video)
                f.write('\n')

    def find_max_pagination(self):
        start_url = 'https://www.xhamster.com/user/video/{}/new-1.html'.format(self.user)
        r = requests.get(start_url)
        tree = html.fromstring(r.content)

        try:
            # max() on an empty list raises ValueError, i.e. no pager was found,
            # which is the case when the user has a single page of videos
            self.max_page = max(
                [int(x.text) for x in tree.xpath('//div[@class="pager"]/table/tr/td/div/a') if x.text not in [None, '...']]
            )
        except ValueError:
            self.max_page = 1
        return self.max_page

    def generate_listing_urls(self):
        pages = [self.paginated_listing_page(page) for page in range(1, self.max_page + 1)]
        return pages

    def paginated_listing_page(self, pagination):
        return 'https://www.xhamster.com/user/video/{}/new-{}.html'.format(self.user, str(pagination))

    def capture_video_urls(self, url):
        r = requests.get(url)
        tree = html.fromstring(r.content)
        video_links = tree.xpath('//a[@class="hRotator"]/@href')
        self.video_urls += video_links


if __name__ == '__main__':
    sample_user = 'wearehairy'
    scraper = XXXVideosScraper(sample_user)
    scraper.run()

I haven't checked the case where a user has only one page in total. Let me know if this works fine.
