Scraping URLs from web pages using Selenium Python (NSFW)
I am learning Python and trying to write a script to scrape xHamster. If anyone is familiar with that site: I am trying to write the URLs of all of a given user's videos to a .txt file.

So far I have managed to scrape the URLs from a single page, but there are multiple pages and I am struggling to loop through them.

In my attempt below I have commented where I try to read the next page's URL, but it currently prints None. Any ideas why, and how to fix it?

Current script:
#!/usr/bin/env python

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--incognito")
driver = webdriver.Chrome(chrome_options=chrome_options)

username = "ANY_USERNAME"  # replace with the target username
##page = 1
url = "https://xhams***.com/user/video/" + username + "/new-1.html"

driver.implicitly_wait(10)
driver.get(url)

links = driver.find_elements_by_class_name('hRotator')
#nextPage = driver.find_elements_by_class_name('last')

noOfLinks = len(links)
count = 0

file = open('x--' + username + '.txt', 'w')
while count < noOfLinks:
    #print links[count].get_attribute('href')
    file.write(links[count].get_attribute('href') + '\n')
    count += 1
file.close()

driver.close()
My attempt at looping through the pages:
#!/usr/bin/env python

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--incognito")
driver = webdriver.Chrome(chrome_options=chrome_options)

username = "ANY_USERNAME"  # replace with the target username
##page = 1
url = "https://xhams***.com/user/video/" + username + "/new-1.html"

driver.implicitly_wait(10)
driver.get(url)

links = driver.find_elements_by_class_name('hRotator')
#nextPage = driver.find_elements_by_class_name('colR')

## TRYING TO READ THE NEXT PAGE HERE
print(driver.find_element_by_class_name('last').get_attribute('href'))

noOfLinks = len(links)
count = 0

file = open('x--' + username + '.txt', 'w')
while count < noOfLinks:
    #print links[count].get_attribute('href')
    file.write(links[count].get_attribute('href') + '\n')
    count += 1
file.close()

driver.close()
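Since the listing URLs follow a fixed `new-<n>.html` pattern, one way to sidestep reading the pager element entirely is to derive the next page's URL from the current one. A minimal sketch, assuming that pattern holds; the `next_page_url` helper and the `example.com` domain are my own illustrations, not part of the original script:

```python
import re


def next_page_url(url):
    """Given a listing URL ending in 'new-<n>.html', return the URL for page n + 1."""
    match = re.search(r"new-(\d+)\.html$", url)
    if match is None:
        raise ValueError("URL does not match the 'new-<n>.html' pattern: " + url)
    page = int(match.group(1))
    return url[:match.start()] + "new-{}.html".format(page + 1)


assert next_page_url("https://example.com/user/video/u/new-1.html") \
    == "https://example.com/user/video/u/new-2.html"
```

The loop would then fetch each generated URL until a page yields no `hRotator` links, rather than relying on `get_attribute('href')` of a pager element that may not be an anchor.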
Update:

I used Philippe Oger's answer below, but modified the two methods shown here to handle single-page results:
def find_max_pagination(self):
    start_url = 'https://www.xhamster.com/user/video/{}/new-1.html'.format(self.user)
    r = requests.get(start_url)
    tree = html.fromstring(r.content)
    page_links = tree.xpath('//div[@class="pager"]/table/tr/td/div/a')
    if page_links:
        self.max_page = max(
            [int(x.text) for x in page_links if x.text not in [None, '...']]
        )
    else:
        self.max_page = 1
    return self.max_page
def generate_listing_urls(self):
    if self.max_page == 1:
        pages = [self.paginated_listing_page(str(page)) for page in range(0, 1)]
    else:
        pages = [self.paginated_listing_page(str(page)) for page in range(0, self.max_page)]
    return pages
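Incidentally, the `if`/`else` above appears redundant: when `self.max_page == 1`, `range(0, 1)` and `range(0, self.max_page)` are the same range, so both branches produce identical lists. A quick check with a stand-in for the method body (`generate_pages` is my own hypothetical helper, with `paginated_listing_page` left out for clarity):

```python
def generate_pages(max_page):
    # Both branches of the original if/else reduce to this one expression,
    # because range(0, 1) == range(0, max_page) whenever max_page == 1.
    return [str(page) for page in range(0, max_page)]


assert generate_pages(1) == ["0"]            # the "single page" branch
assert generate_pages(3) == ["0", "1", "2"]  # the multi-page branch
```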
On the user's page we can actually find out how far the pagination goes, so instead of looping through the pagination we can generate every one of the user's listing URLs with a list comprehension and then scrape them one by one.

Here are my two cents using LXML. If you simply copy/paste this code it will write every video URL to a TXT file. You only need to change the username.
from lxml import html
import requests


class XXXVideosScraper(object):

    def __init__(self, user):
        self.user = user
        self.max_page = None
        self.video_urls = list()

    def run(self):
        self.find_max_pagination()
        pages_to_crawl = self.generate_listing_urls()
        for page in pages_to_crawl:
            self.capture_video_urls(page)
        with open('results.txt', 'w') as f:
            for video in self.video_urls:
                f.write(video)
                f.write('\n')

    def find_max_pagination(self):
        start_url = 'https://www.xhamster.com/user/video/{}/new-1.html'.format(self.user)
        r = requests.get(start_url)
        tree = html.fromstring(r.content)
        try:
            self.max_page = max(
                [int(x.text) for x in tree.xpath('//div[@class="pager"]/table/tr/td/div/a')
                 if x.text not in [None, '...']]
            )
        except ValueError:
            self.max_page = 1
        return self.max_page

    def generate_listing_urls(self):
        pages = [self.paginated_listing_page(page) for page in range(1, self.max_page + 1)]
        return pages

    def paginated_listing_page(self, pagination):
        return 'https://www.xhamster.com/user/video/{}/new-{}.html'.format(self.user, str(pagination))

    def capture_video_urls(self, url):
        r = requests.get(url)
        tree = html.fromstring(r.content)
        video_links = tree.xpath('//a[@class="hRotator"]/@href')
        self.video_urls += video_links


if __name__ == '__main__':
    sample_user = 'wearehairy'
    scraper = XXXVideosScraper(sample_user)
    scraper.run()
I have not checked the case where a user has only one page in total. Let me know whether this works fine.
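One small caveat with `run()` above: if the same video URL appears on more than one listing page, it is written to `results.txt` twice. A sketch of an order-preserving dedupe step that could be applied to `self.video_urls` before writing; the helper name is my own, not from the answer:

```python
def dedupe_preserving_order(urls):
    """Return urls with duplicates removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


assert dedupe_preserving_order(["/v/1", "/v/2", "/v/1", "/v/3"]) == ["/v/1", "/v/2", "/v/3"]
```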