How to take only a certain part of each element of a list?
Here's the link to the website:
Here's my script:
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import time
from selenium import webdriver
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
PATH = r"driver\chromedriver.exe"  # raw string so the backslash is taken literally
options = webdriver.ChromeOptions()
options.add_argument("--disable-gpu")
options.add_argument("--window-size=1200,900")
options.add_argument('enable-logging')
driver = webdriver.Chrome(options=options, executable_path=PATH)
wait = WebDriverWait(driver, 20)
driver.get('https://fr.hotels.com/search.do?destination-id=10398359&q-check-in=2021-06-26&q-check-out=2021-06-27&q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER')
driver.maximize_window()
time.sleep(2)
webdriver.ActionChains(driver).send_keys(Keys.ESCAPE).perform()
time.sleep(2)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'button[class="uolsaJ"]'))).click()
links = []
def is_element_visible(xpath):
    wait1 = WebDriverWait(driver, 2)
    try:
        wait1.until(EC.visibility_of_element_located((By.XPATH, xpath)))
        return True
    except Exception:
        return False
while not is_element_visible("//div[@id='20']"):
    my_elems = driver.find_elements_by_xpath('//a[@class="_61P-R0"]')
    links = [my_elem.get_attribute("href") for my_elem in my_elems]
    driver.execute_script("window.scrollBy(0, 1000)")
    time.sleep(5)
print(links)
Here's the output:
['https://fr.hotels.com/ho716157152/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho397103/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho1098309152/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho449686/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho315896/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho1574324896/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho288352/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho748227104/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho225263/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho225250/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho405210/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho547798/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho252584/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho351562/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho714011808/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho424335/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 
'https://fr.hotels.com/ho442661/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho437481/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3']
Those are the URLs of the hotels. I would like to know how to extract a specific part of each one.
I would like to get the ID present in each URL:
'https://fr.hotels.com/ho437481/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3'
-> 437481
In other words, recreate the list with just those numbers instead of the URLs.
Something like this:
['716157152', '397103', '1098309152' ... , '437481']
You can use regular expressions, but if the structure is always https://fr.hotels.com/ho[your_id]/[...], split will suffice:
hotel_ids = [link.split('/')[3][2:] for link in links]
split turns the string into a list like ['https:', '', 'fr.hotels.com', 'ho[your_id]', ...], so the id will always be in the 4th position (index = 3), and [2:] gets rid of the leading 'ho'.
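To see where index 3 comes from, here is a minimal sketch that prints the intermediate pieces for one of the URLs from the question (the variable names `url`, `parts` and `hotel_id` are illustrative only):

```python
url = 'https://fr.hotels.com/ho437481/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3'

# Splitting on '/' yields the scheme, an empty string (from the '//'),
# the host, the 'ho<id>' segment, and finally the query string.
parts = url.split('/')
print(parts[:4])  # ['https:', '', 'fr.hotels.com', 'ho437481']

# Index 3 is the 'ho<id>' segment; dropping its first two characters leaves the id.
hotel_id = parts[3][2:]
print(hotel_id)  # 437481
```

Note that the result is a string, not an int; wrap it in `int(...)` if you need a number.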
You can just do this once you have your links:
links = [s.split('/')[3][2:] for s in links]
# Output
['716157152', '397103', '1098309152', '449686', '315896', '1574324896', '288352', '748227104', '225263', '225250', '405210', '547798', '252584', '351562', '714011808', '424335', '442661', '437481']
I prefer the other answers, but regex is also a viable option.
import re
in_arr = ['https://fr.hotels.com/ho716157152/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho397103/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho1098309152/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho449686/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho315896/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho1574324896/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho288352/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho748227104/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho225263/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho225250/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho405210/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho547798/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho252584/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho351562/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho714011808/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho424335/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 
'https://fr.hotels.com/ho442661/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho437481/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3']
regex = r"(?<=\.com/ho)\w+"  # lookbehind: match the word characters right after ".com/ho"
out = map(lambda x: re.findall(regex, x)[0], in_arr)
print(list(out))
Output:
['716157152', '397103', '1098309152', '449686', '315896', '1574324896', '288352', '748227104', '225263', '225250', '405210', '547798', '252584', '351562', '714011808', '424335', '442661', '437481']
Use a regexp against the string:
>>> import re
>>> s = 'https://fr.hotels.com/ho437481/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3'
>>> m = re.match(r'https://fr\.hotels\.com/ho(\d+)/', s)
>>> m.group(1)
'437481'
You can put that in a function and use map against the list of URLs, or use a for loop, or even a list comprehension.
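As a rough sketch of that suggestion (the function name `hotel_id`, the constant `PATTERN` and the shortened two-item `urls` list are placeholders, not part of the original answer), the same helper can be applied in all three ways:

```python
import re

# Dots are escaped so they match literally; the group captures the digits after 'ho'.
PATTERN = re.compile(r'https://fr\.hotels\.com/ho(\d+)/')

def hotel_id(url):
    """Return the digits after 'ho', or None when the URL has a different shape."""
    m = PATTERN.match(url)
    return m.group(1) if m else None

# Shortened sample URLs, assumed to follow the same structure as in the question.
urls = [
    'https://fr.hotels.com/ho716157152/?q-rooms=1',
    'https://fr.hotels.com/ho437481/?q-rooms=1',
]

# 1. map
ids_map = list(map(hotel_id, urls))

# 2. for loop
ids_loop = []
for url in urls:
    ids_loop.append(hotel_id(url))

# 3. list comprehension
ids_comp = [hotel_id(url) for url in urls]

print(ids_map)  # ['716157152', '437481']
```

All three produce the same list; the list comprehension is usually the most readable choice for a one-line transformation like this.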
#!/usr/bin/env python3
urls = [
"https://fr.hotels.com/ho716157152/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3",
"https://fr.hotels.com/ho397103/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3",
"https://fr.hotels.com/ho1098309152/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3",
"https://fr.hotels.com/ho449686/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3",
"https://fr.hotels.com/ho315896/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3",
"https://fr.hotels.com/ho1574324896/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3",
"https://fr.hotels.com/ho288352/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3",
"https://fr.hotels.com/ho748227104/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3",
"https://fr.hotels.com/ho225263/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3",
"https://fr.hotels.com/ho225250/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3",
"https://fr.hotels.com/ho405210/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3",
"https://fr.hotels.com/ho547798/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3",
"https://fr.hotels.com/ho252584/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3",
"https://fr.hotels.com/ho351562/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3",
"https://fr.hotels.com/ho714011808/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3",
"https://fr.hotels.com/ho424335/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3",
"https://fr.hotels.com/ho442661/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3",
"https://fr.hotels.com/ho437481/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3"
]
def get_id(url):
    return url.split('/')[3][2:]
ids = [get_id(url) for url in urls]
print(ids)
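If the URLs might later gain a different scheme or host, a slightly more defensive variant of `get_id` could parse the URL first and look only at its path. This is just a sketch using the standard library's `urllib.parse`, not the approach from the answer above; the name `get_id_from_path` is made up here, and it assumes the id is still the first path segment and still prefixed with `ho`:

```python
from urllib.parse import urlparse

def get_id_from_path(url):
    """Return the hotel id from the first path segment, e.g. '/ho437481/' -> '437481'."""
    # urlparse separates scheme, host, path and query, so the query string
    # can never end up in the segment we inspect.
    first_segment = urlparse(url).path.strip('/').split('/')[0]
    if not first_segment.startswith('ho'):
        raise ValueError(f"unexpected URL shape: {url!r}")
    return first_segment[len('ho'):]

print(get_id_from_path("https://fr.hotels.com/ho437481/?q-rooms=1&sort-order=BEST_SELLER"))  # 437481
```

Raising on an unexpected shape is a design choice; returning `None` instead would work equally well if you prefer to skip such URLs.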
This can be solved with regex:
import re
urls = ['https://fr.hotels.com/ho716157152/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3', 'https://fr.hotels.com/ho397103/?q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER&ZSX=0&SYE=3']
print([re.findall(r'\d+',url)[0] for url in urls])
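Note that `r'\d+'` simply returns the first run of digits anywhere in the URL. That works here because nothing before the id contains a digit, but it is an assumption worth checking. A small sketch of the difference, using a made-up host name (`www2.example.com`) purely for illustration:

```python
import re

# Hypothetical URL whose host contains a digit before the hotel id.
url = 'https://www2.example.com/ho437481/?q-rooms=1'

# The bare \d+ pattern picks up the '2' in the host name first.
first_digits = re.findall(r'\d+', url)[0]
print(first_digits)  # 2

# Anchoring on the '/ho' prefix and the trailing '/' targets the id itself.
anchored = re.search(r'/ho(\d+)/', url).group(1)
print(anchored)  # 437481
```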