[英]I am getting error while replacing url in orginal code python,selenium,re
i used code in this question https://codereview.stackexchange.com/questions/241842/webscraping-with-selenium-a-course-downloader-and-sorter/248712#248712我在这个问题中使用了代码https://codereview.stackexchange.com/questions/241842/webscraping-with-selenium-a-course-downloader-and-sorter/248712#248712
i replaced url in that code to my url when i compile get an error shown below我将该代码中的 url 替换为我的 url,当我编译时出现如下所示的错误
line 66, in current_file_name = re.search(r'https://player.hdflixcore.workers.dev//0://Courses//Account%20Cracking%20--MrSihag//TN%20Cracking%20Course%20--MrSihag/.+/(.+)', download_path, re.DOTALL).group(1) AttributeError: 'NoneType' object has no attribute 'group'第66行,在current_file_name = re.search(r'https://player.hdflixcore.workers.dev//0://Courses//Account%20Cracking%20--MrSihag//TN%20Cracking%20Course%20- -MrSihag/.+/(.+)', download_path, re.DOTALL).group(1) AttributeError: 'NoneType' object 没有属性 'group'
i figured i that in code i used websiteaddress in "current_file_name" has some extra letters like backward slash我想我在代码中使用了“current_file_name”中的网站地址有一些额外的字母,如反斜杠
i have no idea about it i tried to do like same by adding some backward slash but no fix我对此一无所知 我试图通过添加一些反斜杠来做同样的事情但没有修复
but when i run orginal code it works fine when i use it in my desired site it end up with error that mentioned above但是当我运行原始代码时它工作正常当我在我想要的站点中使用它时它最终会出现上述错误
below is my edited code下面是我编辑的代码
from selenium import webdriver
import time
import os
import shutil
import re
path = r'https://player.hdflixcore.workers.dev/0:/Courses/Account%20Cracking%20--MrSihag/TN%20Cracking%20Course%20--MrSihag/'
# For changing the download location for this browser temporarily
options = webdriver.ChromeOptions()
preferences = {"download.default_directory": r"C:\Users\shanid\Desktop\test", "safebrowsing.enabled": "false"}
options.add_experimental_option("prefs", preferences)
# Acquire the Course Link and Get all the directories
browser = webdriver.Chrome(chrome_options=options)
browser.get(r"https://player.hdflixcore.workers.dev/0:/Courses/Account%20Cracking%20--MrSihag/TN%20Cracking%20Course%20--MrSihag/")
time.sleep(2)
elements = browser.find_elements_by_css_selector(".mdui-text-truncate")
# loop for as many directories there are
for i in range(0, len(elements)):
print("deft")
# At each directory, it refreshes the page to update the webelements in the list, and returns the current directory that is being worked on
browser.get(path)
time.sleep(2)
elements = browser.find_elements_by_css_selector(".mdui-text-truncate")
element = elements[i]
# checks if the folder for the directory already exists
current_directory_name = element.text[11:].strip(" .")
current_folder_path = "C:\\Users\\shanid\\Desktop\\test\\" + current_directory_name
if os.path.exists(current_folder_path):
pass
else:
os.mkdir(current_folder_path)
# Formatting what has been downloaded and sorted, and
print(current_directory_name, "------------------------------", sep="\n")
# moves on to the directory to get the page with the files
element.click()
# pausing for a few secs for the page to load, and running the same mechanism to get each file using the same method used in directory
time.sleep(3)
files = browser.find_elements_by_css_selector(".mdui-text-truncate")
for j in range(len(files)):
files = browser.find_elements_by_css_selector(".mdui-text-truncate")
_file = files[j]
# constants for some if statements
download = True
move = True
current_file_name = _file.text[17:].strip()
# If file exists, then pass over it, and don't do anything, and moveon to next file
if os.path.exists(current_folder_path + "\\" + current_file_name):
pass
# If it doesnt exist, then depending on its extension, do specific actions with it
else:
# Downloads the mp4 files by clicking on it, and finding the input tag which contains the download link for vid in its value attribute
if ".mp4" in current_file_name:
_file.click()
time.sleep(2)
download_path = browser.find_element_by_css_selector("input").get_attribute("value")
current_file_name = re.search(r'https://player.hdflixcore.workers.dev//0://Courses//Account%20Cracking%20--MrSihag//TN%20Cracking%20Course%20--MrSihag/.+/(.+)', download_path, re.DOTALL).group(1)
# Checks if file exists again, incase the filename is different then the predicted filename orderly generated.
if os.path.exists(current_folder_path + "\\" + current_file_name):
move = False
download = False
# returns to the previous page with the files
browser.back()
# self explanatory
elif ".html" in current_file_name:
download_path = path + current_directory_name + "/" + current_file_name
if os.path.exists(current_folder_path + "\\" + current_file_name):
move = False
download = False
else:
# acquires the download location by going to the parent tag which is an a tag containing the link for html in its 'href' attribute
download_path = _file.find_element_by_xpath('..').get_attribute('href').replace(r"%5E", "^")
current_file_name = re.search(r'https://player.hdflixcore.workers.dev/0:/Courses/Account%20Cracking%20--MrSihag/TN%20Cracking%20Course%20--MrSihag/.+/(.+)', download_path, re.DOTALL).group(1).replace("%20", " ")
time.sleep(2)
current_file_path = "C:\\Users\\shanid\\Desktop\\test\\" + current_file_name
# responsible for downloading it using a path, get allows downloading, by source links
if download:
browser.get(download_path)
# while the file doesn't exist/ it hasn't been downloaded yet, do nothing
while True:
if os.path.exists(current_file_path):
break
time.sleep(1)
# moves the file from the download spot to its own folder
if move:
shutil.move(current_file_path, current_folder_path + "\\" + current_file_name)
print(current_file_name)
# formatter
print("------------------------------", "", sep="\n")
time.sleep(3)
orginal code below原代码如下
from selenium import webdriver
import time
import os
import shutil
import re
path = r'https://coursevania.courses.workers.dev/[coursevania.com]%20Udemy%20-%20Master%20the%20Coding%20Interview%20Data%20Structures%20+%20Algorithms/'
# For changing the download location for this browser temporarily
options = webdriver.ChromeOptions()
preferences = {"download.default_directory": r"E:\Utilities_and_Apps\Python\MY PROJECTS\Test data\Downloads", "safebrowsing.enabled": "false"}
options.add_experimental_option("prefs", preferences)
# Acquire the Course Link and Get all the directories
browser = webdriver.Chrome(chrome_options=options)
browser.get(r"https://coursevania.courses.workers.dev/[coursevania.com]%20Udemy%20-%20Master%20the%20Coding%20Interview%20Data%20Structures%20+%20Algorithms/")
time.sleep(2)
elements = browser.find_elements_by_css_selector(".mdui-text-truncate")
# loop for as many directories there are
for i in range(0, len(elements)):
# At each directory, it refreshes the page to update the webelements in the list, and returns the current directory that is being worked on
browser.get(path)
time.sleep(2)
elements = browser.find_elements_by_css_selector(".mdui-text-truncate")
element = elements[i]
# checks if the folder for the directory already exists
current_directory_name = element.text[11:].strip(" .")
current_folder_path = "E:\\Utilities_and_Apps\\Python\\MY PROJECTS\\Test data\Downloads\\" + current_directory_name
if os.path.exists(current_folder_path):
pass
else:
os.mkdir(current_folder_path)
# Formatting what has been downloaded and sorted, and
print(current_directory_name, "------------------------------", sep="\n")
# moves on to the directory to get the page with the files
element.click()
# pausing for a few secs for the page to load, and running the same mechanism to get each file using the same method used in directory
time.sleep(3)
files = browser.find_elements_by_css_selector(".mdui-text-truncate")
for j in range(len(files)):
files = browser.find_elements_by_css_selector(".mdui-text-truncate")
_file = files[j]
# constants for some if statements
download = True
move = True
current_file_name = _file.text[17:].strip()
# If file exists, then pass over it, and don't do anything, and moveon to next file
if os.path.exists(current_folder_path + "\\" + current_file_name):
pass
# If it doesnt exist, then depending on its extension, do specific actions with it
else:
# Downloads the mp4 files by clicking on it, and finding the input tag which contains the download link for vid in its value attribute
if ".mp4" in current_file_name:
_file.click()
time.sleep(2)
download_path = browser.find_element_by_css_selector("input").get_attribute("value")
current_file_name = re.search(r'https://coursevania.courses.workers.dev/\[coursevania.com\]%20Udemy%20-%20Master%20the%20Coding%20Interview%20Data%20Structures%20\+%20Algorithms/.+/(.+)', download_path, re.DOTALL).group(1)
# Checks if file exists again, incase the filename is different then the predicted filename orderly generated.
if os.path.exists(current_folder_path + "\\" + current_file_name):
move = False
download = False
# returns to the previous page with the files
browser.back()
# self explanatory
elif ".html" in current_file_name:
download_path = path + current_directory_name + "/" + current_file_name
if os.path.exists(current_folder_path + "\\" + current_file_name):
move = False
download = False
else:
# acquires the download location by going to the parent tag which is an a tag containing the link for html in its 'href' attribute
download_path = _file.find_element_by_xpath('..').get_attribute('href').replace(r"%5E", "^")
current_file_name = re.search(r'https://coursevania.courses.workers.dev/\[coursevania.com\]%20Udemy%20-%20Master%20the%20Coding%20Interview%20Data%20Structures%20\+%20Algorithms/.+/(.+)', download_path, re.DOTALL).group(1).replace("%20", " ")
time.sleep(2)
current_file_path = "E:\\Utilities_and_Apps\\Python\\MY PROJECTS\\Test data\Downloads\\" + current_file_name
# responsible for downloading it using a path, get allows downloading, by source links
if download:
browser.get(download_path)
# while the file doesn't exist/ it hasn't been downloaded yet, do nothing
while True:
if os.path.exists(current_file_path):
break
time.sleep(1)
# moves the file from the download spot to its own folder
if move:
shutil.move(current_file_path, current_folder_path + "\\" + current_file_name)
print(current_file_name)
# formatter
print("------------------------------", "", sep="\n")
time.sleep(3)
this code works fine此代码工作正常
but not working when i change the website to https://player.hdflixcore.workers.dev/0:/Courses/Account%20Cracking%20--MrSihag/TN%20Cracking%20Course%20--MrSihag/但当我将网站更改为https://player.hdflixcore.workers.dev/0:/Courses/Account%20Cracking%20--MrSihag/TN%20Cracking%20Course%20--MrSihag/时无法正常工作
the site i used is clone of orginal site我使用的网站是原始网站的克隆
i have no idea why getting error我不知道为什么会出错
The issue is with the CSS selector for input box on below page.问题出在下页输入框的 CSS 选择器上。 https://player.hdflixcore.workers.dev/0:/Courses/Account%20Cracking%20--MrSihag/TN%20Cracking%20Course%20--MrSihag/01%20Course%20Introduction/1%20Course%20Introduction.mp4?a=view
https://player.hdflixcore.workers.dev/0:/Courses/Account%20Cracking%20--MrSihag/TN%20Cracking%20Course%20--MrSihag/01%20Course%20Introduction/1%20Course%20Introduction.mp4 ?a=视图
There are 2 inputs boxes on the page, so you have to write CSS path as "#content > div > div:nth-child(6) > input"
.页面上有 2 个输入框,因此您必须将 CSS 路径写为
"#content > div > div:nth-child(6) > input"
。
Code with issue.有问题的代码。
download_path = browser.find_element_by_css_selector("input").get_attribute("value")
To be replaced with.被替换为。
download_path = browser.find_element_by_css_selector("#content > div > div:nth-child(6) > input").get_attribute("value")
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.