selenium.common.exceptions.InvalidArgumentException: Message: invalid argument error invoking get() with urls read from text file with Selenium Python
I am trying to scrape a website whose candidate URLs are stored in a CSV. So, after calling my method from a for loop, I open each URL and scrape the site's content.
But for some reason I am unable to open the URLs in the loop.
Here is my code:
from selenium import webdriver
import time
import pandas as pd

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--user-agent="Mozilla/5.0 (Windows Phone 10.0; Android 4.2.1; Microsoft; Lumia 640 XL LTE) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Mobile Safari/537.36 Edge/12.10166"')
driver = webdriver.Chrome(chrome_options=chrome_options)

csv_data = pd.read_csv("./data_new.csv")
df = pd.DataFrame(csv_data)

urls = df['url']
print(urls[:5])

def scrap_site(url):
    print("Received URL ---> ", url)
    driver.get(url)
    time.sleep(5)
    driver.quit()

for url in urls:
    print("URL ---> ", url)
    scrap_site(url)
The console error I get:
Traceback (most recent call last):
  File "/media/sf_shared_folder/scrape_project/course.py", line 56, in <module>
    scrap_site(url)
  File "/media/sf_shared_folder/scrape_project/course.py", line 35, in scrap_site
    driver.get(url)
  File "/home/mujib/anaconda3/envs/spyder/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 333, in get
    self.execute(Command.GET, {'url': url})
  File "/home/mujib/anaconda3/envs/spyder/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/home/mujib/anaconda3/envs/spyder/lib/python3.9/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
InvalidArgumentException: invalid argument
  (Session info: chrome=96.0.4664.45)
The CSV file has the following format:
url
http://www.somesite.com
http://www.someothersite.com
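A file in this shape can be loaded and inspected with pandas; a minimal sketch, using an in-memory string as a stand-in for the actual `data_new.csv`:

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for data_new.csv, matching the format above.
csv_text = "url\nhttp://www.somesite.com\nhttp://www.someothersite.com\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df['url'])  # a pandas Series with one entry per row of the 'url' column
```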
For a CSV file such as the one above:
when you read it through pandas, store it in a DataFrame, and create the list with:

urls = df['urls']
print(urls)

you will observe on printing that the list items carry the column index:
Console output:
0 https://stackoverflow.com/
1 https://www.google.com/
Name: urls, dtype: object
The URLs in this list are not valid URLs. Hence you see the error from the command:

self.execute(Command.GET, {'url': url})

Instead, you need to use tolist() to suppress the column index, as follows:

urls = df['urls'].tolist()
print(urls)
Console output:
['https://stackoverflow.com/', 'https://www.google.com/']
This list contains valid URLs that can be invoked through get() for further scraping.
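The difference between the Series and the plain list can be reproduced without a browser; a minimal sketch with hypothetical URLs:

```python
import pandas as pd

df = pd.DataFrame({'urls': ['https://stackoverflow.com/', 'https://www.google.com/']})

series = df['urls']         # a pandas Series; printing it shows the row index and dtype
print(series)

urls = df['urls'].tolist()  # a plain Python list of URL strings
print(urls)
```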
You need to move driver = webdriver.Chrome(chrome_options=chrome_options) inside the loop: once driver.quit() has been called, you have to define the driver again.
from selenium import webdriver
import time
import pandas as pd

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--user-agent="Mozilla/5.0 (Windows Phone 10.0; Android 4.2.1; Microsoft; Lumia 640 XL LTE) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Mobile Safari/537.36 Edge/12.10166"')
# driver = webdriver.Chrome(chrome_options=chrome_options)

csv_data = pd.read_csv("./data_new.csv")
df = pd.DataFrame(csv_data)

# urls = df['url']
urls = ['https://stackoverflow.com/',
        'https://www.yahoo.com/']
print(urls[:5])

def scrap_site(url):
    ############## OPEN THE DRIVER HERE ##############
    driver = webdriver.Chrome(chrome_options=chrome_options)
    ############## OPEN THE DRIVER HERE ##############
    print("Received URL ---> ", url)
    driver.get(url)
    time.sleep(5)
    driver.quit()

for url in urls:
    print("URL ---> ", url)
    scrap_site(url)