
Looping through URL links in CSV file with Python Selenium

I've got a collection of URLs in a CSV file and I want to loop through these links, opening each one in the browser one at a time. I'm getting several different errors depending on what I try, but in every case I can't get the browser to open the links. The print shows that the links are there.

When I run my code I get the following error:

Traceback (most recent call last):
  File "/Users/Main/PycharmProjects/ScrapingBot/classpassgiit.py", line 26, in <module>
    open = browser.get(link_loop)
TypeError: Object of type bytes is not JSON serializable 

Can someone help me with my code below, in case I am missing something or doing it wrong?

My code:

import csv
from selenium import webdriver
from bs4 import BeautifulSoup as soup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as browser_wait
from selenium.webdriver.support import expected_conditions as EC
import requests

browser = webdriver.Chrome(executable_path=r'./chromedriver')

contents = []

with open('ClassPasslite.csv', 'rt') as cp_csv:
    cp_url = csv.reader(cp_csv)
    for row in cp_url:
        links = row[0]
        contents.append(links)

for link in contents:
    url_html = requests.get(links)

    for link_loop in url_html:

        print(contents)

        open = browser.get(link_loop)

Apparently, you are mixing up the names. Without a copy of the .csv file I cannot reproduce the error, so I will assume that you correctly extract the links from the text file.

In the second part of your code, you call requests.get with links (mind the plural), but links is a variable you defined in the previous section (links = row[0]), whereas link is the actual object you define in the for loop. Below you can find a version of the code that might be a helpful starting point.

Let me add, though, that using requests and selenium together makes little sense in your context: why fetch an HTML page and then loop over its elements to get other pages with selenium?

import csv
import requests
from selenium import webdriver

browser = webdriver.Chrome(executable_path=r'./chromedriver')

contents = []

with open('ClassPasslite.csv', 'rt') as cp_csv:
    cp_url = csv.reader(cp_csv)
    for row in cp_url:
        links = row[0]
        contents.append(links)

for link in contents:
    url_html = requests.get(link) # now this is singular

    # Do what you have to do here with requests, instead of selenium #
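
For example, if the goal is simply to parse the HTML of each link, a minimal sketch using requests together with BeautifulSoup (which your original script already imports) could look like the following; printing the <title> tag is only a placeholder for whatever you actually want to extract:

import csv
import requests
from bs4 import BeautifulSoup as soup

contents = []

with open('ClassPasslite.csv', 'rt') as cp_csv:
    cp_url = csv.reader(cp_csv)
    for row in cp_url:
        contents.append(row[0])               # the first column holds the URL

for link in contents:
    url_html = requests.get(link)             # fetch the raw HTML as text
    page = soup(url_html.text, 'html.parser') # parse it with BeautifulSoup
    print(link, page.title.text if page.title else 'no <title> found')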

Since you have not shown what is contained in your variable contents, I will assume that it is a list of URL strings.

As @cap.py mentioned, you are mixing things up by using requests and selenium at the same time. When you make a GET web request, the server at the destination sends you back a text response. This text can be plain text, like Hello world!, or it can be some HTML. But that HTML code has to be interpreted on your computer, which sent the request.

That's the point of selenium over requests: requests returns the text gathered from the destination (URL), while selenium asks a browser (e.g. Chrome) to gather the text and, if that text is HTML, to interpret it and give you a real, readable web page. Moreover, the browser runs the JavaScript inside the page, so dynamic pages work as well.
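
To make the difference concrete, here is a rough sketch (the URL is only a placeholder): requests gives you the raw HTML string, while selenium drives a real browser that loads the page, runs its JavaScript and then exposes the rendered source:

import requests
from selenium import webdriver

url = 'https://www.example.com'  # placeholder URL

# requests: only the raw HTML text is returned, no JavaScript is executed
raw_html = requests.get(url).text

# selenium: a real browser loads the page, runs its JavaScript,
# and exposes the rendered source afterwards
browser = webdriver.Chrome(executable_path=r'./chromedriver')
browser.get(url)
rendered_html = browser.page_source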

In the end, the only thing needed to run your code is this:

import csv
from selenium import webdriver
from bs4 import BeautifulSoup as soup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as browser_wait
from selenium.webdriver.support import expected_conditions as EC
import requests

browser = webdriver.Chrome(executable_path=r'./chromedriver')

contents = []

with open('ClassPasslite.csv', 'rt') as cp_csv:
    cp_url = csv.reader(cp_csv)
    for row in cp_url:
        links = row[0]
        contents.append(links)

#link should be something like "https://www.classpass.com/studios/forever-body-coaching-london?search-id=49534025882004019"
for link in contents:
    browser.get(link)
    # paste the code you have here

Tip: Don't forget that browsers take some time to load pages. Adding a time.sleep(3) after each browser.get(link) will help you a lot.
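
If you prefer not to rely on a fixed pause, an explicit wait does the same job more reliably. Below is a minimal sketch that continues the loop from the snippet above (contents and browser come from there); waiting for the <body> tag is only a placeholder for an element you actually need on the page:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

for link in contents:
    browser.get(link)
    # wait (up to 10 seconds) until a specific element is present;
    # the <body> tag is only a placeholder for something you actually need
    WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, 'body'))
    )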
