Unable to fix - ValueError: unknown url type: Link

Question

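I am currently running this code to scrape article URLs into a CSV file, and then to visit those URLs (read back from the CSV file) and scrape each article's text into a text file.

I can scrape the links into the CSV file, but I cannot read the CSV file back to scrape the additional information (the text file is never created), and I get a ValueError.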

import csv
from lxml import html
from time import sleep
import requests
from bs4 import BeautifulSoup
import urllib
import urllib2 
from random import randint

outputFile = open("All_links.csv", r'wb')
fileWriter = csv.writer(outputFile)

fileWriter.writerow(["Link"])
#fileWriter.writerow(["Sl. No.", "Page Number", "Link"])

url1 = 'https://www.marketingweek.com/page/'
url2 = '/?s=big+data'

sl_no = 1

#iterating from 1st page through 361th page
for i in xrange(1, 361):

    #generating final url to be scraped using page number
    url = url1 + str(i) + url2

    #Fetching page
    response = requests.get(url)
    sleep(randint(10, 20))
    #using html parser
    htmlContent = html.fromstring(response.content)

    #Capturing all 'a' tags under h2 tag with class 'hentry-title entry-title'
    page_links = htmlContent.xpath('//div[@class = "archive-constraint"]//h2[@class = "hentry-title entry-title"]/a/@href')
    for page_link in page_links:
        print page_link
        fileWriter.writerow([page_link])
        sl_no += 1

with open('All_links.csv', 'rb') as f1:
    f1.seek(0)
    reader = csv.reader(f1)

    for line in reader:
        url = line[0]       
        soup = BeautifulSoup(urllib2.urlopen(url))


        with open('LinksOutput.txt', 'a+') as f2:
            for tag in soup.find_all('p'):
                f2.write(tag.text.encode('utf-8') + '\n')

This is the error I get:

  File "c:\users\rrj17\documents\visual studio 2015\Projects\webscrape\webscrape\webscrape.py", line 47, in <module>
    soup = BeautifulSoup(urllib2.urlopen(url))
  File "C:\Python27\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 421, in open
    protocol = req.get_type()
  File "C:\Python27\lib\urllib2.py", line 283, in get_type
    raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: Link

Any help would be appreciated.

Answer 1

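Try skipping the first row of the CSV file. You are most likely trying to open the header as a URL without realizing it: the script writes "Link" as the header row, and "Link" is exactly the value reported in the error message.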

with open('All_links.csv', 'rb') as f1:
    reader = csv.reader(f1)
    next(reader) # read the header and send it to oblivion

    for line in reader: # NOW start reading
        ...

You also don't need f1.seek(0), because a file opened in read mode already starts at the beginning of the file.
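For readers on Python 3, the same idea might look roughly like the sketch below. It is not the asker's script: SAMPLE_CSV and read_links are hypothetical names used only for illustration, the in-memory string stands in for All_links.csv, and the example URLs are made up. The scheme check is an extra safeguard that goes beyond the answer above; it simply drops any cell that is not an http(s) URL before it could reach urlopen.

```python
import csv
import io
from urllib.parse import urlparse

# Stand-in for All_links.csv: a header row followed by made-up example URLs.
SAMPLE_CSV = "Link\nhttps://example.com/a\nhttps://example.com/b\n"


def read_links(csv_file):
    """Return the URLs from an open CSV file, skipping the header row."""
    reader = csv.reader(csv_file)
    # Discard the header row; the None default avoids StopIteration on an empty file.
    next(reader, None)

    links = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        url = row[0]
        # Assumption: only http(s) URLs are wanted, so anything else is ignored.
        if urlparse(url).scheme in ("http", "https"):
            links.append(url)
    return links


links = read_links(io.StringIO(SAMPLE_CSV))
print(links)
```

With a real file, the call would presumably become read_links(open('All_links.csv', newline='')) inside a with block, and each returned URL could then be fetched and parsed as in the question.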
