Retrieving data from URLs in a CSV file - Python

Question

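How can I modify this code to use a list of URLs from a csv, go to those pages, and then run the last section of the code to retrieve the correct data?

I have a feeling that the section that goes to the links stored in the csv and retrieves data from them is way off, but I have a csv with my target URLs listed one per row, and the last section of this code, which targets the contact details and so on, is working correctly.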

import requests
import re
from bs4 import BeautifulSoup
import csv

#Read csv
csvfile = open("gymsfinal.csv")
csvfilelist = csvfile.read()

#Get data from each url
def get_page_data():
    for page_data in csvfilelist:
        r = requests.get(page_data.strip())
        soup = BeautifulSoup(r.text, 'html.parser')
        return soup

pages = get_page_data()
'''print pages'''

#The work performed on scraped data
print soup.find("span",{"class":"wlt_shortcode_TITLE"}).text
print soup.find("span",{"class":"wlt_shortcode_map_location"}).text
print soup.find("span",{"class":"wlt_shortcode_phoneNum"}).text
print soup.find("span",{"class":"wlt_shortcode_EMAIL"}).text

th = soup.find('b',text="Category")
td = th.findNext()
for link in td.findAll('a',href=True):
    match = re.search(r'http://(\w+).(\w+).(\w+)', link.text)
    if match:
        print link.text

gyms = [name,address,phoneNum,email]
gym_data_list.append(gyms)

#Saving specific listing data to csv
with open ("xgyms.csv", "wb") as file:
    writer = csv.writer(file)
    for row in gym_data_list:
        writer.writerow(row)

Snippet of gymsfinal.csv:

http://www.gym-directory.com/listing/green-apple-wellness-centre/
http://www.gym-directory.com/listing/train-247-fitness-prahran/
http://www.gym-directory.com/listing/body-club/
http://www.gym-directory.com/listing/training-glen/

Changed to writer.writerow([row]) so that the csv data is saved without a comma between each letter.
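The comma-between-letters effect comes from csv.writer treating a bare string as a sequence of one-character fields. A minimal sketch of the difference, assuming Python 3 and writing to an in-memory buffer instead of xgyms.csv (the helper name write_rows and the value "gym" are made up for illustration):

```python
import csv
import io


def write_rows(rows):
    """Write rows with csv.writer and return the resulting text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


# A bare string is iterated character by character, one field per letter.
as_string = write_rows(["gym"])    # 'g,y,m\r\n'

# Wrapping the string in a list makes it a single field.
as_list = write_rows([["gym"]])    # 'gym\r\n'
```

Wrapping the value in a list works, but building each row as a list of fields in the first place (name, address, and so on) avoids the problem entirely.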

Answer 1

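There are a few issues here. First, you never close the first file object, which is a big no-no. You should use the with syntax that appears at the bottom of your snippet for reading the csv as well.

You are getting the error requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h? because when you read in the csv, you are reading it in as one big string with newlines in it. So when you iterate over it with for page_data in csvfilelist:, it goes through each character in the string (strings are iterable in Python). Obviously a single character is not a valid URL, so requests raises an exception. When you read in the file, it should look something like this: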

with open('gymsfinal.csv') as f:
    reader = csv.reader(f)
    csvfilelist = [ row[0] for row in reader ]
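To see why the original loop ended up requesting 'h', here is a small self-contained sketch. It assumes Python 3 and uses io.StringIO as a stand-in for the real file, with two of the URLs from the question. Skipping empty rows with if row is an addition of this sketch, not part of the snippet above; it guards against a trailing blank line, where row[0] would raise an IndexError.

```python
import csv
import io

# Stand-in for gymsfinal.csv so the sketch runs without the real file.
sample = (
    "http://www.gym-directory.com/listing/body-club/\n"
    "http://www.gym-directory.com/listing/training-glen/\n"
    "\n"
)

# What the question's code does: read() returns one big string,
# and iterating over a string yields single characters.
as_string = io.StringIO(sample).read()
first_items = [ch for ch in as_string][:4]   # ['h', 't', 't', 'p']

# What csv.reader does: each row is a list of column values.
reader = csv.reader(io.StringIO(sample))
urls = [row[0] for row in reader if row]     # skip empty rows
```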

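You should also change how get_page_data() returns its results. Currently, it only ever returns the first soup, because return exits the function on the first pass through the loop. To make it return a generator of all the soups, all you need to do is change that return to a yield.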
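The difference between return and yield inside the loop can be sketched offline. In the example below, fetch_stub is a hypothetical stand-in for the requests.get plus BeautifulSoup step, and the example.com URLs are placeholders; neither comes from the question.

```python
def fetch_stub(url):
    """Hypothetical stand-in for fetching and parsing a page, so this runs offline."""
    return "soup for " + url


def get_first_page(urls, fetch=fetch_stub):
    for url in urls:
        return fetch(url)    # return leaves the function on the first iteration


def get_page_data(urls, fetch=fetch_stub):
    for url in urls:
        yield fetch(url)     # yield hands back one result, then resumes when asked for the next


urls = ["http://a.example.com/", "http://b.example.com/"]
first = get_first_page(urls)          # only the first page is ever processed
pages = list(get_page_data(urls))     # one result per URL
```

In the real script the fetch step would be the existing two lines, requests.get(page_data.strip()) followed by BeautifulSoup(r.text, 'html.parser'). Because a generator is lazy, each page is only requested when the consuming loop asks for it.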

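Your print statements are also going to have problems. They should either go inside a for loop that looks like for soup in pages: or go inside get_page_data(). The variable soup is not defined in the context of those prints.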
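One way to arrange the loop and the csv output is sketched below. This assumes Python 3 (the question's code is Python 2) and that pages is the generator of soups from get_page_data(). The helper names text_of, extract_row and save_rows are hypothetical; the span class names are the ones from the question. Returning an empty string for a missing span is a choice made here, since soup.find returns None when nothing matches and calling .text on None would raise an AttributeError.

```python
import csv

# Span classes taken from the question's print statements.
FIELDS = [
    "wlt_shortcode_TITLE",
    "wlt_shortcode_map_location",
    "wlt_shortcode_phoneNum",
    "wlt_shortcode_EMAIL",
]


def text_of(soup, css_class):
    """Return the stripped text of the first matching span, or '' if there is none."""
    tag = soup.find("span", {"class": css_class})
    return tag.text.strip() if tag is not None else ""


def extract_row(soup):
    """Build one csv row (name, address, phone, email) from a parsed page."""
    return [text_of(soup, css_class) for css_class in FIELDS]


def save_rows(pages, out_path):
    """Write one row per parsed page to out_path."""
    # In Python 3 the csv module expects a text-mode file opened with newline="".
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for soup in pages:
            writer.writerow(extract_row(soup))
```

With the generator version of get_page_data(), the call would look like save_rows(get_page_data(), "xgyms.csv").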

Solution 1
2 2015-09-29 13:59:12
