Scraping HTML from URLs in a CSV, then printing to a CSV with Python

I am trying to scrape a date from a series of URLs listed in a CSV file and then output the dates to a new CSV.

I have the basic Python code working, but I can't figure out how to load the URLs from the CSV (instead of pulling them from an array), scrape each URL, and then output the results to a new CSV. From reading a couple of posts I think I want to use Python's csv module, but I can't get it working.

Here is my code for the scraping part:

import urllib
import re

exampleurls =["http://www.domain1.com","http://www.domain2.com","http://www.domain3.com"]

i=0
while i<len(exampleurls):
    url = exampleurls[i]
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = 'on [0-9][0-9]\.[0-9][0-9]\.[0-9][0-9]'
    pattern = re.compile(regex)
    date = re.findall(pattern,htmltext)
    print date
    i+=1

Any help is much appreciated!

If your CSV looks like this:

"http://www.domain1.com","other column","yet another"
"http://www.domain2.com","other column","yet another"
...

Extract the URLs from the first column like this:

import urllib
import csv

with open('urlFile.csv') as f:
    reader = csv.reader(f)

    for rec in reader:
        htmlfile = urllib.urlopen(rec[0])
        ...
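Putting the CSV reading together with the regex from the question, a minimal sketch might look like the following. It assumes Python 3, where `urllib.urlopen` has moved to `urllib.request.urlopen`, and it assumes the pages decode as UTF-8. The names `fetch_html` and `dates_for_urls` are hypothetical helpers introduced for illustration, not part of the original code.

```python
import csv
import io
import re
from urllib.request import urlopen

# Same pattern as in the question: "on NN.NN.NN"
DATE_PATTERN = re.compile(r'on [0-9][0-9]\.[0-9][0-9]\.[0-9][0-9]')


def fetch_html(url):
    """Download a page and decode it as text (UTF-8 is an assumption)."""
    with urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')


def dates_for_urls(csv_file, fetch=fetch_html):
    """Yield (url, dates) per row, taking the URL from the first column."""
    for rec in csv.reader(csv_file):
        if not rec:  # csv.reader returns [] for blank lines
            continue
        url = rec[0]
        yield url, DATE_PATTERN.findall(fetch(url))


if __name__ == '__main__':
    # Offline demonstration with a stand-in fetcher, so no network is needed.
    sample = io.StringIO('"http://www.domain1.com","other column","yet another"\n')

    def fake_fetch(url):
        return '<p>Posted on 01.02.13</p>'

    for url, dates in dates_for_urls(sample, fetch=fake_fetch):
        print(url, dates)
```

Passing the fetcher in as an argument is only a convenience for trying the loop without hitting real sites; with a real file you would call `dates_for_urls(open('urlFile.csv', newline=''))` and leave `fetch` at its default.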

And if your URL file just looks like this:

http://www.domain1.com
http://www.domain2.com
...

You could read it even more concisely with a list comprehension like this:

urls = [x.strip() for x in open('urlFile')]  # strip() drops the trailing newline
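If the file may contain blank lines, or you want it closed promptly, a slightly longer variant can help. This is a sketch only; `read_urls` is a hypothetical helper name, and it assumes one URL per line.

```python
def read_urls(path):
    """Return the non-empty lines of a text file, stripped of whitespace."""
    # The with-block closes the file as soon as the list is built.
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]
```

You would call it as `urls = read_urls('urlFile')` and then loop over `urls` as in the question.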

EDIT: reply to comment

You can either open a file for writing in Python, like this:

f = open('myurls.csv', 'w')
...
for rec in reader:
    ...
    f.write(urlstring + '\n')  # one result per line
f.close()
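Since the question mentions the csv module, the output side can also go through `csv.writer`, which handles quoting if a value happens to contain a comma. The sketch below assumes Python 3 and assumes the results are available as (url, dates) pairs; `write_dates` is a hypothetical name chosen for illustration.

```python
import csv


def write_dates(path, results):
    """Write one row per URL: the URL followed by each date found for it."""
    # newline='' lets the csv module control line endings (Python 3).
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        for url, dates in results:
            writer.writerow([url] + list(dates))
```

For example, `write_dates('dates.csv', [('http://www.domain1.com', ['on 01.02.13'])])` would produce a file with a single row containing the URL and the date.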

Or, if you're on Unix/Linux, just use print inside your code and redirect the output in bash:

python your_scraping_script.py > someoutfile.csv
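For the redirect approach, the script only has to emit CSV-shaped lines on standard output. One possible sketch, assuming Python 3 and (url, dates) pairs as input, is shown here; `print_rows` is a hypothetical helper, and joining several dates with a semicolon is an arbitrary choice made for this example.

```python
import csv
import sys


def print_rows(results, out=sys.stdout):
    """Write each (url, dates) pair as one CSV line to the given stream."""
    # lineterminator='\n' leaves platform newline handling to the stream.
    writer = csv.writer(out, lineterminator='\n')
    for url, dates in results:
        writer.writerow([url, ';'.join(dates)])
```

Calling `print_rows(...)` at the end of the script and running it with the redirect above would then leave the rows in someoutfile.csv.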
