
Scraping HTML from URLs in a CSV, then printing to a CSV with Python

I am trying to scrape a date from each of a series of URLs listed in a CSV, and then output the dates to a new CSV.

I have the basic Python code working but can't figure out how to load the URLs from the CSV (instead of pulling them from a hard-coded list), scrape each one, and then write the results to a new CSV. From reading a couple of posts I think I want the csv module, but I can't get it working.

Here is my code for the scraping part:

import re
import urllib.request

exampleurls = ["http://www.domain1.com", "http://www.domain2.com", "http://www.domain3.com"]

# Compile once; a raw string keeps the backslash escapes intact
pattern = re.compile(r'on [0-9][0-9]\.[0-9][0-9]\.[0-9][0-9]')

for url in exampleurls:
    htmltext = urllib.request.urlopen(url).read().decode('utf-8', 'replace')
    dates = pattern.findall(htmltext)
    print(dates)

Any help is much appreciated!

If your CSV looks like this:

"http://www.domain1.com","other column","yet another"
"http://www.domain2.com","other column","yet another"
...

Extract the URLs like this:

import csv
import urllib.request

with open('urlFile.csv', newline='') as f:
    reader = csv.reader(f)

    for rec in reader:
        htmlfile = urllib.request.urlopen(rec[0])
        ...
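Putting the two pieces together, here is a minimal end-to-end sketch. The file names urlFile.csv and dates.csv are assumptions, as is the helper scrape_to_csv; the date format is the "on DD.MM.YY" pattern from the question.

```python
import csv
import re
import urllib.request

# Date pattern from the question, e.g. "on 01.02.14"
DATE_RE = re.compile(r'on [0-9][0-9]\.[0-9][0-9]\.[0-9][0-9]')

def extract_dates(html_text):
    """Return every 'on DD.MM.YY' match found in the page text."""
    return DATE_RE.findall(html_text)

def scrape_to_csv(in_path, out_path):
    """Read URLs from the first column of in_path; write one row per URL
    (the URL followed by any dates found on that page) to out_path."""
    with open(in_path, newline='') as infile, \
         open(out_path, 'w', newline='') as outfile:
        writer = csv.writer(outfile)
        for rec in csv.reader(infile):
            url = rec[0]
            html_text = urllib.request.urlopen(url).read().decode('utf-8', 'replace')
            writer.writerow([url] + extract_dates(html_text))
```

You would then call scrape_to_csv('urlFile.csv', 'dates.csv') to produce the output file.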

And if your URL file just looks like this:

http://www.domain1.com
http://www.domain2.com
...

You could do something even simpler with a list comprehension (strip each line's trailing newline, or urlopen will reject the URL):

urls = [line.strip() for line in open('urlFile')]

EDIT: reply to comment

You can either open an output file in Python like:

with open('myurls.csv', 'w') as f:
    ...
    for rec in reader:
        ...
        f.write(urlstring + '\n')  # write() does not add a newline itself

Or, if you're on Unix/Linux, just use print inside your script, then redirect in bash:

python your_scraping_script.py > someoutfile.csv
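If you go the redirect route, writing rows through csv.writer on sys.stdout keeps the output properly quoted CSV. The write_rows helper and the sample row below are illustrative, not part of the original answer:

```python
import csv
import sys

def write_rows(rows, stream=sys.stdout):
    """Write (url, date) tuples as CSV rows to the given stream."""
    csv.writer(stream).writerows(rows)

# Illustrative row; in the real script this comes from the scraping loop
write_rows([("http://www.domain1.com", "on 01.02.14")])
```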
