
Reading URLs from a file in Python

I cannot read the URLs from the txt file. I want to read and open the URL addresses in the txt file one by one, and extract each page's title with a regex from the source of the URL addresses. Error messages:

Traceback (most recent call last):
  File "Mypy.py", line 14, in <module>
    UrlsOpen = urllib2.urlopen(listSplit)
  File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 420, in open
    req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'

Mypy.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import requests
import urllib2
import threading

UrlListFile = open("Url.txt","r") 
UrlListRead = UrlListFile.read() 
UrlListFile.close() 
listSplit = UrlListRead.split('\r\n')


UrlsOpen = urllib2.urlopen(listSplit)
ReadSource = UrlsOpen.read().decode('utf-8')
regex = '<title.*?>(.+?)</title>'
comp = re.compile(regex)
links = re.findall(comp,ReadSource)
for i in links:
    SaveDataFiles = open("SaveDataMyFile.txt","w")
    SaveDataFiles.write(i)
SaveDataFiles.close()

When you call urllib2.urlopen(listSplit), listSplit is a list, but urlopen() needs a string or a Request object. It's a simple fix: iterate over listSplit and pass each URL to urlopen() individually instead of passing the entire list.
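
In its simplest form the fix is just a loop:

for url in listSplit:
    UrlsOpen = urllib2.urlopen(url)
    ReadSource = UrlsOpen.read().decode('utf-8')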

Also, re.findall() will return a list for each ReadSource searched. You can handle this in a couple of ways:

I chose to handle it by just making a list of lists:

websites = [[link, link], [link], [link, link, link]]

and iterating over both lists. This way you can do something specific with each website's list of results (write each site's results to a different file, etc.), as sketched below.
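
A rough sketch of that (the per-site file names here are my own assumption, not part of the original code):

for index, website in enumerate(websites):
    # one output file per website; name scheme is hypothetical
    with open("titles_site_%d.txt" % index, "w") as out:
        for title in website:
            out.write(title.encode('utf-8') + "\n")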

You could also flatten the websites list so it contains the links directly, instead of sub-lists that contain the links:

links = [link, link, link, link]
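
A minimal sketch of that variant, using list.extend() so the results stay flat:

links = []
for url in listSplit:
    ReadSource = urllib2.urlopen(url).read().decode('utf-8')
    # extend() appends the individual matches rather than the sub-list
    links.extend(re.findall(comp, ReadSource))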

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import urllib2
from pprint import pprint

UrlListFile = open("Url.txt", "r")
UrlListRead = UrlListFile.read()
UrlListFile.close()
# splitlines() handles both \n and \r\n line endings
listSplit = UrlListRead.splitlines()
pprint(listSplit)
regex = '<title.*?>(.+?)</title>'
comp = re.compile(regex)
websites = []
for url in listSplit:
    # pass each URL string to urlopen() individually
    UrlsOpen = urllib2.urlopen(url)
    ReadSource = UrlsOpen.read().decode('utf-8')
    websites.append(re.findall(comp, ReadSource))

with open("SaveDataMyFile.txt", "w") as SaveDataFiles:
    for website in websites:
        for link in website:
            pprint(link)
            # write one title per line; the with block closes the file automatically
            SaveDataFiles.write(link.encode('utf-8') + "\n")
