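Python: Only writes last line of output
I'm trying to write a program that extracts URLs from a website. The output looks fine, but when I try to write it to a file, only the last record is written. Here is the code: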
import re
import urllib.request
# Retrieves URLs from the HTML source code of a website
def extractUrls(url, unique=True, sort=True, restrictToTld=None):
    # Prepend "www." if not present
    if url[0:4] != "www.":
        url = "".join(["www.",url])
    # Open a connection
    with urllib.request.urlopen("http://" + url) as h:
        # Grab the headers
        headers = h.info()
        # Default charset
        charset = "ISO-8859-1"
        # If a charset is in the headers then override the default
        for i in headers:
            match = re.search(r"charset=([\w\-]+)", headers[i], re.I)
            if match != None:
                charset = match.group(1).lower()
                break
        # Grab and decode the source code
        source = h.read().decode(charset)
        # Find all URLs in the source code
        matches = re.findall(r"http\:\/\/(www.)?([a-z0-9\-\.]+\.[a-z]{2,6})\b", source, re.I)
        # Abort if no URLs were found
        if matches == None:
            return None
        # Collect URLs
        collection = []
        # Go over URLs one by one
        for url in matches:
            url = url[1].lower()
            # If there are more than one dot then the URL contains
            # subdomain(s), which we remove
            if url.count(".") > 1:
                temp = url.split(".")
                tld = temp.pop()
                url = "".join([temp.pop(),".",tld])
            # Restrict to TLD if one is set
            if restrictToTld:
                tld = url.split(".").pop()
                if tld != restrictToTld:
                    continue
            # If only unique URLs should be returned
            if unique:
                if url not in collection:
                    collection.append(url)
            # Otherwise just add the URL to the collection
            else:
                collection.append(url)
        # Done
        return sorted(collection) if sort else collection
# Test
url = "msn.com"
print("Parent:", url)
for x in extractUrls(url):
    print("-", x)
    f = open("f2.txt", "w+", 1)
    f.write( x )
    f.close()
The output is:
Parent: msn.com
- 2o7.net
- atdmt.com
- bing.com
- careerbuilder.com
- delish.com
- discoverbing.com
- discovermsn.com
- facebook.com
- foxsports.com
- foxsportsarizona.com
- foxsportssouthwest.com
- icra.org
- live.com
- microsoft.com
- msads.net
- msn.com
- msnrewards.com
- myhomemsn.com
- nbcnews.com
- northjersey.com
- outlook.com
- revsci.net
- rsac.org
- s-msn.com
- scorecardresearch.com
- skype.com
- twitter.com
- w3.org
- yardbarker.com
[Finished in 0.8s]
Only "yardbarker.com" is written to the file. Any help is appreciated, thanks.
url = "msn.com"
print("Parent:", url)
f = open("f2.txt", "w")
for x in extractUrls(url):
    print("-", x)
    f.write( x )
f.close()
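As the other answers note, open the file once and do the writing inside the loop. Also write a newline character \n after each x so that every URL lands on its own line: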
f = open("f2.txt", "w+")
for x in extractUrls(url):
    print("-", x)
    f.write( x +'\n' )
f.close()
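The same idea can be sketched with a with block, which closes the file even if the loop raises. This is only an illustration: the list urls is a hard-coded stand-in for the result of extractUrls(url) so that it runs without network access, and the helper name write_lines and the temporary path are assumptions, not part of the question.

```python
import os
import tempfile


def write_lines(path, items):
    """Write each item on its own line; the file is opened once, before the loop."""
    with open(path, "w") as f:
        for item in items:
            f.write(item + "\n")


# Stand-in for extractUrls("msn.com"); hard-coded to avoid a network call.
urls = ["bing.com", "live.com", "msn.com"]

# Hypothetical location: a fresh temporary directory instead of the working dir.
path = os.path.join(tempfile.mkdtemp(), "f2.txt")
write_lines(path, urls)

# Read the file back to confirm that all three lines were kept.
with open(path) as f:
    print(f.read(), end="")
```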
Also, the line return sorted(collection) if sort else collection has two levels of indentation where it should have one.
Likewise, your subdomain code may not give what you expect for hosts such as www.something.com.au, which will come back as just com.au.
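To make the subdomain point concrete, here is a small sketch. strip_subdomains mirrors the question's "keep the last two labels" rule. registered_domain is one possible workaround, and both its name and the tiny TWO_LABEL_SUFFIXES set are hypothetical. A real program would need a far more complete suffix list.

```python
def strip_subdomains(host):
    """Mirror the question's rule: keep only the last two dot-separated labels."""
    parts = host.lower().split(".")
    return ".".join(parts[-2:]) if len(parts) > 2 else host.lower()


# Hypothetical, deliberately tiny set of two-label suffixes (an assumption
# made for this example only).
TWO_LABEL_SUFFIXES = {"com.au", "co.uk"}


def registered_domain(host):
    """Keep three labels when the host ends in a known two-label suffix."""
    parts = host.lower().split(".")
    if len(parts) > 2 and ".".join(parts[-2:]) in TWO_LABEL_SUFFIXES:
        return ".".join(parts[-3:])
    return ".".join(parts[-2:])


print(strip_subdomains("something.com.au"))   # com.au
print(registered_domain("something.com.au"))  # something.com.au
print(registered_domain("blogs.msn.com"))     # msn.com
```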
You need to open the file first and then write each x inside the for loop.
Finally, you can close the file.
f = open("f2.txt", "w+", 1)
for x in extractUrls(url):
    print("-", x)
    f.write( x )
f.close()
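The reason only the last record survives is that opening a file in "w" or "w+" mode truncates it, so reopening it on every iteration discards what the previous iteration wrote. A minimal sketch of the difference, assuming a hard-coded urls list in place of extractUrls(url) and a temporary path chosen just for this example:

```python
import os
import tempfile

urls = ["bing.com", "live.com", "msn.com"]  # stand-in for extractUrls(url)
path = os.path.join(tempfile.mkdtemp(), "f2.txt")  # hypothetical location

# Opening inside the loop: "w+" truncates the file on every iteration.
for x in urls:
    f = open(path, "w+")
    f.write(x + "\n")
    f.close()
with open(path) as f:
    inside = f.read()

# Opening once before the loop: every write is kept.
f = open(path, "w+")
for x in urls:
    f.write(x + "\n")
f.close()
with open(path) as f:
    outside = f.read()

print(repr(inside))   # 'msn.com\n'
print(repr(outside))  # 'bing.com\nlive.com\nmsn.com\n'
```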