
Python: Only writes last line of output

Trying to write a program that extracts URLs from a website. The output is good, but when I try to write the output to a file, only the last record is written. Here is the code:

import re
import urllib.request

# Retrieves URLs from the HTML source code of a website
def extractUrls(url, unique=True, sort=True, restrictToTld=None):
    # Prepend "www." if not present
    if url[0:4] != "www.":
        url = "".join(["www.",url])
    # Open a connection
    with urllib.request.urlopen("http://" + url) as h:
        # Grab the headers
        headers = h.info()
        # Default charset
        charset = "ISO-8859-1"
        # If a charset is in the headers then override the default
        for i in headers:
            match = re.search(r"charset=([\w\-]+)", headers[i], re.I)
            if match != None:
                charset = match.group(1).lower()
                break
        # Grab and decode the source code
        source = h.read().decode(charset)
        # Find all URLs in the source code
        matches = re.findall(r"http\:\/\/(www.)?([a-z0-9\-\.]+\.[a-z]{2,6})\b", source, re.I)
        # Abort if no URLs were found
        if matches == None:
            return None
        # Collect URLs
        collection = []
        # Go over URLs one by one
        for url in matches:
            url = url[1].lower()
            # If there are more than one dot then the URL contains
            # subdomain(s), which we remove
            if url.count(".") > 1:
                temp = url.split(".")
                tld = temp.pop()
                url = "".join([temp.pop(),".",tld])
            # Restrict to TLD if one is set
            if restrictToTld:
                tld = url.split(".").pop()
                if tld != restrictToTld:
                    continue
            # If only unique URLs should be returned
            if unique:
                if url not in collection:
                    collection.append(url)
            # Otherwise just add the URL to the collection
            else:
                collection.append(url)
        # Done
        return sorted(collection) if sort else collection

# Test
url = "msn.com"
print("Parent:", url)
for x in extractUrls(url):
    print("-", x)

f = open("f2.txt", "w+", 1)
f.write( x ) 
f.close()

The output is:

Parent: msn.com
- 2o7.net
- atdmt.com
- bing.com
- careerbuilder.com
- delish.com
- discoverbing.com
- discovermsn.com
- facebook.com
- foxsports.com
- foxsportsarizona.com
- foxsportssouthwest.com
- icra.org
- live.com
- microsoft.com
- msads.net
- msn.com
- msnrewards.com
- myhomemsn.com
- nbcnews.com
- northjersey.com
- outlook.com
- revsci.net
- rsac.org
- s-msn.com
- scorecardresearch.com
- skype.com
- twitter.com
- w3.org
- yardbarker.com
[Finished in 0.8s]

Only "yardbarker.com" is written to the file. I appreciate the help, thank you.
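For context on why this happens: after a `for` loop finishes, the loop variable keeps the value from its final iteration, so a single `write` call placed after the loop only ever sees the last URL. A minimal demonstration:

```python
# After a for loop ends, the loop variable retains its final value.
for x in ["a", "b", "c"]:
    pass

print(x)  # -> c  (only the last item survives the loop)
```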

url = "msn.com"
print("Parent:", url)
f = open("f2.txt", "w")
for x in extractUrls(url):
    print("-", x)
    f.write( x )
f.close()

As per the other answers, the file write needs to be inside the loop, but also try writing a newline character \n after x:

f = open("f2.txt", "w+")
for x in extractUrls(url):
    print("-", x)
    f.write( x +'\n' ) 
f.close()
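An equivalent and generally more idiomatic pattern is a with block, which closes the file automatically even if an exception is raised mid-loop. A sketch, using a placeholder list in place of the result of extractUrls(url):

```python
# Placeholder data standing in for the result of extractUrls(url)
urls = ["bing.com", "live.com", "msn.com"]

# "with" closes the file automatically, even if the loop raises
with open("f2.txt", "w") as f:
    for x in urls:
        print("-", x)
        f.write(x + "\n")  # the newline keeps one URL per line
```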

Also, the line return sorted(collection) if sort else collection has two indents where it should have one.

Also, your subdomain code might not give what you expect for hosts like www.something.com.au, which will only return com.au.
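To illustrate that caveat: keeping only the last two dot-separated labels discards the registered name when the public suffix itself has two parts (.com.au, .co.uk, and so on). A minimal sketch of one workaround, using a small hand-maintained set of two-part suffixes; a real program should instead use the Public Suffix List, for example via the tldextract library:

```python
# A tiny, deliberately incomplete set of two-part public suffixes,
# for illustration only; use the Public Suffix List in real code.
TWO_PART_SUFFIXES = {"com.au", "co.uk", "co.nz", "com.br"}

def strip_subdomains(host):
    """Reduce a hostname to its registered domain, e.g.
    www.something.com.au -> something.com.au"""
    parts = host.lower().split(".")
    if len(parts) >= 3 and ".".join(parts[-2:]) in TWO_PART_SUFFIXES:
        return ".".join(parts[-3:])
    return ".".join(parts[-2:])
```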

You need to open your file, then write each x inside the for loop.

At the end you can close the file.

f = open("f2.txt", "w+", 1)

for x in extractUrls(url):
    print("-", x)
    f.write( x ) 

f.close()
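Since every answer above writes one URL per call, it may be worth noting that the collected list can also be written in a single call. A sketch, again with a placeholder list standing in for extractUrls(url):

```python
urls = ["bing.com", "live.com", "msn.com"]  # stand-in for extractUrls(url)

with open("f2.txt", "w") as f:
    # Join once, write once; the trailing "\n" ends the file with a newline
    f.write("\n".join(urls) + "\n")
```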
