
Scraping and writing output to text file

I coded this scraper using Python 2.7 to fetch links from the first 3 pages of TrueLocal.com.au and write them to a text file.

When I run the program, only one link is written to the text file. What can I do so that all the returned URLs are written to the file?

import requests
from bs4 import BeautifulSoup

def tru_crawler(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.truelocal.com.au/find/car-rental/' + str(page)
        code = requests.get(url)
        text = code.text
        soup = BeautifulSoup(text)
        for link in soup.findAll('a', {'class':'name'}):
            href = 'http://www.truelocal.com.au' + link.get('href')
            fob = open('c:/test/true.txt', 'w')
            fob.write(href + '\n')
            fob.close()
            print (href)
        page += 1

#Run the function
tru_crawler(3)

Your problem is that for each link, you open the output file, write to it, then close the file again. Not only is this inefficient, but unless you open the file in "append" mode each time, it will just get overwritten. What actually happens is that the last link is left in the file and everything prior is lost.

The quick fix would be to change the open mode from 'w' to 'a' (shown below), but it'd be even better to restructure your program slightly. Right now the tru_crawler function is responsible for both crawling your site and writing output; it's better practice to have each function responsible for one thing only.
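
For reference, the quick fix is just the mode change inside the loop. Note that with 'a' the file also keeps growing across separate runs unless you clear it first:

for link in soup.findAll('a', {'class':'name'}):
    href = 'http://www.truelocal.com.au' + link.get('href')
    fob = open('c:/test/true.txt', 'a')  # 'a' appends; 'w' truncated the file on every open
    fob.write(href + '\n')
    fob.close()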

You can turn your crawl function into a generator that yields links one at a time, and then write the generated output to a file separately. Replace the three fob lines with:

    yield href + '\n'
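
With that change, the whole function becomes a generator; put together it would look roughly like this (the same fetching logic as the original, minus the file handling):

import requests
from bs4 import BeautifulSoup

def tru_crawler(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.truelocal.com.au/find/car-rental/' + str(page)
        soup = BeautifulSoup(requests.get(url).text)
        for link in soup.findAll('a', {'class':'name'}):
            href = 'http://www.truelocal.com.au' + link.get('href')
            yield href + '\n'  # hand each line to the caller instead of writing it here
        page += 1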

Then you can do the following:

lines = tru_crawler(3)
filename = 'c:/test/true.txt'
with open(filename, 'w') as handle:
    handle.writelines(lines)

Also note the use of the with statement; opening the file with with automatically closes it once the block ends, saving you from having to call close() yourself. (writelines accepts any iterable, including a generator, so the links are written out as they are produced.)


Taking the idea of generators and task separation one step further, you may notice that the tru_crawler function is also responsible for generating the list of URLs to crawl. That too can be separated out, if your crawler accepts an iterable of URLs instead of creating them itself. Something like:

from urlparse import urljoin  # Python 2; in Python 3 this lives in urllib.parse

def make_urls(base_url, pages):
    for page in range(1, pages+1):
        yield base_url + str(page)

def crawler(urls):
    for url in urls:
        # fetch, parse, and yield hrefs, as in the original loop;
        # urljoin builds absolute links so the crawler isn't tied to one site
        soup = BeautifulSoup(requests.get(url).text)
        for link in soup.findAll('a', {'class':'name'}):
            yield urljoin(url, link.get('href')) + '\n'

Then, instead of calling tru_crawler(3), it becomes:

urls = make_urls('http://www.truelocal.com.au/find/car-rental/', 3)
lines = crawler(urls)

and then proceed as above.

Now if you want to crawl other sites, you can just change your make_urls call, or create different generators for other URL patterns, and the rest of your code doesn't need to change!
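
For example, here is a hypothetical generator for a site that paginates with a query string rather than a trailing path segment (the make_query_urls name and the '?page=' pattern are illustrative assumptions, not from the original site):

def make_query_urls(base_url, pages):
    # hypothetical pattern: base_url?page=1, base_url?page=2, ...
    for page in range(1, pages+1):
        yield base_url + '?page=' + str(page)

urls = make_query_urls('http://example.com/listings', 3)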

By default 'w' is truncating mode and you may need append mode. See: https://docs.python.org/2/library/functions.html#open

Appending your hrefs to a list inside the while loop and then writing them to the file afterwards would be more readable. Or, as suggested, use yield for efficiency.

Something like:

with open('c:/test/true.txt', 'w') as fob:
    fob.writelines(yourlistofhref)

https://docs.python.org/2/library/stdtypes.html#file.writelines
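
A minimal end-to-end sketch of that list-based approach, reusing the names and logic from the original code:

import requests
from bs4 import BeautifulSoup

yourlistofhref = []
page = 1
while page <= 3:
    url = 'http://www.truelocal.com.au/find/car-rental/' + str(page)
    soup = BeautifulSoup(requests.get(url).text)
    for link in soup.findAll('a', {'class':'name'}):
        # collect the links instead of writing inside the loop
        yourlistofhref.append('http://www.truelocal.com.au' + link.get('href') + '\n')
    page += 1

with open('c:/test/true.txt', 'w') as fob:
    fob.writelines(yourlistofhref)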
