
Ruby will open file but not write to it?

I'm trying to create a basic Ruby scraper that will grab all words 8 letters or longer from HTML source code. Then it saves these in a file corresponding to the first character of the word. Seems simple, huh?

    re = /\w{8,}/
    cre = /[a-z0-9]/
    a = b.html    #This grabs the html from the browser
    matchx = a.scan(re)
    matchx.each do |xx|
        word = xx.to_s.downcase.chomp
        fchar = word[0].chr

        if (fchar.match(cre)) #Not sure if I need this
            @pcount += 1
            fname = @WordsFName+fchar   #@WordsFName is a prefix
            tmpF = File.open(fname,"a+")

            #Check for duplicates, if not write to file
            exists = File.readlines(fname).any? { |li| li[word] }
            if (!exists)                    
                tmpF.write(word+"\n")
                print word 
                @wcount += 1
            end
        end

    end

Ruby successfully grabs all the words, gets the first character, and opens all the necessary files, but fails to write to them. Also, the print method prints all words including duplicates, but inspecting the any? method in irb gave no problems.

File#write is buffered, and you do nothing to flush or close tmpF between your write and the File.readlines(fname), so the readlines will not see the output until it's flushed. I don't see any call to close on tmpF, so it's not clear when the written data will get flushed, other than at program exit or when the file object is finalized by GC some time after tmpF goes out of scope.

You could manually flush after the write with tmpF.flush, or make that the default behavior with tmpF.sync = true after the open.
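The two options can be sketched as follows. This is a minimal illustration, not the asker's code: the scratch path and the sample words are made up for the demo.

```ruby
require "tmpdir"

# Hypothetical scratch file used only for this demonstration.
path = File.join(Dir.mktmpdir, "words_demo")

f = File.open(path, "a+")
f.write("alphabet\n")
# The data is usually still sitting in Ruby's write buffer at this point,
# so a separate read of the same path typically sees nothing yet.
before_flush = File.readlines(path)

# Option 1: flush explicitly after the write.
f.flush
after_flush = File.readlines(path)

# Option 2: turn on sync mode so every later write is passed through immediately.
f.sync = true
f.write("baseball\n")
after_sync = File.readlines(path)

f.close
```

After the flush, `after_flush` holds the first word, and with sync mode on, `after_sync` holds both words without any further flush call.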

Note that as each file gets bigger, the cost of your dup check is going to balloon, as it rereads the whole file for every word. If the word set fits in memory, consider just keeping a hash of words you've seen; if it's bigger than can be stored in memory, consider a key-value store instead of rereading a serial file every time.
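A rough sketch of the in-memory approach, using a Set rather than a bare hash. The `save_words` helper, its parameters, and the sample words are hypothetical and only stand in for the loop in the question; the set here covers a single run, so files left over from earlier runs would have to be read once at startup to seed it.

```ruby
require "set"
require "tmpdir"

# Hypothetical helper: appends each new word to the file named
# prefix + first character, skipping words already seen in this run.
def save_words(words, prefix, seen = Set.new)
  written = 0
  words.each do |raw|
    word = raw.downcase
    next unless word.match?(/\A[a-z0-9]/)  # same first-character filter as the question
    next unless seen.add?(word)            # add? returns nil if the word was already in the set
    # The block form closes the file, and so flushes it, on every iteration.
    File.open(prefix + word[0], "a") { |f| f.puts(word) }
    written += 1
  end
  written
end

# Usage with made-up words and a scratch directory.
dir = Dir.mktmpdir
count = save_words(%w[Alphabet alphabet baseball], File.join(dir, "words_"))
```

The duplicate check becomes a set lookup instead of a full reread of the file, and no file handle is left open between iterations.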

I played around in irb to understand the flushing behavior. The main problem with the OP's code is that there's no explicit or implicit flush or close on the tmpF file. So the partial writes, which are likely smaller than the buffer size, only get written when the tmpF File object gets garbage collected or upon program exit. tmpF gets assigned a newly opened file object each time through the loop, so the files opened on prior iterations only get flushed when they get finalized at GC.

    irb(main):001:0> t=File.open('zzz','a+')
    => #<File:zzz>
    irb(main):002:0> t.write '123'
    => 3
    irb(main):003:0> File.readlines('zzz')
    => []
    irb(main):004:0> t=File.open('zzz','a+')
    => #<File:zzz>
    irb(main):005:0> t.write '456'
    => 3
    irb(main):006:0> File.readlines('zzz')
    => []
    irb(main):007:0> t.close
    => nil
    irb(main):008:0> File.readlines('zzz')
    => ["456"]
    irb(main):009:0> t=File.open('zzz','a+')
    => #<File:zzz>
    irb(main):010:0> t.write '789'
    => 3
    irb(main):011:0> File.readlines('zzz')
    => ["456"]
    irb(main):012:0> t.flush
    => #<File:zzz>
    irb(main):013:0> File.readlines('zzz')
    => ["456789"]
    irb(main):014:0> GC.start
    => nil
    irb(main):015:0> File.readlines('zzz')
    => ["456789123"]
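In the session above, '123' only reaches the file after GC.start finalizes the first, abandoned handle, which is why it lands last. One way to avoid depending on GC is the block form of File.open, which closes the file when the block exits. A small sketch, with a scratch path made up for the demo:

```ruby
require "tmpdir"

# Hypothetical scratch file used only for this demonstration.
path = File.join(Dir.mktmpdir, "zzz")

%w[123 456 789].each do |chunk|
  # The file is closed, and therefore flushed, as soon as the block returns,
  # so each write is on disk before the next iteration opens the file again.
  File.open(path, "a") { |f| f.write(chunk) }
end

contents = File.read(path)
```

Here the chunks end up in the file in the order they were written, because no handle is left open with unflushed data.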
