Reading the last n lines of a file in Ruby?

I need to read the last 25 lines from a file (for displaying the most recent log entries). Is there any way in Ruby to start at the end of a file and read it backwards?

If you're on a *nix system that has tail, you can cheat like this:

last_25_lines = `tail -n 25 whatever.txt`
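
If the file name comes from a variable, interpolating it into backticks sends it through the shell. Here is a minimal sketch that avoids that, assuming tail is on the PATH; the helper name tail_lines is made up for illustration:

```ruby
# Hypothetical helper: run the system tail without going through a shell.
# Assumes a *nix system with `tail` on the PATH.
def tail_lines(path, n = 25)
  # The array form of IO.popen hands each argument to tail as-is,
  # so spaces or quotes in the path need no escaping.
  IO.popen(["tail", "-n", n.to_s, path], &:read)
end
```

It returns the output as a single string, just like the backtick version.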

Is the file large enough that you need to avoid reading the whole thing? If not, you could just do

IO.readlines("file.log").last(25)

If it is too big, you may need to use IO#seek to read from near the end of the file, and continue seeking toward the beginning until you've seen 25 lines.

There is a library for Ruby called File::Tail. This can get you the last N lines of a file just like the UNIX tail utility.

Given benchmarks like these (run on a text file just over 11 MB), I assume the UNIX version of tail has some seek optimization in place:

[john@awesome]$du -sh 11M.txt
11M     11M.txt
[john@awesome]$time tail -n 25 11M.txt
/sbin/ypbind
/sbin/arptables
/sbin/arptables-save
/sbin/change_console
/sbin/mount.vmhgfs
/misc
/csait
/csait/course
/.autofsck
/~
/usb
/cdrom
/homebk
/staff
/staff/faculty
/staff/faculty/darlinr
/staff/csadm
/staff/csadm/service_monitor.sh
/staff/csadm/.bash_history
/staff/csadm/mysql5
/staff/csadm/mysql5/MySQL-server-community-5.0.45-0.rhel5.i386.rpm
/staff/csadm/glibc-common-2.3.4-2.39.i386.rpm
/staff/csadm/glibc-2.3.4-2.39.i386.rpm
/staff/csadm/csunixdb.tgz
/staff/csadm/glibc-headers-2.3.4-2.39.i386.rpm

real    0m0.012s
user    0m0.000s
sys     0m0.010s

I can only imagine the Ruby library uses a similar method.

Edit:

For Pax's curiosity, here is the same file piped through cat, so that tail cannot seek and has to read everything:

[john@awesome]$time cat 11M.txt | tail -n 25
/sbin/ypbind
/sbin/arptables
/sbin/arptables-save
/sbin/change_console
/sbin/mount.vmhgfs
/misc
/csait
/csait/course
/.autofsck
/~
/usb
/cdrom
/homebk
/staff
/staff/faculty
/staff/faculty/darlinr
/staff/csadm
/staff/csadm/service_monitor.sh
/staff/csadm/.bash_history
/staff/csadm/mysql5
/staff/csadm/mysql5/MySQL-server-community-5.0.45-0.rhel5.i386.rpm
/staff/csadm/glibc-common-2.3.4-2.39.i386.rpm
/staff/csadm/glibc-2.3.4-2.39.i386.rpm
/staff/csadm/csunixdb.tgz
/staff/csadm/glibc-headers-2.3.4-2.39.i386.rpm

real    0m0.350s
user    0m0.000s
sys     0m0.130s

Still under a second, but if there are a lot of file operations this makes a big difference.

Improved version of manveru's excellent seek-based solution. This one returns exactly n lines.

class File

  def tail(n)
    buffer = 1024
    idx = size # start at the end of the file and walk backwards
    chunks = []
    lines = 0

    # n + 1 newlines are needed so the first returned line is complete.
    while lines < (n + 1) && idx > 0
      to_read = [buffer, idx].min # never seek before the start of the file
      idx -= to_read
      seek(idx)
      chunk = read(to_read)
      lines += chunk.count("\n")
      chunks.unshift chunk
    end

    # Array#last returns fewer than n lines if the file is shorter.
    chunks.join.split(/\n/).last(n)
  end
end

I just wrote a quick implementation with #seek:

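class File
  def tail(n)
    buffer = 1024
    idx = [size - buffer, 0].max # clamp so small files don't seek past EOF
    chunks = []
    lines = 0

    begin
      seek(idx)
      chunk = read(buffer) || "" # read returns nil on an empty file
      lines += chunk.count("\n")
      chunks.unshift chunk
      buffer = [buffer, idx].min # shrink the final read so chunks don't overlap
      idx -= buffer
    end while lines <= n && buffer > 0

    chunks.join.lines.last(n).join
  end
end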

File.open('rpn-calculator.rb') do |f|
  p f.tail(10)
end

Here's a version of tail that doesn't store any buffers in memory while you go, but instead uses "pointers". It also does bounds checking so you don't end up seeking to a negative offset (for example, if you have more to read but less than your chunk size left).

def tail(path, n)
  # The block form closes the file when done.
  File.open(path, "r") do |file|
    buffer_s = 512
    line_count = 0
    file.seek(0, IO::SEEK_END)

    offset = file.pos # we start at the end

    # Counts n + 1 newlines, so this assumes the file ends with a newline.
    while line_count <= n && offset > 0
      to_read = [offset, buffer_s].min

      file.seek(offset - to_read)
      data = file.read(to_read)

      data.reverse.each_char do |c|
        if c == "\n"
          line_count += 1
          # offset already points just past this newline
          break if line_count > n
        end
        offset -= 1
      end
    end

    file.seek(offset)
    file.read
  end
end

Test cases are at https://gist.github.com/shaiguitar/6d926587e98fc8a5e301

I can't vouch for Ruby but most of these languages follow the C idiom of file I/O. That means there's no way to do what you ask other than searching. This usually takes one of two approaches.

  • Starting at the start of the file and scanning it all, remembering the most recent 25 lines. Then, when you hit end of file, print them out.
  • A similar approach but attempting to seek to a best-guess location first. That means seeking to (for example) end of file minus 4000 characters, then doing exactly what you did in the first approach with the proviso that, if you didn't get 25 lines, you have to back up and try again (eg, to end of file minus 5000 characters).
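
The first approach might look like this in Ruby. This is only a sketch; the helper name last_lines_by_scanning is invented for illustration:

```ruby
# Hypothetical helper: scan the whole file once, keeping only the most
# recent n lines in memory.
def last_lines_by_scanning(path, n = 25)
  recent = []
  File.foreach(path) do |line|
    recent << line
    recent.shift if recent.size > n # drop the oldest remembered line
  end
  recent
end

# Example use with an assumed file name:
# puts last_lines_by_scanning("file.log")
```

Memory use stays bounded by n lines, but every byte of the file is still read.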

The second way is the one I prefer since, if you choose your first offset wisely, you'll almost certainly only need one shot at it. Log files still tend to have fixed maximum line lengths (I think coders still have a propensity for 80-column files long after their usefulness has degraded). I tend to choose number of lines desired multiplied by 132 as my offset.

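And from a cursory glance at the Ruby docs online, it looks like it does follow the C idiom. You would use "ios.seek(25*-132,IO::SEEK_END)" if you were to follow my advice, then read forward from there. Note that seeking to before the start of the file raises an error, so a file shorter than that offset has to be read from the beginning instead.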
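
A rough sketch of the best-guess approach with the back-up-and-retry step, under a few assumptions of mine: the helper name last_lines_by_guess is hypothetical, the guess doubles on each retry rather than growing by a fixed amount, and the file is opened in binary mode so byte offsets are exact.

```ruby
# Hypothetical helper: seek to a guessed offset from the end, read forward,
# and back up further if the guess did not cover n complete lines.
def last_lines_by_guess(path, n = 25, bytes_per_line = 132)
  # Binary mode: offsets are plain bytes; strings come back as ASCII-8BIT.
  File.open(path, "rb") do |f|
    size = f.size
    guess = [n * bytes_per_line, 1].max
    loop do
      start = [size - guess, 0].max # clamp: seeking before byte 0 raises
      f.seek(start)
      lines = f.read.lines
      # Unless we started at byte 0, the first line may be cut off mid-line.
      lines.shift if start > 0
      return lines.last(n) if lines.size >= n || start.zero?
      guess *= 2 # not enough lines yet: back up further and try again
    end
  end
end
```

With a well-chosen bytes_per_line the loop body runs once; a bad guess only costs extra reads.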

I implemented a variation on Donald's code that works when n is larger than the number of lines in the file:

class MyFile < File

  def tail(n)
    buffer = 20000

    # Negative indices are not allowed:
    idx = [size - buffer, 0].max

    chunks = []
    lines = 0
    begin
      seek(idx)
      chunk = read(buffer)

      # Handle condition when file is empty:
      lines += chunk.nil? ? 0 : chunk.count("\n")

      chunks.unshift chunk

      # Limit next buffer's size when we've reached the start of the file,
      # to ensure two consecutive buffers don't overlap content,
      # and to ensure idx doesn't become negative:
      buffer = [buffer, idx].min

      idx -= buffer
    end while (lines < ( n + 1 )) && (pos != 0)

    # Array#last returns at most n lines, so asking for more lines
    # than the file contains is safe:
    chunks.join.split(/\n/).last(n)

  end
end

How about:

lines = []
# File.foreach closes the file once every line has been read.
File.foreach("file.txt") do |line|
  lines << line
end

# Prints the 25 most recent lines, newest first.
lines.reverse.first(25).each do |line|
  puts line
end

Performance would be awful on a big file, since it reads every line into memory and then iterates a second time. The better approach, already mentioned, is to read through the file while keeping only the last 25 lines in memory, then display those. But this was just an alternative thought.
