
How to buffer and batch-process tail -f output?

I need to monitor a file and send whatever is written to it to a web service. I'm trying to achieve a clean and simple solution with bash scripting, e.g.:

#!/bin/bash

# listen for changes on file specified as first argument
tail -F "$1" | while read LINE
do
  curl http://service.com/endpoint --data "${LINE}"
done

This works perfectly, as in: every line which is appended will be POSTed to http://service.com/endpoint. However, I don't really like the fact that if many lines are appended in a short time, I will make just as many HTTP requests and possibly overload the service.

Is there a smart way to buffer the operations? I can think of something like:

buffer = EMPTY
while LINES are read:
  add LINE to buffer
  if buffer has more than X LINES
    send POST
  fi
done

But in the above solution, if one line is posted per hour, I will only get updates every X hours, which is not acceptable. Another similar solution would be to "time" within the while loop: if X seconds have passed then send the buffer, otherwise wait. But the last line of a stream may be held indefinitely, since the while loop is triggered only when a new line is added to the file.
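Something like this rough sketch (X_SECONDS is just a placeholder) shows the problem: the read blocks, so the time check only runs when a new line arrives:

#!/bin/bash
# flawed "timed" variant: the time check happens only AFTER a new line arrives,
# so buffered lines can be held indefinitely on a quiet file
X_SECONDS=${X_SECONDS:-60}   # placeholder flush interval
last_sent=$(date +%s)
buffer=
tail -F "$1" | while IFS= read -r LINE   # blocks here until the next line
do
  buffer+="${LINE}"$'\n'
  if (( $(date +%s) - last_sent >= X_SECONDS )); then
    curl http://service.com/endpoint --data "${buffer}"
    buffer=
    last_sent=$(date +%s)
  fi
done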

The objective is to do this with minimal bash scripting and without using a second process. By second process I mean: process 1 gets the output from tail -f and stores it, and process 2 periodically checks what is stored and sends a POST if more than X seconds have elapsed.

I am curious whether this is possible with some clever trick?

Thanks!

Literally putting your pseudocode into code:

# add stdbuf -oL if you care
tail -F "$1" | {
    # buffer = EMPTY
    buffer=
    # while LINES are read:
    while IFS= read -r line; do
      # add LINE to buffer
      buffer+="$line"$'\n'
      # if buffer has more than X LINES
      # TODO: cache the count of lines in a variable to save cpu
      if [ $(wc -l <<<"$buffer") -gt "$x_lines" ]; then
          # send POST
          # TODO: remove additional newline on the end of buffer, if needed
          curl http://service.com/endpoint --data "${buffer}"
          buffer=
      fi
    done
}

Removing the newline at the end of the buffer, or, for example, keeping the count of lines in a separate counter to save CPU, is left for others.
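For completeness, a sketch of that variant - a counter instead of the wc -l call, and ${buffer%$'\n'} to drop the trailing newline; x_lines is still assumed to be set by you:

tail -F "$1" | {
    buffer=
    lines=0
    while IFS= read -r line; do
      buffer+="$line"$'\n'
      lines=$((lines + 1))
      if (( lines >= x_lines )); then
          # strip the trailing newline only when sending
          curl http://service.com/endpoint --data "${buffer%$'\n'}"
          buffer=
          lines=0
      fi
    done
}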

Notes:

  • Uppercase variable names are by convention reserved for global, exported variables.
  • while read LINE will remove leading and trailing whitespace from the line. Use while IFS= read -r line to read the whole line. More info in the BashFAQ entry on how to read a file line by line.
  • With one line per request, I believe you could just use xargs, like tail -F "$1" | xargs -d $'\n' -n1 curl http://service.com/endpoint --data (see the example just below).
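For example, to try that variant locally (the log path and the httpbin.org endpoint are just stand-ins; the -d option requires GNU xargs):

# one POST per appended line; -d $'\n' makes xargs split on newlines only
tail -F /var/log/app.log | xargs -d $'\n' -n1 curl -s https://httpbin.org/post --data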

To buffer with time as well, put a timeout on the reading - either with the bash extension read -t 0.1, or by timing out the whole read with timeout 1 cat.
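A minimal sketch of the read -t idea - note that this flushes after a 5-second lull in the input (a debounce), not on a hard deadline, and the endpoint and interval are placeholders:

tail -F "$1" | {
    buffer=
    while true; do
        if IFS= read -r -t 5 line; then
            buffer+=$line$'\n'
        else
            ret=$?
            # flush whatever accumulated; ret > 128 means timeout, otherwise EOF
            if [ -n "$buffer" ]; then
                curl http://service.com/endpoint --data "$buffer"
                buffer=
            fi
            (( ret > 128 )) || break
        fi
    done
}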

To limit in both ways, by the number of lines and with a timeout, I once wrote a badly named script called ratelimit.sh (badly named, because it does not limit rate...) that does exactly that. It reads lines, and if either the count of lines or the timeout is reached, it flushes its buffer with an additional output separator. I believe it's meant to be used like tail -F "$1" | ratelimit.sh --timeout=0.5 --lines=5 | while IFS= read -r -d $'\x02' buffer; do curl ... --data "$buffer"; done. It roughly works like this:

# Written by Kamil Cukrowski (C) 2020
# Licensed jointly under MIT and Beerware license
# config
maxtimeoutns=$((2 * 1000 * 1000 * 1000))
maxlines=5 
input_separator=$'\n'
output_separator=$'\x02'

# the script
timeout_arg=()
while true; do
    chunk=""
    lines=0
    start=$(date +%s%N)
    stop=$((start + maxtimeoutns))

    while true; do

        if [ "$maxtimeoutns" != 0 ]; then
            now=$(date +%s%N)
            if (( now >= stop )); then
                break
            fi
            timeout=$(( stop - now ))
            # convert the remaining nanoseconds to fractional seconds for read -t
            timeout=$(awk -v a="$timeout" -v b=1000000000 'BEGIN { printf "%f", a/b }')
            timeout_arg=(-t "$timeout")
        fi


        IFS= read -rd "$input_separator" "${timeout_arg[@]}" line && ret=$? || ret=$?

        if (( ret == 0 )); then

            # read succeeded
            chunk+=$line$'\n'

            if (( maxlines != 0 )); then
                lines=$((lines + 1))
                if (( lines >= maxlines )); then
                    break
                fi
            fi

        elif (( ret > 128 )); then
            # read timed out
            break
        fi
    done

    if (( ${#chunk} != 0 )); then
        printf "%s%s" "$chunk" "$output_separator"
    fi

done

Thanks to KamilCuk's answer, I managed to achieve what I wanted in a rather simple way, combining a maximum number of lines and a timeout. The trick was to discover that the piping doesn't necessarily work line by line, like I thought it did... silly me!

Just for future reference, this is my solution, which is very specific and simplified to the bone:

#!/bin/bash
# sends updates to $1 via curl every 15 seconds or every 100 lines
tail -F "$1" | while true; do

    chunk=""
    stop=$(( $(date +%s) + 15 ))
    maxlines=100

    while true; do

        if (( $(date +%s) >= stop )); then break; fi

        IFS= read -r -t 15 line && ret=$? || ret=$?
        if (( ret == 0 )); then

            chunk+=$line$'\n'
            maxlines=$((maxlines - 1))
            if (( maxlines == 0 )); then break; fi

        elif (( ret > 128 )); then break; fi

    done

    if (( ${#chunk} != 0 )); then
        curl http://service.com --data "$chunk"
    fi

done
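Assuming the script is saved as, say, watchpost.sh (the name is arbitrary), it would be run as:

chmod +x watchpost.sh
./watchpost.sh /var/log/app.log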
