Concatenating multiple files into a single file, but skipping content that has already been appended

Question

I have the following files with the following content (one line per file):

<189>162: CSR-1000V: *Sep 27 06:17:02: %LINEPROTO-5-UPDOWN: Line protocol on Interface Loopback317, changed state to up
<189>165: CSR-1000V: *Sep 27 06:17:07: %LINEPROTO-5-UPDOWN: Line protocol on Interface Loopback320, changed state to up
<189>164: CSR-1000V: *Sep 27 06:17:06: %LINEPROTO-5-UPDOWN: Line protocol on Interface Loopback319, changed state to up
<189>161: CSR-1000V: *Sep 27 06:16:59: %LINEPROTO-5-UPDOWN: Line protocol on Interface Loopback316, changed state to up
<189>163: CSR-1000V: *Sep 27 06:17:04: %LINEPROTO-5-UPDOWN: Line protocol on Interface Loopback…

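I want to write a Python script that appends these files to a single file (output.txt), but I am stuck: I am using a for loop, and the script keeps re-adding lines that are already in the output.

Any ideas?

Thanks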

Answer 1

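As you can see in the attachment, the flow is a data pipeline in Apache NiFi with an "ExecuteScript" processor, in which I run the Python code above. The problem I described is that lines already present in the file keep being added again.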

Answer 2

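There is more than one way to handle this, and which one fits depends on your environment:

First: read through the files in the directory and append their data to your output file. Then use pickle or json to store the names of the files you have already read in a dictionary and save it to disk. The next time your code gets called, load that record and skip every file listed in it. (PS: file handling like this is a typical use case for Python.)

Second: pass the newly created file to the script as an argument, if that works for you (I know nothing about apache-nifi).

Third: compare the incoming lines with the lines already in the output file, but this costs a lot of performance and can be quite unreliable.

Fourth: move the files you have already read into a subdirectory.

I would go with the first approach, because it is simple and straightforward.

Edit: I wrote a piece of code (untested); if it does not work out of the box, it should still be clear what to do.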
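For comparison, the third option (checking each line against what is already in the output file) might look roughly like the sketch below. The function name `append_new_lines` and its parameters are made up for illustration. It assumes the output file lives outside the source directory and that two identical lines are always duplicates, which is exactly where the unreliability comes from: two genuinely separate events with the same text would be collapsed into one.

```python
import os


def append_new_lines(directory, output_path):
    """Append lines from files in `directory` that are not yet in `output_path`.

    Returns the number of lines appended during this call.
    """
    # Collect the lines already present in the output (empty on the first run).
    seen = set()
    if os.path.exists(output_path):
        with open(output_path) as existing:
            seen = {line.rstrip("\n") for line in existing}

    added = 0
    with open(output_path, "a") as outfile:
        for filename in sorted(os.listdir(directory)):
            file_abs = os.path.join(directory, filename)
            if not os.path.isfile(file_abs):
                continue  # ignore subdirectories
            with open(file_abs) as src_file:
                for line in src_file:
                    line = line.rstrip("\n")
                    # Skip blank lines and anything already written.
                    if line and line not in seen:
                        outfile.write(line + "\n")
                        seen.add(line)
                        added += 1
    return added
```

Every run rereads the whole output file and every source file, so the cost grows with the size of the output.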

import json
import os

directory = "/home/adrian/from_hdfs/"
state_file = "data.txt"  # JSON record of the files already appended

# load the record written by earlier runs (it does not exist on the first run)
parsed = {}
if os.path.exists(state_file):
    with open(state_file) as json_file:
        parsed = json.load(json_file)


# open output file
with open("finalfile.txt", "a") as outfile:

    # loop through src directory
    for filename in os.listdir(directory):
        if filename in parsed:
            continue  # skip file if already read

        file_abs = os.path.join(directory, filename)

        # print("Reading file: " + file_abs)
        with open(file_abs, "r") as src_file:
            outfile.write(src_file.read())  # append data from src to dest
            parsed[filename] = 1


# save the updated record to the same file that is loaded at the top
with open(state_file, "w") as fp:
    json.dump(parsed, fp)
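
The fourth option avoids the bookkeeping file altogether. A rough sketch, assuming the script is allowed to move files inside the source directory and that the output file lives somewhere else; the function name `append_and_archive` and the subdirectory name `processed` are placeholders, not anything from the original setup:

```python
import os
import shutil


def append_and_archive(directory, output_path, done_subdir="processed"):
    """Append each regular file in `directory` to `output_path`, then move it
    into `directory/done_subdir` so the next run does not see it again.

    Returns the list of file names handled in this run.
    """
    done_dir = os.path.join(directory, done_subdir)
    if not os.path.isdir(done_dir):
        os.makedirs(done_dir)

    handled = []
    with open(output_path, "a") as outfile:
        for filename in sorted(os.listdir(directory)):
            file_abs = os.path.join(directory, filename)
            if not os.path.isfile(file_abs):
                continue  # this also skips the archive subdirectory itself
            with open(file_abs) as src_file:
                data = src_file.read()
            # Make sure the content of each source file ends with a newline.
            if data and not data.endswith("\n"):
                data += "\n"
            outfile.write(data)
            shutil.move(file_abs, os.path.join(done_dir, filename))
            handled.append(filename)
    return handled
```

A second call right after the first one returns an empty list, because the only entry left in the directory is the `processed` subdirectory.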

Answer 3

#CODE:

#!/usr/bin/python

import subprocess
import json
import os


subprocess.call('cd /home/adrian/from_hdfs; for f in *; do (cat "${f}"; echo) >> notfinal.txt; done', shell=True)  # =====> I am using this to generate "data.txt" from your example

directory = "/home/adrian/from_hdfs/"

parsed = {}
with open('/home/adrian/from_hdfs/notfinal.txt') as json_file:
    parsed = json.load(json_file)


#open output file
with open("finalfile.txt", "a") as outfile:

    #loop through src directory
    for filename in os.listdir(directory):
        if filename in parsed: 
            continue # skip file if already read

        file_abs = os.path.join(directory, filename)

        #print("Reading file: "+file_abs)
        with open(file_abs, "r") as src_file:
            outfile.write(src_file.read()) #append data from src to dest
            parsed[filename] = 1



with open('result.json', 'w') as fp:
    json.dump(parsed, fp)



Traceback (most recent call last):
  File "./script.py", line 14, in <module>
    parsed = json.load(json_file)
  File "/usr/lib/python2.7/json/__init__.py", line 291, in load
    **kw)
  File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
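
The traceback points at `json.load` being given `notfinal.txt`, which holds concatenated log lines rather than JSON. The file read at the top of the script is meant to be the bookkeeping record that `json.dump` writes at the bottom, so both places should use the same path, and the first run has to cope with that file not existing yet. A hedged sketch of a loader along those lines follows; the helper name `load_parsed` is made up, and treating an unreadable record as empty is an assumption that may not suit every setup, since it means all files get appended again.

```python
import json


def load_parsed(state_path):
    """Return the dict of already-processed file names stored at `state_path`.

    Falls back to an empty dict when the file is missing or is not valid JSON.
    """
    try:
        with open(state_path) as json_file:
            data = json.load(json_file)
    except (IOError, ValueError):
        # IOError: the file is missing or unreadable.
        # ValueError: the content is not JSON (in Python 3 the decoder raises
        # json.JSONDecodeError, which is a subclass of ValueError).
        return {}
    # Only a JSON object can map file names to flags; ignore anything else.
    return data if isinstance(data, dict) else {}
```

With this in place, `parsed = load_parsed("result.json")` at the top of the script would pair with the `json.dump(parsed, fp)` call at the bottom.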

3 answers

Answer 1 0 2019-09-27 07:13:06

Answer 2 0 2019-09-27 07:31:49

Answer 3 0 2019-09-27 08:25:19
