
Python3 Concatenate multiple files and ignore header & trailer records

I want to concatenate multiple files, skipping the header and trailer records in each file, and have the column names (always on the 2nd line of each file) occur only once in the final file.

I am able to concatenate, but how do I skip the header and trailer and retain the column names only once? Each file has about 25 million records.

 File1.txt

    H,ABC,file1.txt
    Name,address,zipcode
    Rick,ABC,123
    Tom,XYZ,456
    T,2  -----------------record count

 File2.txt

    H,ABC,file2.txt
    Name,address,zipcode
    Jerry,ABC,123
    T,1


 File3.txt

    H,ABC,file3.txt
    Name,address,zipcode
    John,ABC,123
    Mike,XYZ,456
    T,2

 ***Final Output:***

    Name,address,zipcode
    Rick,ABC,123
    Tom,XYZ,456
    Jerry,ABC,123
    Harry,XYZ,456
    John,ABC,123
    Mike,XYZ,456

Code:

filenames = ['File1.txt', 'File2.txt', 'file3.txt']
with open('output_file', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            for line in infile:
                outfile.write(line)
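The trailer can be dropped without knowing the file length by buffering one line at a time: a line is only written once the next one has been read, so the last line (the `T,n` trailer) is never emitted. A minimal sketch of that idea (`concat_skip` is a hypothetical helper name, not from the question):

```python
def concat_skip(filenames, outpath):
    """Concatenate files, writing the column names (line 2) from the
    first file only, and dropping each file's "H,..." header (line 1)
    and "T,n" trailer (last line)."""
    with open(outpath, "w") as outfile:
        for i, fname in enumerate(filenames):
            with open(fname) as infile:
                prev = None  # one-line buffer; lets us withhold the trailer
                for lineno, line in enumerate(infile):
                    if lineno == 0:
                        continue          # skip the "H,..." header
                    if lineno == 1:
                        if i == 0:
                            outfile.write(line)  # column names, once
                        continue
                    if prev is not None:
                        outfile.write(prev)
                    prev = line
                # prev now holds the "T,n" trailer; discard it
```

Because only one line per file is buffered, memory stays flat even at 25 million records per file.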

Using Python:

Here is a really simple method which uses pandas.read_csv to read and concatenate your TXT files, then writes a single TXT file using pandas.DataFrame.to_csv.

import pandas as pd
from glob import glob

files = glob('./addr_files/*.txt')

# skiprows=1 drops the "H,..." record so the column names on line 2
# become the header; skipfooter=1 drops the "T,n" trailer (and requires
# the python engine).  DataFrame.append was removed in pandas 2.0, so
# build the frame with pd.concat instead.
df = pd.concat(
    (pd.read_csv(f, skiprows=1, skipfooter=1, engine='python') for f in files),
    ignore_index=True,
)

df.to_csv('./addr_files/output.txt', index=False)

Output:

(py35) ~/Desktop/so/addr_files
$ cat output.txt
Name,address,zipcode
Rick,ABC,123
Tom,XYZ,456
Jerry,ABC,123
Harry,XYZ,456
John,ABC,123
Mike,XYZ,456

Using GNU sed:

Here is another option which streams the content of each file named file*.txt into a new file (all.txt), skipping the rows you want to drop; specifically the 1st, 2nd and last.

Given your files are so large, you might want to add a couple of printf statements for debugging, so you can see which file is being processed as the script loops over the files.

#!/usr/bin/env bash

# Print the header to the output file.
sed -n 2p file1.txt > all.txt

# Stream (specific) content of all files to the output file.
# (Glob directly rather than parsing the output of ls.)
for f in file*.txt; do sed '1d;2d;$d' "$f" >> all.txt; done

Output:

(base) user@host ~/Desktop/so/concat                                                                             
$ cat all.txt
Name,address,zipcode
Rick,ABC,123
Tom,XYZ,456
Jerry,ABC,123
Harry,XYZ,456
John,ABC,123
Mike,XYZ,456
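Following the printf suggestion above, a self-contained demo of the loop with progress logging (tiny sample files are fabricated here so the script runs on its own):

```shell
# Demo in a throwaway directory with two small sample files.
cd "$(mktemp -d)"
printf 'H,ABC,file1.txt\nName,address,zipcode\nRick,ABC,123\nT,1\n'  > file1.txt
printf 'H,ABC,file2.txt\nName,address,zipcode\nJerry,ABC,123\nT,1\n' > file2.txt

# Print the column names once, then loop the files, logging progress
# to stderr and appending each file's data rows (drop lines 1, 2, last).
sed -n 2p file1.txt > all.txt
for f in file*.txt; do
    printf 'processing %s\n' "$f" >&2
    sed '1d;2d;$d' "$f" >> all.txt
done
cat all.txt
```

Logging to stderr keeps the progress messages out of any redirected output.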

1) You can modify what you have done a little, as follows:

filenames = ['File1.txt', 'File2.txt', 'file3.txt']
with open('output_file', 'w') as outfile:
    outfile.write("Name,address,zipcode\n")
    for fname in filenames:
        with open(fname) as infile:
            for line in infile:
                # Skip "H,..." headers, "T,n" trailers and the repeated
                # column-name line; everything else is a data row.
                if not (line.startswith("H,") or line.startswith("T,")
                        or line.startswith("Name,address,zipcode")):
                    outfile.write(line)

2) Alternatively, if you are familiar with the grep command in unix, you can use it. You can use it directly in Python with the sh library, chaining the commands.
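A sketch of that idea using the standard-library subprocess instead of sh (in case sh is not installed); `grep_concat` is a made-up helper name, and it assumes a Unix grep on PATH:

```python
import subprocess

def grep_concat(filenames, outpath):
    """Write the column names once, then stream grep-filtered data rows.

    grep drops "H,..." headers, "T,n" trailers and the repeated
    column-name lines; -h suppresses filename prefixes."""
    with open(outpath, "w") as out:
        out.write("Name,address,zipcode\n")
        out.flush()  # ensure the header lands before grep appends its output
        subprocess.run(
            ["grep", "-hv",
             "-e", "^H,", "-e", "^T,", "-e", "^Name,address,zipcode",
             *filenames],
            stdout=out, check=True,
        )
```

Passing the open file as stdout lets grep write directly to disk, so the 25-million-row files are never held in Python's memory.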

The python way

Concatenate iterators so you don't have problems with huge files.

import os
import fnmatch
from itertools import filterfalse
import csv

def get_files(pattern, path):
    """
    Get all files from path that match pattern
    """
    for path, _, filelist in os.walk(path):
        for name in fnmatch.filter(filelist, pattern):
            yield os.path.join(path, name)

def open_files(filenames):
    """
    Open all files that match pattern
    """
    for filename in filenames:
        # "with" guarantees the file is closed even if the consumer
        # abandons the generator early.
        with open(filename, newline="") as file:
            yield file

def get_csv(files):
    """
    Return csv reader for files
    """
    for file in files:
        lines = filterfalse(lambda line: line.startswith("H,") or line.startswith("T,"), file)
        reader = csv.DictReader(lines, delimiter=",")
        yield reader

def concatenate(iterators):
    """
    Concatenate iterators into a single sequence
    """
    for it in iterators:
        yield from it

with open('output.txt', 'w', newline="") as output:
    filenames = get_files('*.txt', '.')
    files = open_files(filenames)
    csvs = get_csv(files)
    lines = concatenate(csvs)
    
    fieldnames = ["Name","address","zipcode"]
    writer = csv.DictWriter(output, fieldnames=fieldnames)
    
    writer.writeheader()
    for line in lines:
        writer.writerow(line)
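As an aside, the concatenate() helper above duplicates what itertools.chain.from_iterable already provides in the standard library:

```python
from itertools import chain

# chain.from_iterable lazily flattens an iterable of iterators,
# just like the concatenate() generator: nothing is loaded up front.
merged = chain.from_iterable([[1, 2], [3], [4, 5]])
print(list(merged))  # [1, 2, 3, 4, 5]
```

Swapping it in would let the pipeline drop one of the hand-written generators.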
