Convert a fixed-width file from text to CSV

I have a large data file in text format, and I want to convert it to CSV by specifying the length of each column.

Number of columns = 5

Column lengths

[4 2 5 1 1]

Sample observations:

aasdfh9013512
ajshdj 2445df

Expected output

aasd,fh,90135,1,2
ajsh,dj, 2445,d,f

GNU awk (gawk) supports this directly through the FIELDWIDTHS variable, for example:

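gawk '{$1=$1}1' FIELDWIDTHS='4 2 5 1 1' OFS=, infile

Assigning $1=$1 inside an action makes gawk rebuild the record with OFS, and the trailing 1 prints every line, including lines whose first field is empty or numerically zero.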

Output:

aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
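
For comparison, a declarative list of widths can be used in Python as well. The following is a minimal sketch that is not part of the original answer: it uses the standard struct module, the helper name split_fixed is made up for illustration, and it assumes ASCII data in which every line is exactly as long as the widths add up to.

```python
import struct

def split_fixed(line, widths):
    """Split an ASCII line into fields of the given byte widths."""
    # Build a format such as "4s2s5s1s1s": one fixed-size byte string per column.
    fmt = "".join(f"{w}s" for w in widths)
    # struct.unpack raises struct.error if the line length differs from sum(widths).
    parts = struct.unpack(fmt, line.encode("ascii"))
    return [p.decode("ascii") for p in parts]

widths = [4, 2, 5, 1, 1]
for line in ["aasdfh9013512", "ajshdj 2445df"]:
    print(",".join(split_fixed(line, widths)))
```

Unlike FIELDWIDTHS, this sketch rejects short or long lines instead of tolerating them, which may or may not be what you want.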

I would use sed and capture groups of the given lengths:

$ sed -r 's/^(.{4})(.{2})(.{5})(.{1})(.{1})$/\1,\2,\3,\4,\5/' file
aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
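
The same capture-group idea can be generated from the width list rather than typed by hand. This is only a sketch in Python, not part of the original answer; the function name widths_to_regex is hypothetical.

```python
import re

def widths_to_regex(widths):
    # One capture group per column, e.g. ^(.{4})(.{2})(.{5})(.{1})(.{1})$
    return re.compile("^" + "".join(f"(.{{{w}}})" for w in widths) + "$")

pattern = widths_to_regex([4, 2, 5, 1, 1])
for line in ["aasdfh9013512", "ajshdj 2445df"]:
    m = pattern.match(line)
    # Lines whose length does not match the total width are printed unchanged,
    # which mirrors what the sed substitution does when it fails to match.
    print(",".join(m.groups()) if m else line)
```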

Here is a solution that works with regular awk (gawk is not required).

awk -v OFS=',' '{print substr($0,1,4), substr($0,5,2), substr($0,7,5), substr($0,12,1), substr($0,13,1)}'

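It uses awk's substr function to specify the start position and length of each field. OFS sets the output field separator (a comma in this case).

(Side note: this only works if the source data contains no commas. If the data does contain commas, they have to be escaped to produce valid CSV, which is beyond the scope of this question.)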

Demo:

echo 'aasdfh9013512
ajshdj 2445df' | 
awk -v OFS=',' '{print substr($0,1,4), substr($0,5,2), substr($0,7,5), substr($0,12,1), substr($0,13,1)}'

Output:

aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
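
Regarding the side note about commas in the data: one way to get proper quoting is to hand the fields to a CSV writer instead of joining them manually. The sketch below uses Python's standard csv module and is not part of the original answer; the function name fixed_to_csv is hypothetical, and the second input line is invented to show a field that contains a comma.

```python
import csv
import io
from itertools import accumulate

def fixed_to_csv(lines, widths, out):
    # Start offset of each column: 0, 4, 6, 11, 12 for widths 4 2 5 1 1.
    starts = [0] + list(accumulate(widths))[:-1]
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        line = line.rstrip("\n")
        # csv.writer quotes any field that contains the delimiter or a quote.
        writer.writerow([line[s:s + w] for s, w in zip(starts, widths)])

buf = io.StringIO()
# The second line is made up to show a field that contains a comma.
fixed_to_csv(["aasdfh9013512\n", "a,sdfh9013512\n"], [4, 2, 5, 1, 1], buf)
print(buf.getvalue(), end="")
```

With the default quoting mode, the first field of the second line is written as "a,sd" in double quotes, so a CSV reader still sees five columns.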

If anyone is still looking for a solution, I have written a small script in Python. It is easy to use, as long as you have Python 3.5.

https://github.com/just10minutes/FixedWidthToDelimited/blob/master/FixedWidthToDelimiter.py

"""
This script converts a fixed-width file into a delimited file. Tested on Python 3.5 only.
Sample run (the order of arguments doesn't matter):
python ConvertFixedToDelimiter.py -i SrcFile.txt -o TrgFile.txt -c Config.txt -d "|"
Inputs are as follows:
1. Input file - Mandatory (argument -i) - the file that contains the fixed-width data
2. Config file - Optional (argument -c; if not provided, the script looks for Config.txt in the current directory, and will not run if it is missing)
    It should have the format
    FieldName,fieldLength
    e.g.:
    FirstName,10
    SecondName,8
    Address,30
    etc.
3. Output file - Optional (argument -o; if not provided, the input file name plus Delimited.txt is used)
4. Delimiter - Optional (argument -d; if not provided, the default is "|" (pipe))
"""
from collections import OrderedDict
import argparse
from argparse import ArgumentParser
import os.path
import sys


def slices(s, args):
    position = 0
    for length in args:
        length = int(length)
        yield s[position:position + length]
        position += length

def extant_file(x):
    """
    'Type' for argparse - checks that file exists but does not open.
    """
    if not os.path.exists(x):
        # Argparse uses the ArgumentTypeError to give a rejection message like:
        # error: argument input: x does not exist
        raise argparse.ArgumentTypeError("{0} does not exist".format(x))
    return x





parser = ArgumentParser(description="Please provide your Inputs as -i InputFile -o OutPutFile -c ConfigFile")
parser.add_argument("-i", dest="InputFile", required=True,    help="Provide your Input file name here, if file is on different path than where this script resides then provide full path of the file", metavar="FILE", type=extant_file)
parser.add_argument("-o", dest="OutputFile", required=False,    help="Provide your Output file name here, if file is on different path than where this script resides then provide full path of the file", metavar="FILE")
parser.add_argument("-c", dest="ConfigFile", required=False,   help="Provide your Config file name here,File should have value as fieldName,fieldLength. if file is on different path than where this script resides then provide full path of the file", metavar="FILE",type=extant_file)
parser.add_argument("-d", dest="Delimiter", required=False,   help="Provide the delimiter string you want",metavar="STRING", default="|")

args = parser.parse_args()

# Input file is mandatory
InputFile = args.InputFile
#Delimiter by default "|"
DELIMITER = args.Delimiter

#Output file checks
if args.OutputFile is None:
    OutputFile = str(InputFile) + "Delimited.txt"
    print("Setting output file as " + OutputFile)
else:
    OutputFile = args.OutputFile

#Config file check
if args.ConfigFile is None:
    if not os.path.exists("Config.txt"):
        print ("There is no Config File provided exiting the script")
        sys.exit()
    else:
        ConfigFile = "Config.txt"
        print ("Taking Config.txt file on this path as Default Config File")
else:
    ConfigFile = args.ConfigFile

fieldNames = []
fieldLength = []
myvars = OrderedDict()


with open(ConfigFile) as myfile:
    for line in myfile:
        name, var = line.partition(",")[::2]
        myvars[name.strip()] = int(var)
for key,value in myvars.items():
    fieldNames.append(key)
    fieldLength.append(value)

with open(OutputFile, 'w') as f1:
    fieldNames = DELIMITER.join(map(str, fieldNames))
    f1.write(fieldNames + "\n")
    with open(InputFile, 'r') as f:
        for line in f:
            # Strip the trailing newline so it cannot leak into the last field of a short line
            rec = list(slices(line.rstrip("\n"), fieldLength))
            myLine = DELIMITER.join(map(str, rec))
            f1.write(myLine + "\n")
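
To make the config format concrete, here is a small in-memory sketch of what the script does with the question's data. It is not part of the original script: the field names colA to colE are made up, and the config is held in a string rather than read from Config.txt.

```python
def slices(s, widths):
    # Same idea as the script's slices(): walk the line, cutting one field per width.
    position = 0
    for length in widths:
        yield s[position:position + int(length)]
        position += int(length)

# Hypothetical Config.txt contents for the question's data (field names are made up).
config_text = "colA,4\ncolB,2\ncolC,5\ncolD,1\ncolE,1\n"
names, widths = [], []
for row in config_text.splitlines():
    name, _, length = row.partition(",")
    names.append(name.strip())
    widths.append(int(length))

# Header line first, then one delimited line per input record.
print("|".join(names))
for line in ["aasdfh9013512", "ajshdj 2445df"]:
    print("|".join(slices(line, widths)))
```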

Portable awk

# Generate an awk script with the appropriate substr commands
$ cat cols
4
2
5
1
1
$ <cols awk '{ print "substr($0,"p","$1")"; cs+=$1; p=cs+1 }' p=1
substr($0,1,4)
substr($0,5,2)
substr($0,7,5)
substr($0,12,1)
substr($0,13,1)

# Combine lines and make it a valid awk-script
$ <cols awk '{ print "substr($0,"p","$1")"; cs+=$1; p=cs+1 }' p=1 |
  paste -sd, | sed 's/^/{ print /; s/$/ }/'
{ print substr($0,1,4),substr($0,5,2),substr($0,7,5),substr($0,12,1),substr($0,13,1) }

# Send this output to a file, e.g. /tmp/t.awk
# Now run it on the input file
$ <infile awk -f /tmp/t.awk
aasd fh 90135 1 2
ajsh dj  2445 d f

# With comma as the output separator
$ <infile awk -f /tmp/t.awk OFS=,
aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
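
The arithmetic behind the generated script is just a running sum of the widths. As a rough illustration, not part of the original answer, the same awk program text can be produced in Python; the function name awk_program is hypothetical.

```python
from itertools import accumulate

def awk_program(widths):
    # awk's substr() is 1-based, so each column starts at (sum of earlier widths) + 1.
    starts = [1] + [total + 1 for total in accumulate(widths)][:-1]
    calls = [f"substr($0,{s},{w})" for s, w in zip(starts, widths)]
    return "{ print " + ",".join(calls) + " }"

print(awk_program([4, 2, 5, 1, 1]))
```

For the widths 4 2 5 1 1 the start positions come out as 1, 5, 7, 12 and 13, matching the substr calls shown above.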

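Adding a generic awk approach here (an alternative to the FIELDWIDTHS option). It does not hard-code the substring positions; instead it inserts commas according to the column widths that the user supplies. It was written and tested in GNU awk. To use it, set the awk variable colLengh to the column widths (as in the OP's sample), separated by spaces.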

awk -v colLengh="4 2 5 1 1" '
BEGIN{
  num=split(colLengh,arr,OFS)
}
{
  j=sum=0
  # Insert a comma after every column except the last, so no trailing comma is added
  while(++j<num){
    if(length($0)>sum){
      sub("^.{"arr[j]+sum"}","&,")
    }
    sum+=arr[j]+1
  }
}
1
' Input_file

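Explanation: create an awk variable named colLengh that holds the column widths, from which the comma positions are derived. Then, in the BEGIN section, split it into the array arr, whose elements are those widths.

In the main block, first reset the variables j and sum to 0. Then run a while loop over the widths, starting from j=1. On each pass, sub replaces everything from the start of the current line up to the computed position with the matched text followed by a comma. This is done only if the current line is longer than sum; otherwise the substitution would be pointless, hence the extra check. For example, the pattern is ^.{4} on the first pass and ^.{7} on the second, because once the first comma is in place the next comma has to go after the 7th character, and so on. The variable sum keeps track of the widths consumed so far plus the commas already inserted. Finally, the 1 at the end of the program prints the line, whether it was edited or not.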
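
The running-offset logic described above can be restated in a few lines of Python. This is a sketch of the same technique rather than a translation of the answer's code; the function name insert_commas is hypothetical, and it inserts a comma after every column except the last so that the output has no trailing comma.

```python
def insert_commas(line, widths):
    offset = 0  # position in the (growing) line where the next comma goes
    # Only the first len(widths) - 1 columns are followed by a comma.
    for width in widths[:-1]:
        offset += width
        if len(line) <= offset:
            break  # line too short: nothing left to separate
        line = line[:offset] + "," + line[offset:]
        offset += 1  # account for the comma just inserted
    return line

for line in ["aasdfh9013512", "ajshdj 2445df"]:
    print(insert_commas(line, [4, 2, 5, 1, 1]))
```

For the first sample line the commas land at offsets 4, 7, 13 and 15 of the growing string, the same positions the regular expressions target.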

