
Parsing mixed flat file data to write into xls using Python

I have a complex flat file containing a large amount of mixed-type data. I am trying to parse it with Python (the language I know best) and have managed to segregate the data by category using manual parsing.

Now I am stuck: I have extracted the data and need to make it tabular so that I can write it to an xls file, using pandas or any other library.

I have pasted the data on pastebin: https://pastebin.com/qn9J5nUL

The data comes in both non-tabular and tabular formats. I need to discard the non-tabular data and write only the tabular data to xls. To be precise, I want to delete the data below:

ABC Command-----UIP BLOCK:; SE: ABC_UIOP_89TP Report: +ve ABC_UIOP_89TP 2016-09-23 15:16:14 O&M #998459350 %%/*Web=1571835373:;%% ID = 0 Result Ok.

and write only data in the format below to xls (an example, not exact; please refer to the pastebin URL for the complete data format):

Local Info ID  ID Name           ID Frequency           ID Data                My ID                  

0              XXX_1               0                       12                    13                        

Since your data file follows a consistent pattern, I think you can do it this way.

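import pandas as pd

s = []  # line indices of table headers (lines containing 'Local')
e = []  # line indices of the '(Number of results' lines that end each table
with open('data_to_be_parsed.txt') as f:
    datafile = f.readlines()

for idx, line in enumerate(datafile):
    if 'Local' in line:
        s.append(idx)
    if '(Number of results' in line:
        e.append(idx)

frames = []
for start, end in zip(s, e):
    # Columns are separated by two or more spaces; drop the empty pieces
    head = [x.strip() for x in datafile[start].split("  ") if x.strip()]
    rows = []
    for da in datafile[start + 1:end]:
        if da.strip():  # skip blank lines inside the table
            data = [x.strip() for x in da.split("  ") if x.strip()]
            rows.append(dict(zip(head, data)))
    frames.append(pd.DataFrame(rows, columns=head))

# DataFrame.append was removed in pandas 2.0, so build each frame from a
# list of rows and concatenate all frames once at the end
maindf = pd.concat(frames, ignore_index=True)
maindf.to_excel("output.xlsx")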
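
To check the start/end marker idea without the real file, here is a minimal standard-library sketch of the same approach run on a few in-memory lines. The function name `extract_tables` and the sample lines are hypothetical; in particular, the exact text of the closing line (`(Number of results = 1)`) is an assumption based on the substring the answer searches for, so adjust it to whatever the pastebin data actually contains.

```python
import re


def extract_tables(lines, start_marker="Local", end_marker="(Number of results"):
    """Return one list of row dicts per table found in `lines`.

    Assumes a table starts at a line containing `start_marker`, ends at a
    line containing `end_marker`, and separates columns by 2+ spaces.
    """
    tables = []
    head = None  # header of the table being read, or None outside a table
    rows = []
    for line in lines:
        if head is None:
            if start_marker in line:
                head = re.split(r"\s{2,}", line.strip())
                rows = []
        elif end_marker in line:
            tables.append(rows)
            head = None
        elif line.strip():  # skip blank lines inside a table
            values = re.split(r"\s{2,}", line.strip())
            rows.append(dict(zip(head, values)))
    return tables


# Hypothetical sample: two non-tabular lines followed by one small table.
sample = [
    "ABC Command-----UIP BLOCK:;",
    "Result Ok.",
    "Local Info ID  ID Name           ID Frequency           ID Data                My ID",
    "",
    "0              XXX_1               0                       12                    13",
    "(Number of results = 1)",
]
tables = extract_tables(sample)
print(tables[0][0])
```

Each inner list can then be passed to `pd.DataFrame(rows)` and written out with `to_excel`, as in the answer above. Non-tabular lines are dropped simply because they fall outside any start/end pair.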
