How do I extract parent and child data from a text file that isn't tabular?

I have a UTF-8 encoded text file containing report output that I would like to get into a data frame. The problem is that the data is not tabular: it consists of parent lines, child lines, page headers and so on.

This is an example of the file layout; the full file has 2,000 or more records:

ACME LTD (SP)                       Report for Mexico                       Time 14:18:11     Date  04082019                                                                                    
Mexico                                                                                     *********/JOEOD Page           1                                                                                 

Cnno        Acct no         Tax number                  Address                                     

1       ABC3415         899111752                   Kellys Hair ONE ST JOHNS CHURCHYARD ED45 8LP LONDON                                     

PstDte          Docno           DocDte      Reference no            ClgDte  WT  code        Invoice amnt       Base amount   tax     Net amount  T  x-exempt amt    

    tax type:                       W1      tax code:                   WA                      

80519           5100002076          70519       20006874            50719   WA          1156961002  1156961003  76311439    1156961002  -1  
10619           5100002673          70519       20007095            50719   WA          2147567637  2147567637  144956394   2147567637  0   
******                                              WA          3304528639  330452864   221267833   3304528639  -1  

                                                ** ****         3304528639  330452864   221267833   3304528639  -1  


2       BFG4919         7880487069                  SPA LTD OHNSON HOUSE GREENBY SQHH1 3DF READING                                      

    tax type:                       W1      tax code:                   WA                      

30619           5100002672          30619       90331014            20719   WA          2260302 1883585 1260708 1883585 376717  
30619           5100002681          30619       90331015            20719   WA          73519295    61266079    4100618 61266079    12253216    
10719           5100002679          30619       90331016            20719   WA          105593207   87994339    5719633 87994339    17598868    
10719           5100002680          30619       90331017            20719   WA          82808594    69007162    4485466 69007162    13801432    
10719           5100003245          10719       90332783            300719  WA          80358636    6696553 4447229 6696553 13393106    
10719           5100003246          10719       90332782            300719  WA          102408262   85340218    5667505 85340218    17068044    
10719           5100003247          10719       90332781            300719  WA          73498752    6124896 4067587 6124896 12249792    
10719           5100003248          10719       90332780            300719  WA          22784614    18987178    1260952 18987178    3797436 
******                                              WA          56357438    469645316   31009698    469645316   93929064    

                                                ** ****         56357438    469645316   31009698    469645316   93929064    


3       KLU5437         6781754415                  BIRDS SERVICES LIMITED GREEN HOUSE REDCAR INDUSTEC4L 4HJ LONDON                                     

    tax type:                       CS      tax code:                   CS                      

110619          5100002956          120619      1975674         90719   CS          1839932 17523288    91166   17523288    876032  
10719           5100003373          120619      1975677         120719  CS          78940756    705990901   35886346    754108083   83416659    
10719           5100003391          120619      1975675         120719  CS          643442103   61280197    31149443    61280197    30640133    
******                                              CS          1451248983  1336316159  67947449    1384433341  114932824   

    tax type:                       W1      tax code:                   WA                      

110619          5100002956          120619      1975674         90719   WA          1839932 17523288    1185159 17523288    876032  
10719           5100003373          120619      1975677         120719  WA          78940756    754108084   49831859    754108083   35299476    
10719           5100003389          60619       1975671         120719  WA          368898403   368898403   24377001    368898403   0   
10719           5100003391          120619      1975675         120719  WA          643442103   61280197    40494277    61280197    30640133    
10719           5100003394          110619      1975678         120719  WA          1421290282  1421290283  93919609    1421290282  -1  
10719           5100003513          120619      1975676         190719  WA          172718664   172718664   11434027    172718664   0   
10719           5100003626          210619      1975693         260719  WA          276901444   25751819    17101966    276901444   19383254    
******                                              WA          3691057776  3604858882  238343898   3624242134  86198894    

    tax type:                       X1      tax code:                   XA                      

110619          5100002956          120619      1975674         90719   XA          1839932 17523288    91167   17523288    876032  
10719           5100003373          120619      1975677         120719  XA          78940756    754108084   383322  754108083   35299476    
10719           5100003389          60619       1975671         120719  XA          368898403   368898403   1875154 368898403   0   
10719           5100003391          120619      1975675         120719  XA          643442103   61280197    3114945 61280197    30640133    
10719           5100003394          110619      1975678         120719  XA          1421290282  1421290283  7224586 1421290282  -1  
10719           5100003513          120619      1975676         190719  XA          172718664   172718664   879541  172718664   0   
10719           5100003626          210619      1975693         260719  XA          276901444   25751819    1315536 276901444   19383254    
******                                              XA          3691057776  3604858882  18334149    3624242134  86198894    
ACME LTD (SP)                       Report for Mexico                       Time 14:18:11     Date  04082019                                                                                    
Mexico                                                                                     *********/JOEOD Page           2                                                                                     
Cnno        Acct no         Tax number                  Address                                     

3       KLU5437         6781754415                  BIRDS SERVICES LIMITED GREEN HOUSE REDCAR INDUSTEC4L 4HJ LONDON                                     

PstDte          Docno           DocDte      Reference no            ClgDte  WT  code        Invoice amnt       Base amount   Withholdtax     Net amount  T  x-exempt amt    


                                                ** ****         3691057776  8546033923  324625496   3624242134  -4854976147 


4       KLD15935            837960557                   BOJACK GROUP LTD HORSEMAN HOUSE SHADWELLGH12 3BB ABERDEEN                                       

    tax type:                       W1      tax code:                   WA                      

10719           5100003296          290519      82620012754         90719   WA          6863606446  6863606446  443122606   6863606446  0   
10719           5100003654          210619      82620013425         260719  WA          5854587092  585458709   381911219   5854587092  2   
******                                              WA          12718193538 12718193536 825033825   12718193538 2   

                                                ** ****         12718193538 12718193536 825033825   12718193538 2   


5       HDH943859                               Rover Energy Schweiz AG SWIZSTRASSE 345 1005 ZURICH                                     

    tax type:                       W1      tax code:                   WA                      

10719           5100003613          20419       2963427         260719  WA          2893481234  2893481234  190177614   2893481234  0   
10719           5100003614          20419       2963426         260719  WA          2893481234  2893481234  190177614   2893481234  0   
******                                              WA          5786962468  5786962468  380355228   5786962468  0   

                                                ** ****         5786962468  5786962468  380355228   5786962468  0   

I want to format the data into the following flat structure:

Cnno, Acct no, Tax number, Address, PstDte, Docno, DocDte, Reference no, clg date,tax type, WT code, Invoice amnt,Base amount,tax,Net amount,T  x-exempt amt

Frankly, apart from loading the data into a dataframe and removing the blank rows, I have not got far. I have looked but can't find any similar examples, so links to tutorials dealing with similar data extraction problems would be great, and any ideas on how to tackle it would be a start.

After looking at it some more, the approach I took to clean this up is as follows.

Load the file into a df. There are no headings, so the columns are just 0, 1, 2 and so on, with lots of NaN values.

Remove any rows which are all NaN

df2 = df.dropna(axis = 0, how ='all').copy()
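
As a minimal sketch of what this step does, the toy frame below stands in for the loaded report (its contents are made up for illustration). With axis=0 and how='all', only rows in which every cell is NaN are dropped; partially filled rows survive.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loaded report: the middle row is completely empty.
df = pd.DataFrame({
    0: ["ACME LTD (SP)", np.nan, np.nan],
    1: [np.nan, np.nan, "1"],
    2: [np.nan, np.nan, "ABC3415"],
})

# axis=0 works on rows; how='all' drops a row only if every cell is NaN.
df2 = df.dropna(axis=0, how="all").copy()
print(df2)
```

The original index labels are kept, which is why later output in this post shows gaps in the row numbers.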

I wanted to keep the company name but none of the other header data, such as the report title or the country. So I split the string to remove the text I didn't want, created a mask for the rows containing Mexico, and filtered the df to remove them

df2[0] = df2[0].str.split('  ').str[0]
mask = (df2[0] == 'Mexico')
df3 = df2[~mask].copy()
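
A small sketch of the split-and-mask step on made-up header rows, assuming the fields in column 0 are separated by at least two spaces as in the sample above:

```python
import pandas as pd

# Toy rows: a company header, a country header, and a data row with no text in column 0.
df2 = pd.DataFrame({
    0: ["ACME LTD (SP)    Report for Mexico", "Mexico    Page 1", None],
    1: [None, None, "1"],
})

# Keep only the text before the first double space.
df2[0] = df2[0].str.split("  ").str[0]

# Rows whose first field is exactly 'Mexico' are page headers to discard.
mask = df2[0] == "Mexico"
df3 = df2[~mask].copy()
print(df3)
```

Rows where column 0 is missing compare as False against 'Mexico', so they are kept.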

Then I used ffill to copy the company name to each row of the df (there are multiple company names; the report lists all the records for one company, then the next, and so on)

df3[0] = df3[0].ffill()

Column [1] contains the data for the parent record Cnno and the child record PstDte. These are numerics stored as text, so I filtered this column using to_numeric. This removes all the heading and page number rows that are repeated throughout the data, leaving just the parent and child rows.

df4 = df3[df3[[1]].apply(pd.to_numeric, errors='coerce').notnull().all(axis=1)].copy()
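
The filter can be illustrated on a single toy column. With errors='coerce', anything that is not a number becomes NaN, so notnull() marks only the parent and child rows. The values below are invented to mimic the report.

```python
import pandas as pd

# Column 1 mixes headings, subtotals and the numeric keys we want to keep.
df3 = pd.DataFrame({1: ["Cnno", "1", "PstDte", "80519", "******", "tax type:"]})

# Non-numeric text is coerced to NaN instead of raising an error.
numeric = pd.to_numeric(df3[1], errors="coerce")

# Keep only the rows where the conversion succeeded.
df4 = df3[numeric.notnull()].copy()
print(df4)
```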

Then I created a new column 'Cnno' and populated it using

df4.loc[pd.to_numeric(df4[1]) < 9999, 'Cnno'] = df4[1]

Both Cnno and PstDte are numbers, but as PstDte is a 'date' its minimum length is 5, and Cnno is never longer than 4 digits, so I could use that to separate the parent rows from the child rows.

As each parent row is followed by its children in the dataframe, I could use ffill on 'Cnno' to copy the parent Cnno down to its children and associate the records

df4['Cnno'] = df4['Cnno'].ffill()

I then created a Parent column to identify the parent records (not strictly necessary)

df4['Parent'] = (pd.to_numeric(df4[1]) < 9999).astype(int)
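
Putting the last few steps together on toy data: the sketch below assumes column 1 still holds text, so it is converted with pd.to_numeric before the comparison. The helper names key and is_parent are introduced here for illustration only.

```python
import pandas as pd

# Column 1 after filtering: parent keys (short) followed by child posting dates (5+ digits).
df4 = pd.DataFrame({1: ["1", "80519", "10619", "2", "30619"]})

# Convert the text to numbers once so the size comparison is numeric.
key = pd.to_numeric(df4[1])
is_parent = key < 9999

# Copy the key into 'Cnno' on parent rows only; child rows start out as NaN.
df4.loc[is_parent, "Cnno"] = df4.loc[is_parent, 1]

# Forward fill so each child row inherits the Cnno of the parent above it.
df4["Cnno"] = df4["Cnno"].ffill()

# Flag the parent rows with 1 and the child rows with 0.
df4["Parent"] = is_parent.astype(int)
print(df4)
```

This relies on the rows staying in file order, since ffill only looks at the nearest non-empty value above.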

Then I filtered on the Parent column and copied the data to a new df, removed any empty columns, dropped the old Cnno data in column [1] and added new column headings for the rest. As the parent row is repeated whenever there is a new page in the original file, there were multiple rows of the same data, so I dropped duplicates, keeping only the first

Parent = df4[df4['Parent'] == 1].copy()
Parent = Parent.dropna(axis=1, how='all')
Parent = Parent.drop(Parent.columns[1] , axis=1)
Parent.columns = ['Company','Account No','Tax Code','Vendor Address','Cnno','Parent']
Parent.drop_duplicates(keep='first', inplace=True)
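
To show why the de-duplication is needed, the toy frame below repeats one parent row, as happens when a record continues onto a new page. The index values are arbitrary and only illustrate that duplicates are detected by column values, not by row label.

```python
import pandas as pd

# Parent 3 appears twice because its header was reprinted after a page break.
parent = pd.DataFrame(
    {
        "Company": ["ACME LTD (SP)"] * 3,
        "Account No": ["KLU5437", "KLU5437", "KLD15935"],
        "Cnno": ["3", "3", "4"],
    },
    index=[40, 78, 86],
)

# Rows with identical values in every column collapse to the first occurrence.
parent = parent.drop_duplicates(keep="first")
print(parent)
```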

This gives a clean df of just the parent records

  Company, Account No, Tax Code, Vendor Address, Cnno, Parent
5 ACME Ltd, ABC3415, 899111752, Kellys Hair ONE ST JOHNS CHURCHYARD ED45 8LP LONDON, 1, 1 
18 ACME Ltd, BFG4919, 7880487069, SPA LTD OHNSON HOUSE GREENBY SQHH1 3DF READING, 2, 1 

I then did basically the same with the child records

Children = df4[df4['Parent'] != 1].copy()
Children = Children.dropna(axis=1, how='all')
Children.columns = ['Company','PstDte', 'DocNo','DocDte','Reference no','ClgDte','WT code','Invoice amnt','Base amount','tax','Net amount','T x-exempt amt','Cnno','Parent']

This gave me a clean df of all the child records. I then merged the parent and child records, using Company and Cnno as the key

Final = pd.merge(Parent, Children,  how='left', left_on=['Company','Cnno'], right_on = ['Company','Cnno'])
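
A cut-down sketch of the merge with only a few of the columns, using values taken from the sample above. Because both frames use the same key names, on= is equivalent to passing the same list as left_on and right_on.

```python
import pandas as pd

parent = pd.DataFrame({
    "Company": ["ACME LTD (SP)"] * 2,
    "Cnno": ["1", "2"],
    "Account No": ["ABC3415", "BFG4919"],
})
children = pd.DataFrame({
    "Company": ["ACME LTD (SP)"] * 3,
    "Cnno": ["1", "1", "2"],
    "DocNo": ["5100002076", "5100002673", "5100002672"],
})

# A left merge repeats the parent columns once for every matching child row.
final = pd.merge(parent, children, how="left", on=["Company", "Cnno"])
print(final)
```

With how='left', a parent that has no children would still appear once, with NaN in the child columns.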

After that it was just a case of formatting each of the date columns and tidying up any other formatting, dtypes etc.

# Zero-pad to six digits first so that e.g. 10719 is read as 01-07-19 rather than 10-7-19
Final['PstDte'] = pd.to_datetime(Final['PstDte'].astype(str).str.zfill(6), format='%d%m%y')
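
One thing to watch with these dates: the report drops the leading zero of the day, so a value such as 10619 is ambiguous to a '%d%m%y' parser. The sketch below assumes the dates are ddmmyy with a two-digit month (which matches values like 300719 in the sample) and pads them to six digits before parsing.

```python
import pandas as pd

# Posting dates as they appear in the report, stored as text.
pst = pd.Series(["80519", "10619", "110619"])

# Restore the dropped leading zero, then parse as day, month, two-digit year.
dates = pd.to_datetime(pst.str.zfill(6), format="%d%m%y")
print(dates)
```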
