How do I extract parent and child data from a text file that isn't tabular?
I have a UTF-8 encoded text file containing report output that I would like to get into a data frame. The problem is that the data is not tabular: it consists of parent lines, child lines, page headers, etc.
This is an example of the file layout; there are approximately 2000+ records in the full file:
ACME LTD (SP) Report for Mexico Time 14:18:11 Date 04082019
Mexico *********/JOEOD Page 1
Cnno Acct no Tax number Address
1 ABC3415 899111752 Kellys Hair ONE ST JOHNS CHURCHYARD ED45 8LP LONDON
PstDte Docno DocDte Reference no ClgDte WT code Invoice amnt Base amount tax Net amount T x-exempt amt
tax type: W1 tax code: WA
80519 5100002076 70519 20006874 50719 WA 1156961002 1156961003 76311439 1156961002 -1
10619 5100002673 70519 20007095 50719 WA 2147567637 2147567637 144956394 2147567637 0
****** WA 3304528639 330452864 221267833 3304528639 -1
** **** 3304528639 330452864 221267833 3304528639 -1
2 BFG4919 7880487069 SPA LTD OHNSON HOUSE GREENBY SQHH1 3DF READING
tax type: W1 tax code: WA
30619 5100002672 30619 90331014 20719 WA 2260302 1883585 1260708 1883585 376717
30619 5100002681 30619 90331015 20719 WA 73519295 61266079 4100618 61266079 12253216
10719 5100002679 30619 90331016 20719 WA 105593207 87994339 5719633 87994339 17598868
10719 5100002680 30619 90331017 20719 WA 82808594 69007162 4485466 69007162 13801432
10719 5100003245 10719 90332783 300719 WA 80358636 6696553 4447229 6696553 13393106
10719 5100003246 10719 90332782 300719 WA 102408262 85340218 5667505 85340218 17068044
10719 5100003247 10719 90332781 300719 WA 73498752 6124896 4067587 6124896 12249792
10719 5100003248 10719 90332780 300719 WA 22784614 18987178 1260952 18987178 3797436
****** WA 56357438 469645316 31009698 469645316 93929064
** **** 56357438 469645316 31009698 469645316 93929064
3 KLU5437 6781754415 BIRDS SERVICES LIMITED GREEN HOUSE REDCAR INDUSTEC4L 4HJ LONDON
tax type: CS tax code: CS
110619 5100002956 120619 1975674 90719 CS 1839932 17523288 91166 17523288 876032
10719 5100003373 120619 1975677 120719 CS 78940756 705990901 35886346 754108083 83416659
10719 5100003391 120619 1975675 120719 CS 643442103 61280197 31149443 61280197 30640133
****** CS 1451248983 1336316159 67947449 1384433341 114932824
tax type: W1 tax code: WA
110619 5100002956 120619 1975674 90719 WA 1839932 17523288 1185159 17523288 876032
10719 5100003373 120619 1975677 120719 WA 78940756 754108084 49831859 754108083 35299476
10719 5100003389 60619 1975671 120719 WA 368898403 368898403 24377001 368898403 0
10719 5100003391 120619 1975675 120719 WA 643442103 61280197 40494277 61280197 30640133
10719 5100003394 110619 1975678 120719 WA 1421290282 1421290283 93919609 1421290282 -1
10719 5100003513 120619 1975676 190719 WA 172718664 172718664 11434027 172718664 0
10719 5100003626 210619 1975693 260719 WA 276901444 25751819 17101966 276901444 19383254
****** WA 3691057776 3604858882 238343898 3624242134 86198894
tax type: X1 tax code: XA
110619 5100002956 120619 1975674 90719 XA 1839932 17523288 91167 17523288 876032
10719 5100003373 120619 1975677 120719 XA 78940756 754108084 383322 754108083 35299476
10719 5100003389 60619 1975671 120719 XA 368898403 368898403 1875154 368898403 0
10719 5100003391 120619 1975675 120719 XA 643442103 61280197 3114945 61280197 30640133
10719 5100003394 110619 1975678 120719 XA 1421290282 1421290283 7224586 1421290282 -1
10719 5100003513 120619 1975676 190719 XA 172718664 172718664 879541 172718664 0
10719 5100003626 210619 1975693 260719 XA 276901444 25751819 1315536 276901444 19383254
****** XA 3691057776 3604858882 18334149 3624242134 86198894
ACME LTD (SP) Report for Mexico Time 14:18:11 Date 04082019
Mexico *********/JOEOD Page 2
Cnno Acct no Tax number Address
3 KLU5437 6781754415 BIRDS SERVICES LIMITED GREEN HOUSE REDCAR INDUSTEC4L 4HJ LONDON
PstDte Docno DocDte Reference no ClgDte WT code Invoice amnt Base amount Withholdtax Net amount T x-exempt amt
** **** 3691057776 8546033923 324625496 3624242134 -4854976147
4 KLD15935 837960557 BOJACK GROUP LTD HORSEMAN HOUSE SHADWELLGH12 3BB ABERDEEN
tax type: W1 tax code: WA
10719 5100003296 290519 82620012754 90719 WA 6863606446 6863606446 443122606 6863606446 0
10719 5100003654 210619 82620013425 260719 WA 5854587092 585458709 381911219 5854587092 2
****** WA 12718193538 12718193536 825033825 12718193538 2
** **** 12718193538 12718193536 825033825 12718193538 2
5 HDH943859 Rover Energy Schweiz AG SWIZSTRASSE 345 1005 ZURICH
tax type: W1 tax code: WA
10719 5100003613 20419 2963427 260719 WA 2893481234 2893481234 190177614 2893481234 0
10719 5100003614 20419 2963426 260719 WA 2893481234 2893481234 190177614 2893481234 0
****** WA 5786962468 5786962468 380355228 5786962468 0
** **** 5786962468 5786962468 380355228 5786962468 0
I want to format the data into the following flat structure:
Cnno, Acct no, Tax number, Address, PstDte, Docno, DocDte, Reference no, clg date,tax type, WT code, Invoice amnt,Base amount,tax,Net amount,T x-exempt amt
Frankly, apart from loading the data into a dataframe and removing the blank rows, I have not got far. I have looked but can't seem to find any similar examples, so if anyone has links to tutorials dealing with similar data extraction issues, that would be great; if you have some ideas on how to tackle it, that would be a start.
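One possible way to tackle this, sketched here under assumptions rather than taken from the post, is to read the file line by line, remember the most recent parent line and tax type, and emit one flat row per child line. The function name `parse_report` and the three regular expressions are hypothetical. They assume the fields are whitespace-separated, account numbers are letters followed by digits, the tax number is optional and all digits, and dates are 5 or 6 digits long. All values are kept as strings.

```python
import re

import pandas as pd

# Hypothetical patterns, inferred only from the sample layout shown above.
PARENT = re.compile(r"^(\d{1,4})\s+([A-Z]+\d+)\s+(?:(\d+)\s+)?(.+)$")
TAX_TYPE = re.compile(r"^tax type:\s*(\S+)\s+tax code:\s*(\S+)")
CHILD = re.compile(
    r"^(\d{5,6})\s+(\d+)\s+(\d{5,6})\s+(\S+)\s+(\d{5,6})\s+(\S+)"
    r"\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)$"
)

PARENT_COLS = ["Cnno", "Acct no", "Tax number", "Address"]
CHILD_COLS = ["PstDte", "Docno", "DocDte", "Reference no", "ClgDte", "WT code",
              "Invoice amnt", "Base amount", "tax", "Net amount", "T x-exempt amt"]
# Target column order from the question, with "tax type" before "WT code".
COLUMNS = PARENT_COLS + CHILD_COLS[:5] + ["tax type"] + CHILD_COLS[5:]


def parse_report(lines):
    """Return one flat row per child line, carrying its parent fields along."""
    rows = []
    parent = {}
    tax_type = None
    for raw in lines:
        line = raw.strip()
        m = TAX_TYPE.match(line)
        if m:
            tax_type = m.group(1)
            continue
        m = CHILD.match(line)
        if m:
            child = dict(zip(CHILD_COLS, m.groups()))
            rows.append({**parent, "tax type": tax_type, **child})
            continue
        m = PARENT.match(line)
        if m:
            # A parent repeated after a page break simply overwrites itself.
            parent = dict(zip(PARENT_COLS, m.groups()))
            tax_type = None
        # Anything else (page headers, column headings, totals) is skipped.
    return pd.DataFrame(rows, columns=COLUMNS)


sample = """\
ACME LTD (SP) Report for Mexico Time 14:18:11 Date 04082019
Cnno Acct no Tax number Address
1 ABC3415 899111752 Kellys Hair ONE ST JOHNS CHURCHYARD ED45 8LP LONDON
tax type: W1 tax code: WA
80519 5100002076 70519 20006874 50719 WA 1156961002 1156961003 76311439 1156961002 -1
****** WA 3304528639 330452864 221267833 3304528639 -1
5 HDH943859 Rover Energy Schweiz AG SWIZSTRASSE 345 1005 ZURICH
tax type: W1 tax code: WA
10719 5100003613 20419 2963427 260719 WA 2893481234 2893481234 190177614 2893481234 0
""".splitlines()

flat = parse_report(sample)
```

If the real file is fixed-width rather than space-separated, the patterns would need adjusting; a parent whose address starts with digits and which has no tax number would also be misread by this sketch.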
So after looking at it more, the approach I took to clean this is as follows.
Load to a df. There are no headings, so the columns are just 0, 1, 2, ... with lots of NaN, etc.
Remove any rows which are all NaN
df2 = df.dropna(axis = 0, how ='all').copy()
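As a small illustration of this step, with made-up values standing in for the loaded file, `dropna(axis=0, how='all')` removes only the rows in which every cell is NaN and leaves partly filled rows alone:

```python
import numpy as np
import pandas as pd

# Toy frame: row 1 is completely empty, rows 0 and 2 are partly filled.
df = pd.DataFrame({
    0: ["ACME LTD", np.nan, np.nan],
    1: [np.nan, np.nan, "1"],
    2: [np.nan, np.nan, "ABC3415"],
})

df2 = df.dropna(axis=0, how="all").copy()
```

The original index labels are kept, so `df2` has rows 0 and 2.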
I wanted to keep the company name but not any of the other data, like the report title or the country, so I split the string to remove the text I didn't want, then created a mask for the rows containing Mexico and filtered the df to remove them
df2[0] = df2[0].str.split(' ').str[0]
mask = (df2[0] == 'Mexico')
df3 = df2[~mask].copy()
I then used ffill to copy the company name to each row of the df. (There are multiple company names; the report lists all the records for one company, then the next, and so on.)
df3[0] = df3[0].ffill()
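To show what the forward fill does, here is a toy column in which the company name appears only on its first row; the second name, "OTHER", is invented for the example. `Series.ffill()` is the method form of the same fill:

```python
import numpy as np
import pandas as pd

# Column 0 holds the company name only where a new company block starts.
df3 = pd.DataFrame({
    0: ["ACME", np.nan, np.nan, "OTHER", np.nan],
    1: [np.nan, "1", "80519", np.nan, "1"],
})

# Each NaN takes the last non-missing value above it.
df3[0] = df3[0].ffill()
```

This only works because the rows are still in file order; sorting before the fill would attach rows to the wrong company.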
Column [1] contains the data for the parent record Cnno and the child record PstDte. These are numerics stored as text, so I filtered this column using to_numeric. This removes all the heading and page-number rows which are repeated throughout the data, leaving just the parent and child rows.
df4 = df3[pd.to_numeric(df3[1], errors='coerce').notnull()].copy()
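The filter can be illustrated on a few made-up rows. With `errors='coerce'`, anything in column 1 that is not a number becomes NaN, so keeping the non-null positions keeps only the parent and child rows:

```python
import pandas as pd

# Column 1 mixes headings, totals markers and numeric text.
df3 = pd.DataFrame({
    0: ["ACME"] * 5,
    1: ["Cnno", "1", "PstDte", "80519", "******"],
})

numeric = pd.to_numeric(df3[1], errors="coerce")  # non-numbers become NaN
df4 = df3[numeric.notna()].copy()
```

Note that the kept values are still text here; the conversion is used only to build the mask.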
Then I created a new column 'Cnno' and populated it using
df4.loc[df4[1]<9999, 'Cnno'] = df4[1]
Both Cnno and PstDte are numbers, but as PstDte is a 'date' its minimum length is 5, and Cnno is never longer than 4 digits, so I could use that to separate the parent rows from the child rows.
As each parent row is followed by its children in the dataframe, I could use ffill on 'Cnno' to copy the parent Cnno down to its children and associate the records
df4['Cnno'] = df4['Cnno'].ffill()
I then created a parent column to identify the parent records (not strictly necessary)
df4['Parent'] = (df4[1]<9999).astype(int)
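The last three steps can be sketched together on toy data. This version assumes column 1 has first been converted to numbers, since comparing text with 9999 would fail, and it uses `Series.where` in place of the `.loc` assignment to build the 'Cnno' column:

```python
import pandas as pd

# Column 1 after filtering: parent Cnno values mixed with child PstDte values.
df4 = pd.DataFrame({1: ["1", "80519", "10619", "2", "30619"]})
df4[1] = pd.to_numeric(df4[1])

# Cnno has at most 4 digits, PstDte at least 5, so 9999 separates them.
is_parent = df4[1] < 9999

# Keep the value on parent rows only, then copy it down to the children.
df4["Cnno"] = df4[1].where(is_parent).ffill()
df4["Parent"] = is_parent.astype(int)
```

Because of the intermediate NaNs, 'Cnno' ends up as a float column; cast it back with `astype(int)` if whole numbers are needed.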
Then I filtered on the parent column and copied the data to a new df, removed any empty columns, dropped the old Cnno data in column [1], and added new column headings for the rest. As the parent row is repeated whenever there is a new page in the original file, there were multiple rows of the same data, so I dropped duplicates, keeping only the first
Parent = df4[df4['Parent'] == 1].copy()
Parent = Parent.dropna(axis=1, how='all')
Parent = Parent.drop(Parent.columns[1] , axis=1)
Parent.columns = ['Company','Account No','Tax Code','Vendor Address','Cnno','Parent']
Parent.drop_duplicates(keep='first', inplace=True)
This then gives a clean df of just the parent records
Company, Account No, Tax Code, Vendor Address, Cnno, Parent
5 ACME Ltd, ABC3415, 899111752, Kellys Hair ONE ST JOHNS CHURCHYARD ED45 8LP LONDON, 1, 1
18 ACME Ltd, BFG4919, 7880487069, SPA LTD OHNSON HOUSE GREENBY SQHH1 3DF READING, 2, 1
I then basically did the same with the child records
Children = df4[df4['Parent'] != 1].copy()
Children = Children.dropna(axis=1, how='all')
Children.columns = ['Company','PstDte', 'DocNo','DocDte','Reference no','ClgDte','WT code','Invoice amnt','Base amount','tax','Net amount','T x-exempt amt','Cnno','Parent']
This gave me a clean df of all the child records. I then merged the parent and child records, using Company and Cnno as the key
Final = pd.merge(Parent, Children, how='left', left_on=['Company','Cnno'], right_on = ['Company','Cnno'])
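A cut-down illustration of the merge, using invented frames with only a few of the columns. Both frames carry a 'Parent' flag column, which pandas would otherwise return as 'Parent_x' and 'Parent_y', so this sketch drops it from both sides first; that drop is an addition, not part of the steps above:

```python
import pandas as pd

Parent = pd.DataFrame({
    "Company": ["ACME", "ACME"],
    "Account No": ["ABC3415", "BFG4919"],
    "Cnno": [1, 2],
    "Parent": [1, 1],
})
Children = pd.DataFrame({
    "Company": ["ACME"] * 3,
    "PstDte": ["80519", "10619", "30619"],
    "Cnno": [1, 1, 2],
    "Parent": [0, 0, 0],
})

# on=[...] is shorthand for identical left_on/right_on key lists.
Final = pd.merge(
    Parent.drop(columns="Parent"),
    Children.drop(columns="Parent"),
    how="left",
    on=["Company", "Cnno"],
)
```

With `how='left'`, a parent that has no child rows would still appear once, with NaN in the child columns.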
After that it was just a case of formatting each of the date columns and tidying up any other bits of formatting, dtypes, etc.
Final['PstDte'] = Final['PstDte'].apply(lambda x: pd.to_datetime(str(x), format='%d%m%y'))
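A vectorised variant of the date conversion is sketched below. It assumes the dates are DDMMYY with the leading zero of the day dropped (so 80519 means 8 May 2019), which is an inference from the sample rather than something the report states. Padding to six digits first makes the format unambiguous:

```python
import pandas as pd

dates = pd.Series(["80519", "110619", "10719"])

# Restore the dropped leading zero, then parse the whole column at once.
parsed = pd.to_datetime(dates.str.zfill(6), format="%d%m%y")
```

If the column holds integers rather than text, `dates.astype(str)` would be needed before `str.zfill`.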