I have the following dataframe called v.
date x1 x2 x3 x4 x5 dname
1 20200705 8119 8013 8133 8031 100806 D1
2 20200706 8031 7950 8271 8200 443809 D1
3 20200707 8200 8188 8281 8217 303151 D1
4 20200708 8217 8200 8365 8334 509629 D1
5 20200709 8334 8139 8370 8204 588634 D1
.................................
55 20221216 17340 16675 17525 16775 7266 D2
56 20221219 16690 16395 16770 16495 4393 D2
57 20221220 16325 16275 17095 16840 5601 D2
58 20221221 16870 16670 16885 16735 2295 D2
59 20221222 16725 16470 16850 16485 3359 D2
.................................
125 20200705 9131 9000 9146 9014 D3
126 20200706 9014 8918 9352 9277 D3
127 20200707 9277 9207 9379 9255 D3
128 20200708 9255 9231 9473 9430 D3
129 20200709 9430 9165 9472 9237 D3
.................................
500 20221218 1179 1173 1197 1183 D7
501 20221219 1183 1165 1195 1176 D7
502 20221220 1176 1151 1229 1216 D7
503 20221221 1216 1204 1222 1212 D7
504 20221222 1212 1183 1221 1186 D7
.................................
992 D8
993 20200721 181 D9
994 20200818 50 D9
995 20200831 96 D9
996 20200925 84 D9
.................................
1006 20220705 36 D11
1007 20220718 48 D11
1008 20220728 22 D11
1009 20220818 68 D11
1010 20220923 108 D11
As you can see, certain columns are missing. Sometimes x1 - x4 are missing, sometimes x5 is missing, and sometimes x2 - x3 are missing. When a value is missing, the cell contains a blank space character.
I want to split this into one dataframe per group, where the rows are grouped by which columns they have. For example, all rows that have every column go in one frame, the rows without x5 get their own frame, etc.
Right now I am manually programming each case. Is there a way to do this dynamically?
Here is my code:
import pandas as pd
v = pd.read_csv(filepath)
d1 = v[v.x5 == " "]
d2 = v[v.x5 != " "]
# Parentheses are required because & binds more tightly than !=
d3 = v[(v.x2 != " ") & (v.x3 != " ")]
I also have to check manually which combinations of missing columns exist before I do that, and I have many dataframes like this.
Is there a faster, more efficient way to do it, so that I end up with multiple dataframes like the ones below, where each dataframe only has columns with no missing data?
df1
date x1 x2 x3 x4 x5 dname
1 20200705 8119 8013 8133 8031 100806 D1
2 20200706 8031 7950 8271 8200 443809 D1
3 20200707 8200 8188 8281 8217 303151 D1
4 20200708 8217 8200 8365 8334 509629 D1
5 20200709 8334 8139 8370 8204 588634 D1
.................................
55 20221216 17340 16675 17525 16775 7266 D2
56 20221219 16690 16395 16770 16495 4393 D2
57 20221220 16325 16275 17095 16840 5601 D2
58 20221221 16870 16670 16885 16735 2295 D2
59 20221222 16725 16470 16850 16485 3359 D2
df2
date x1 x2 x3 x4 dname
125 20200705 9131 9000 9146 9014 D3
126 20200706 9014 8918 9352 9277 D3
127 20200707 9277 9207 9379 9255 D3
128 20200708 9255 9231 9473 9430 D3
129 20200709 9430 9165 9472 9237 D3
.................................
500 20221218 1179 1173 1197 1183 D7
501 20221219 1183 1165 1195 1176 D7
502 20221220 1176 1151 1229 1216 D7
503 20221221 1216 1204 1222 1212 D7
504 20221222 1212 1183 1221 1186 D7
etc.
If you're trying to check which columns have missing values, you can create an indicator column that lists the missing columns for each row, using this code:
df['group'] = df.isna().apply(lambda x: ','.join(set(x[x].to_dict().keys())), axis = 1)
This will give you a df similar to this:
date x1 x2 x3 x4 x5 dname group
1 20200705 8119.0 8013.0 8133.0 8031.0 100806.0 D1
2 20200706 8031.0 7950.0 8271.0 8200.0 443809.0 D1
3 20200707 8188.0 8281.0 8217.0 303151.0 D1 x1
4 20200708 8365.0 8334.0 509629.0 D1 x1,x2
5 20200709 8370.0 8204.0 588634.0 D1 x1,x2
55 20221216 17340.0 17525.0 16775.0 7266.0 D2 x2
56 20221219 16690.0 16770.0 16495.0 4393.0 D2 x2
57 20221220 16325.0 16275.0 17095.0 16840.0 5601.0 D2
58 20221221 16870.0 16670.0 16885.0 16735.0 2295.0 D2
59 20221222 16725.0 16850.0 16485.0 3359.0 D2 x2
125 20200705 9131.0 9146.0 9014.0 D3 x5,x2
126 20200706 9014.0 9352.0 9277.0 D3 x5,x2
127 20200707 9277.0 D3 x5,x2,x3,x4
128 20200708 9255.0 D3 x5,x2,x3,x4
129 20200709 9430.0 D3 x5,x2,x3,x4
500 20221218 1179.0 1173.0 1197.0 1183.0 D3 x5
501 20221219 1183.0 1165.0 1195.0 1176.0 D3 x5
502 20221220 1176.0 1229.0 D3 x5,x2,x4
503 20221221 1216.0 D3 x5,x2,x3,x4
504 20221222 1212.0 1183.0 1221.0 D3 x5,x4
You can then split the df on the unique values of this indicator column, appending each piece to a single list:
output = []
for group in df['group'].unique():
    output.append(df[df['group'] == group].copy())
The output list should contain n dataframes, where n is the number of unique combinations of missing columns in your original df.
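As a variation on the loop above, here is a minimal sketch that does the split with groupby and also drops the columns that are entirely missing in each piece, so every resulting frame keeps only the columns it actually has. The sample data and the names frames, key and part are made up for illustration, and it assumes the missing values are already NaN.

```python
import numpy as np
import pandas as pd

# Small made-up sample; NaN marks a missing value.
df = pd.DataFrame({
    "date": [20200705, 20200706, 20200705, 20200706, 20200721],
    "x1": [8119, 8031, 9131, 9014, np.nan],
    "x2": [8013, 7950, 9000, 8918, np.nan],
    "x5": [100806, 443809, np.nan, np.nan, 181],
    "dname": ["D1", "D1", "D3", "D3", "D9"],
})

# Indicator column: comma-separated names of the missing columns, in column order.
mask = df.isna()
df["group"] = mask.apply(lambda row: ",".join(row.index[row.to_numpy()]), axis=1)

# One frame per missing-column pattern; drop the helper column and
# any column that is entirely NaN within that pattern.
frames = {
    key: part.drop(columns="group").dropna(axis=1, how="all")
    for key, part in df.groupby("group")
}
```

Here frames is keyed by the indicator string, so frames["x5"] would hold the rows that are only missing x5, and the key "" holds the rows with no missing columns.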
Note: you could replace df.isna() with any other expression that returns a dataframe of True/False values. Since your missing values are blank spaces, a comparison such as df == " " would work:
df['group'] = (df == " ").apply(lambda x: ','.join(set(x[x].to_dict().keys())), axis = 1)
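Another option for the blank-space case is to convert the blanks to NaN first, so that isna() works unchanged. The sketch below assumes the missing cells are strings containing only whitespace; the sample data and the names raw, cleaned and missing are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up sample where missing values are stored as a single blank space.
raw = pd.DataFrame({
    "date": ["20200705", "20200705", "20200721"],
    "x1": ["8119", "9131", " "],
    "x5": ["100806", " ", "181"],
    "dname": ["D1", "D3", "D9"],
})

# Turn empty or whitespace-only strings into NaN so isna() can detect them.
cleaned = raw.replace(r"^\s*$", np.nan, regex=True)

# Boolean frame marking the missing cells, usable in place of df.isna().
missing = cleaned.isna()
```

It may also be possible to mark the blanks as missing at load time by passing na_values=[" "] to pd.read_csv, though I have not checked that against your file.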