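python pandas analyze dataframe
I have this dataset (a 105233-row x 32-column matrix) from which I dropped the first column with .drop. At this point, what I need to do is analyze each row (an array of 32 components) and find the rows whose first 16 entries are equal to the last 16.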
{
import pandas as pd
import numpy as np
data = pd.read_csv('enummixed.txt', header = None, low_memory=False)
data = data.drop(data.columns[[0]], axis=1)
print(data)
1 2 3 4 5 6 7 8 9 10 ... \
0 1 0 0 0 0 0 0 0 0 0 ...
1 1 0 0 0 0 0 0 0 0 0 ...
2 1 0 0 0 0 0 0 0 0 0 ...
3 1 0 0 0 0 0 0 0 0 0 ...
4 1 0 0 0 0 0 0 0 0 0 ...
5 1 0 0 0 0 0 0 0 0 0 ...
6 1 0 0 0 0 0 0 0 0 0 ...
7 1 0 0 0 0 0 0 0 0 0 ...
8 106/243 137/243 0 0 0 0 0 0 0 0 ...
9 106/243 137/243 0 0 0 0 0 0 0 0 ...
10 106/243 137/243 0 0 0 0 0 0 0 0 ...
11 106/243 137/243 0 0 0 0 0 0 0 0 ...
12 106/243 137/243 0 0 0 0 0 0 0 0 ...
13 106/243 137/243 0 0 0 0 0 0 0 0 ...
14 106/243 137/243 0 0 0 0 0 0 0 0 ...
15 106/243 137/243 0 0 0 0 0 0 0 0 ...
16 106/243 137/243 0 0 0 0 0 0 0 0 ...
17 106/243 137/243 0 0 0 0 0 0 0 0 ...
18 106/243 137/243 0 0 0 0 0 0 0 0 ...
19 106/243 137/243 0 0 0 0 0 0 0 0 ...
20 106/243 137/243 0 0 0 0 0 0 0 0 ...
21 106/243 137/243 0 0 0 0 0 0 0 0 ...
22 106/243 137/243 0 0 0 0 0 0 0 0 ...
23 106/243 137/243 0 0 0 0 0 0 0 0 ...
24 106/243 137/243 0 0 0 0 0 0 0 0 ...
25 106/243 137/243 0 0 0 0 0 0 0 0 ...
26 106/243 137/243 0 0 0 0 0 0 0 0 ...
27 106/243 137/243 0 0 0 0 0 0 0 0 ...
28 106/243 137/243 0 0 0 0 0 0 0 0 ...
29 106/243 137/243 0 0 0 0 0 0 0 0 ...
... ... ... .. .. .. .. .. .. .. .. ...
105203 0 0 0 0 0 0 0 0 0 0 ...
105204 0 0 0 0 0 0 0 0 0 0 ...
105205 0 0 0 0 0 0 0 0 0 0 ...
105206 0 0 0 0 0 0 0 0 0 0 ...
105207 0 0 0 0 0 0 0 0 0 0 ...
105208 0 0 0 0 0 0 0 0 0 0 ...
105209 0 0 0 0 0 0 0 0 0 0 ...
105210 0 0 0 0 0 0 0 0 0 0 ...
105211 0 0 0 0 0 0 0 0 0 0 ...
105212 0 0 0 0 0 0 0 0 0 0 ...
105213 0 0 0 0 0 0 0 0 0 0 ...
105214 0 0 0 0 0 0 0 0 0 0 ...
105215 0 0 0 0 0 0 0 0 0 0 ...
105216 0 0 0 0 0 0 0 0 0 0 ...
105217 0 0 0 0 0 0 0 0 0 0 ...
105218 0 0 0 0 0 0 0 0 0 0 ...
105219 0 0 0 0 0 0 0 0 0 0 ...
105220 0 0 0 0 0 0 0 0 0 0 ...
105221 0 0 0 0 0 0 0 0 0 0 ...
105222 0 0 0 0 0 0 0 0 0 0 ...
105223 0 0 0 0 0 0 0 0 0 0 ...
105224 0 0 0 0 0 0 0 0 0 0 ...
105225 0 0 0 0 0 0 0 0 0 0 ...
105226 0 0 0 0 0 0 0 0 0 0 ...
105227 0 0 0 0 0 0 0 0 0 0 ...
105228 0 0 0 0 0 0 0 0 0 0 ...
105229 0 0 0 0 0 0 0 0 0 0 ...
105230 0 0 0 0 0 0 0 0 0 0 ...
105231 0 0 0 0 0 0 0 0 0 0 ...
105232 0 0 0 0 0 0 0 0 0 0 ...
23 24 25 26 \
0 0 0 0 0
1 395/543 0 0 0
2 0 0 0 0
3 29449/110942 0 0 0
4 0 0 0 0
5 41459/81005 0 0 0
6 0 0 0 0
7 4133206/15626431 0 0 0
8 0 0 0 0
9 0 0 0 0
10 41459/81005 0 0 0
11 6359221/17955721 0 0 0
12 0 0 41459/81005 0
13 0 0 6359221/17955721 0
14 0 0 0 0
15 4133206/15626431 0 0 0
16 0 0 4133206/15626431 0
17 0 0 0 0
18 0 0 0 0
19 41459/81005 0 0 0
20 6359221/17955721 0 0 0
21 0 0 41459/81005 0
22 0 0 6359221/17955721 0
23 0 0 0 0
24 4133206/15626431 0 0 0
25 0 0 4133206/15626431 0
26 0 0 0 0
27 0 0 0 0
28 41459/81005 0 0 0
29 6359221/17955721 0 0 0
... ... ... ... ..
105203 0 41459/81005 0 0
105204 0 6359221/17955721 0 0
105205 0 6359221/17955721 0 0
105206 0 0 41459/81005 0
105207 0 0 6359221/17955721 0
105208 0 0 6359221/17955721 0
105209 0 395/543 0 0
105210 0 23702/64201 0 0
105211 0 23702/64201 0 0
105212 0 0 395/543 0
105213 0 0 23702/64201 0
105214 0 0 23702/64201 0
105215 0 41459/81005 0 0
105216 0 6359221/17955721 0 0
105217 0 6359221/17955721 0 0
105218 0 0 41459/81005 0
105219 0 0 6359221/17955721 0
105220 0 0 6359221/17955721 0
105221 0 41459/81005 0 0
105222 0 6359221/17955721 0 0
105223 0 6359221/17955721 0 0
105224 0 0 41459/81005 0
105225 0 0 6359221/17955721 0
105226 0 0 6359221/17955721 0
105227 0 395/543 0 0
105228 0 23702/64201 0 0
105229 0 23702/64201 0 0
105230 0 0 395/543 0
105231 0 0 23702/64201 0
105232 0 0 23702/64201 0
27 28 29 30 31 \
0 0 0 0 0 0
1 0 0 0 0 0
2 0 57/74 0 0 0
3 0 63397/110942 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 0 49467/72995 0 0 0
7 0 7658739/15626431 0 0 0
8 0 0 0 0 0
9 0 0 0 0 0
10 0 0 0 0 0
11 0 0 0 0 0
12 0 0 0 0 0
13 0 0 0 0 0
14 0 49467/72995 0 0 0
15 0 7658739/15626431 0 0 0
16 0 7658739/15626431 0 0 0
17 0 0 0 0 0
18 0 0 0 0 0
19 0 0 0 0 0
20 0 0 0 0 0
21 0 0 0 0 0
22 0 0 0 0 0
23 0 49467/72995 0 0 0
24 0 7658739/15626431 0 0 0
25 0 7658739/15626431 0 0 0
26 0 0 0 0 106/243
27 0 0 0 0 16031/72995
28 0 0 0 0 3143/16201
29 0 0 0 0 2375174/17955721
... ... ... .. .. ...
105203 3143/16201 0 0 0 0
105204 2375174/17955721 0 0 0 0
105205 2375174/17955721 0 0 0 0
105206 3143/16201 0 0 0 0
105207 2375174/17955721 0 0 0 0
105208 2375174/17955721 0 0 0 0
105209 148/543 0 0 0 0
105210 17601/128402 0 0 0 0
105211 17601/128402 0 0 0 0
105212 148/543 0 0 0 0
105213 17601/128402 0 0 0 0
105214 17601/128402 0 0 0 0
105215 0 0 0 0 3143/16201
105216 0 0 0 0 2375174/17955721
105217 0 0 0 0 2375174/17955721
105218 0 0 0 0 3143/16201
105219 0 0 0 0 2375174/17955721
105220 0 0 0 0 2375174/17955721
105221 0 0 0 0 3143/16201
105222 0 0 0 0 2375174/17955721
105223 0 0 0 0 2375174/17955721
105224 0 0 0 0 3143/16201
105225 0 0 0 0 2375174/17955721
105226 0 0 0 0 2375174/17955721
105227 0 0 0 0 148/543
105228 0 0 0 0 17601/128402
105229 0 0 0 0 17601/128402
105230 0 0 0 0 148/543
105231 0 0 0 0 17601/128402
105232 0 0 0 0 17601/128402
32
0 0
1 0
2 0
3 0
4 137/243
5 23831/81005
6 7497/72995
7 1421917/15626431
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 137/243
18 7497/72995
19 23831/81005
20 1562587/17955721
21 23831/81005
22 1562587/17955721
23 7497/72995
24 1421917/15626431
25 1421917/15626431
26 0
27 0
28 0
29 0
... ...
105203 0
105204 0
105205 0
105206 0
105207 0
105208 0
105209 0
105210 0
105211 0
105212 0
105213 0
105214 0
105215 0
105216 0
105217 0
105218 0
105219 0
105220 0
105221 0
105222 0
105223 0
105224 0
105225 0
105226 0
105227 0
105228 0
105229 0
105230 0
105231 0
105232 0
[105233 rows x 32 columns]
}
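Unfortunately I am not very experienced with this, so I am asking for help. Best, Nicolò
I'm sure there is a simpler, shorter, more elegant, more pythonic way to solve this, but in the meantime here is one solution. It returns a df containing only the rows whose first 16 entries are identical to the last 16. Here is an example with just a few rows and columns: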
import pandas as pd
df = pd.DataFrame({'a': [4, 2, 4, 5, 5, 4],
                   'b': [4, 3, 1, 2, 2, 4],
                   'c': [1, 2, 4, 5, 5, 3],
                   'd': [4, 3, 2, 2, 2, 4]})
print(df)
a b c d
0 4 4 1 4
1 2 3 2 3
2 4 1 4 2
3 5 2 5 2
4 5 2 5 2
5 4 4 3 4
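df_a = df.iloc[:, :2]          # first half of the columns
df_b = df.iloc[:, 2:]          # second half of the columns
df_b.columns = df_a.columns    # same labels, so the halves line up element-wise
c = (df_b - df_a) != 0         # True wherever the two halves differ
df_a = df_a.mask(c)            # differing entries become NaN
# positions of the rows that contain at least one mismatch
a = pd.isnull(df_a).any(axis=1).to_numpy().nonzero()[0]
df = df.drop(df.index[a])
Output: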
a b c d
1 2 3 2 3
3 5 2 5 2
4 5 2 5 2
In your case:
df_a = df.iloc[:,:16]
df_b = df.iloc[:,16:]
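Note that the subtraction step only works on numeric columns. The data in the question contains fraction strings such as 106/243, and subtracting one string from another raises a TypeError. A minimal sketch of the same idea using equality instead of subtraction follows. The inline sample (an id column plus 2 + 2 values instead of 16 + 16) and the names raw, half and symmetric are assumptions made for illustration, not part of the original post:

```python
import io
import pandas as pd

# Hypothetical stand-in for 'enummixed.txt': an id column followed by
# 4 values (2 + 2 instead of 16 + 16) so the result is easy to check by eye.
raw = io.StringIO(
    "r0,1,0,1,0\n"
    "r1,106/243,137/243,106/243,137/243\n"
    "r2,106/243,0,0,106/243\n"
    "r3,0,0,0,0\n"
)

# dtype=str keeps every cell as text, so "0" and "106/243" are compared
# the same way in every column.
data = pd.read_csv(raw, header=None, dtype=str)
data = data.drop(data.columns[[0]], axis=1)

half = data.shape[1] // 2  # would be 16 for the 32-column data

first = data.iloc[:, :half].to_numpy()
second = data.iloc[:, half:].to_numpy()

# One boolean per row: True when every entry of the first half equals
# the entry at the same position in the second half.
symmetric = (first == second).all(axis=1)

result = data[symmetric]
print(result)
```

Here rows 0, 1 and 3 are kept and row 2 is dropped. The comparison is purely textual, so it assumes equal fractions are always written identically (2/4 and 1/2 would count as different).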
Thanks for your answers. For one reason or another none of them worked, but they were useful. I found an old-fashioned solution that is not very elegant but works:
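import pandas as pd

data = pd.read_csv('enummixed.txt', header=None, low_memory=False)
data = data.drop(data.columns[[0]], axis=1)

# Open the output file once instead of reopening it for every matching row.
with open("symmetric_ne.txt", "a") as out:
    for i in range(len(data)):
        k = 0
        # range(16) checks all 16 pairs; range(0, 15) skipped the pair (15, 31).
        for j in range(16):
            if data.iloc[i, j] == data.iloc[i, j + 16]:
                k += 1
        if k == 16:
            print(data.iloc[i], file=out)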
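One thing that may be worth checking with a cell-by-cell comparison like this: when read_csv infers types, a column holding only integers is parsed as numbers, while a column that also holds a fraction such as 1/2 stays text. A 0 in the first kind of column then does not compare equal to a "0" in the second kind. The sketch below illustrates this on a tiny made-up sample (the two-line input and the variable names are assumptions for illustration); reading with dtype=str is one way to make the comparison consistent:

```python
import io
import pandas as pd

# Made-up two-column sample: column 0 holds only integers,
# column 1 also holds a fraction written as text.
text = "0,0\n0,1/2\n"

inferred = pd.read_csv(io.StringIO(text), header=None)
as_text = pd.read_csv(io.StringIO(text), header=None, dtype=str)

# With inferred types, this compares the integer 0 with the string "0".
inferred_equal = bool(inferred.iloc[0, 0] == inferred.iloc[0, 1])

# With dtype=str, both cells are the string "0".
text_equal = bool(as_text.iloc[0, 0] == as_text.iloc[0, 1])

print(inferred_equal, text_equal)
```

If the real file has columns of both kinds, rows that look symmetric on screen could be missed by the loop unless every column is read as text.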