I have this dataset (a 105233 rows x 32 columns matrix, after deleting the first column with .drop). What I need to do now is examine each row (an array of 32 components) and find the rows in which the first 16 terms are equal to the last 16.
{
import pandas as pd
import numpy as np
data = pd.read_csv('enummixed.txt', header = None, low_memory=False)
data = data.drop(data.columns[[0]], axis=1)
print(data)
1 2 3 4 5 6 7 8 9 10 ... \
0 1 0 0 0 0 0 0 0 0 0 ...
1 1 0 0 0 0 0 0 0 0 0 ...
2 1 0 0 0 0 0 0 0 0 0 ...
3 1 0 0 0 0 0 0 0 0 0 ...
4 1 0 0 0 0 0 0 0 0 0 ...
5 1 0 0 0 0 0 0 0 0 0 ...
6 1 0 0 0 0 0 0 0 0 0 ...
7 1 0 0 0 0 0 0 0 0 0 ...
8 106/243 137/243 0 0 0 0 0 0 0 0 ...
9 106/243 137/243 0 0 0 0 0 0 0 0 ...
10 106/243 137/243 0 0 0 0 0 0 0 0 ...
11 106/243 137/243 0 0 0 0 0 0 0 0 ...
12 106/243 137/243 0 0 0 0 0 0 0 0 ...
13 106/243 137/243 0 0 0 0 0 0 0 0 ...
14 106/243 137/243 0 0 0 0 0 0 0 0 ...
15 106/243 137/243 0 0 0 0 0 0 0 0 ...
16 106/243 137/243 0 0 0 0 0 0 0 0 ...
17 106/243 137/243 0 0 0 0 0 0 0 0 ...
18 106/243 137/243 0 0 0 0 0 0 0 0 ...
19 106/243 137/243 0 0 0 0 0 0 0 0 ...
20 106/243 137/243 0 0 0 0 0 0 0 0 ...
21 106/243 137/243 0 0 0 0 0 0 0 0 ...
22 106/243 137/243 0 0 0 0 0 0 0 0 ...
23 106/243 137/243 0 0 0 0 0 0 0 0 ...
24 106/243 137/243 0 0 0 0 0 0 0 0 ...
25 106/243 137/243 0 0 0 0 0 0 0 0 ...
26 106/243 137/243 0 0 0 0 0 0 0 0 ...
27 106/243 137/243 0 0 0 0 0 0 0 0 ...
28 106/243 137/243 0 0 0 0 0 0 0 0 ...
29 106/243 137/243 0 0 0 0 0 0 0 0 ...
... ... ... .. .. .. .. .. .. .. .. ...
105203 0 0 0 0 0 0 0 0 0 0 ...
105204 0 0 0 0 0 0 0 0 0 0 ...
105205 0 0 0 0 0 0 0 0 0 0 ...
105206 0 0 0 0 0 0 0 0 0 0 ...
105207 0 0 0 0 0 0 0 0 0 0 ...
105208 0 0 0 0 0 0 0 0 0 0 ...
105209 0 0 0 0 0 0 0 0 0 0 ...
105210 0 0 0 0 0 0 0 0 0 0 ...
105211 0 0 0 0 0 0 0 0 0 0 ...
105212 0 0 0 0 0 0 0 0 0 0 ...
105213 0 0 0 0 0 0 0 0 0 0 ...
105214 0 0 0 0 0 0 0 0 0 0 ...
105215 0 0 0 0 0 0 0 0 0 0 ...
105216 0 0 0 0 0 0 0 0 0 0 ...
105217 0 0 0 0 0 0 0 0 0 0 ...
105218 0 0 0 0 0 0 0 0 0 0 ...
105219 0 0 0 0 0 0 0 0 0 0 ...
105220 0 0 0 0 0 0 0 0 0 0 ...
105221 0 0 0 0 0 0 0 0 0 0 ...
105222 0 0 0 0 0 0 0 0 0 0 ...
105223 0 0 0 0 0 0 0 0 0 0 ...
105224 0 0 0 0 0 0 0 0 0 0 ...
105225 0 0 0 0 0 0 0 0 0 0 ...
105226 0 0 0 0 0 0 0 0 0 0 ...
105227 0 0 0 0 0 0 0 0 0 0 ...
105228 0 0 0 0 0 0 0 0 0 0 ...
105229 0 0 0 0 0 0 0 0 0 0 ...
105230 0 0 0 0 0 0 0 0 0 0 ...
105231 0 0 0 0 0 0 0 0 0 0 ...
105232 0 0 0 0 0 0 0 0 0 0 ...
23 24 25 26 \
0 0 0 0 0
1 395/543 0 0 0
2 0 0 0 0
3 29449/110942 0 0 0
4 0 0 0 0
5 41459/81005 0 0 0
6 0 0 0 0
7 4133206/15626431 0 0 0
8 0 0 0 0
9 0 0 0 0
10 41459/81005 0 0 0
11 6359221/17955721 0 0 0
12 0 0 41459/81005 0
13 0 0 6359221/17955721 0
14 0 0 0 0
15 4133206/15626431 0 0 0
16 0 0 4133206/15626431 0
17 0 0 0 0
18 0 0 0 0
19 41459/81005 0 0 0
20 6359221/17955721 0 0 0
21 0 0 41459/81005 0
22 0 0 6359221/17955721 0
23 0 0 0 0
24 4133206/15626431 0 0 0
25 0 0 4133206/15626431 0
26 0 0 0 0
27 0 0 0 0
28 41459/81005 0 0 0
29 6359221/17955721 0 0 0
... ... ... ... ..
105203 0 41459/81005 0 0
105204 0 6359221/17955721 0 0
105205 0 6359221/17955721 0 0
105206 0 0 41459/81005 0
105207 0 0 6359221/17955721 0
105208 0 0 6359221/17955721 0
105209 0 395/543 0 0
105210 0 23702/64201 0 0
105211 0 23702/64201 0 0
105212 0 0 395/543 0
105213 0 0 23702/64201 0
105214 0 0 23702/64201 0
105215 0 41459/81005 0 0
105216 0 6359221/17955721 0 0
105217 0 6359221/17955721 0 0
105218 0 0 41459/81005 0
105219 0 0 6359221/17955721 0
105220 0 0 6359221/17955721 0
105221 0 41459/81005 0 0
105222 0 6359221/17955721 0 0
105223 0 6359221/17955721 0 0
105224 0 0 41459/81005 0
105225 0 0 6359221/17955721 0
105226 0 0 6359221/17955721 0
105227 0 395/543 0 0
105228 0 23702/64201 0 0
105229 0 23702/64201 0 0
105230 0 0 395/543 0
105231 0 0 23702/64201 0
105232 0 0 23702/64201 0
27 28 29 30 31 \
0 0 0 0 0 0
1 0 0 0 0 0
2 0 57/74 0 0 0
3 0 63397/110942 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 0 49467/72995 0 0 0
7 0 7658739/15626431 0 0 0
8 0 0 0 0 0
9 0 0 0 0 0
10 0 0 0 0 0
11 0 0 0 0 0
12 0 0 0 0 0
13 0 0 0 0 0
14 0 49467/72995 0 0 0
15 0 7658739/15626431 0 0 0
16 0 7658739/15626431 0 0 0
17 0 0 0 0 0
18 0 0 0 0 0
19 0 0 0 0 0
20 0 0 0 0 0
21 0 0 0 0 0
22 0 0 0 0 0
23 0 49467/72995 0 0 0
24 0 7658739/15626431 0 0 0
25 0 7658739/15626431 0 0 0
26 0 0 0 0 106/243
27 0 0 0 0 16031/72995
28 0 0 0 0 3143/16201
29 0 0 0 0 2375174/17955721
... ... ... .. .. ...
105203 3143/16201 0 0 0 0
105204 2375174/17955721 0 0 0 0
105205 2375174/17955721 0 0 0 0
105206 3143/16201 0 0 0 0
105207 2375174/17955721 0 0 0 0
105208 2375174/17955721 0 0 0 0
105209 148/543 0 0 0 0
105210 17601/128402 0 0 0 0
105211 17601/128402 0 0 0 0
105212 148/543 0 0 0 0
105213 17601/128402 0 0 0 0
105214 17601/128402 0 0 0 0
105215 0 0 0 0 3143/16201
105216 0 0 0 0 2375174/17955721
105217 0 0 0 0 2375174/17955721
105218 0 0 0 0 3143/16201
105219 0 0 0 0 2375174/17955721
105220 0 0 0 0 2375174/17955721
105221 0 0 0 0 3143/16201
105222 0 0 0 0 2375174/17955721
105223 0 0 0 0 2375174/17955721
105224 0 0 0 0 3143/16201
105225 0 0 0 0 2375174/17955721
105226 0 0 0 0 2375174/17955721
105227 0 0 0 0 148/543
105228 0 0 0 0 17601/128402
105229 0 0 0 0 17601/128402
105230 0 0 0 0 148/543
105231 0 0 0 0 17601/128402
105232 0 0 0 0 17601/128402
32
0 0
1 0
2 0
3 0
4 137/243
5 23831/81005
6 7497/72995
7 1421917/15626431
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 137/243
18 7497/72995
19 23831/81005
20 1562587/17955721
21 23831/81005
22 1562587/17955721
23 7497/72995
24 1421917/15626431
25 1421917/15626431
26 0
27 0
28 0
29 0
... ...
105203 0
105204 0
105205 0
105206 0
105207 0
105208 0
105209 0
105210 0
105211 0
105212 0
105213 0
105214 0
105215 0
105216 0
105217 0
105218 0
105219 0
105220 0
105221 0
105222 0
105223 0
105224 0
105225 0
105226 0
105227 0
105228 0
105229 0
105230 0
105231 0
105232 0
[105233 rows x 32 columns]
}
Unfortunately I am not very experienced with this, so I am asking for help. Best, Nicolò
I am sure there is a simpler, shorter, more elegant and more Pythonic way to solve this, but in the meantime here is a solution. It returns the DataFrame containing only the rows in which the first 16 terms are the same as the last 16. Here is an example with a few rows and columns:
import pandas as pd

df = pd.DataFrame({'a': [4, 2, 4, 5, 5, 4],
                   'b': [4, 3, 1, 2, 2, 4],
                   'c': [1, 2, 4, 5, 5, 3],
                   'd': [4, 3, 2, 2, 2, 4]})
print(df)
a b c d
0 4 4 1 4
1 2 3 2 3
2 4 1 4 2
3 5 2 5 2
4 5 2 5 2
5 4 4 3 4
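df_a = df.iloc[:, :2]  # first half of the columns
df_b = df.iloc[:, 2:]  # second half of the columns
df_b.columns = df_a.columns  # align the labels so the subtraction is element-wise
c = df_b - df_a
c = c != 0  # True where the two halves differ
df_a = df_a.mask(c)  # differing entries become NaN
# Positions of the rows with at least one mismatch (Series.nonzero() no longer exists in recent pandas).
a = pd.isnull(df_a).any(axis=1).to_numpy().nonzero()[0]
df = df.drop(df.index[a])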
Output:
a b c d
1 2 3 2 3
3 5 2 5 2
4 5 2 5 2
In your case:
df_a = df.iloc[:,:16]
df_b = df.iloc[:,16:]
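The subtraction step above needs numeric columns, while the data in the question contains fractions stored as text (for example 106/243). As a minimal sketch, assuming plain equality of the cell values is what is wanted, the two halves can instead be compared directly. The function name rows_with_equal_halves and its half parameter are hypothetical, introduced only for this illustration:

```python
import pandas as pd


def rows_with_equal_halves(df, half):
    """Return the rows whose first `half` columns equal the next `half` columns."""
    left = df.iloc[:, :half].to_numpy()
    right = df.iloc[:, half:2 * half].to_numpy()
    # Compare cell by cell, then keep a row only if every pair matches.
    mask = (left == right).all(axis=1)
    return df[mask]


df = pd.DataFrame({'a': [4, 2, 4, 5, 5, 4],
                   'b': [4, 3, 1, 2, 2, 4],
                   'c': [1, 2, 4, 5, 5, 3],
                   'd': [4, 3, 2, 2, 2, 4]})
result = rows_with_equal_halves(df, 2)
print(result)
```

For the 32-column data this would be called as rows_with_equal_halves(data, 16). Because no arithmetic is involved, the comparison also works when the cells are strings.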
Thanks for the answers. For one reason or another neither of them worked for me, but they were useful. I found a solution; it is old-fashioned and not very elegant, but it works:
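import pandas as pd

data = pd.read_csv('enummixed.txt', header=None, low_memory=False)
data = data.drop(data.columns[[0]], axis=1)

# Open the output file once instead of reopening it for every matching row.
with open("symmetric_ne.txt", "a") as out:
    for i in range(len(data)):
        k = 0  # number of matching pairs found in row i
        # range(16) covers all 16 pairs; range(0, 15) would skip the last one.
        for j in range(16):
            if data.iloc[i, j] == data.iloc[i, j + 16]:
                k += 1
        if k == 16:
            print(data.iloc[i], file=out)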
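The same selection can probably be done without the explicit double loop. The sketch below is only an illustration under stated assumptions: the in-memory sample is a hypothetical stand-in for enummixed.txt (one leading column plus 4 data columns instead of 32), and the cells are compared as text. Reading with dtype=str is a choice made here, not something from the original code; it avoids a column that contains only 0s being parsed as integers and then never matching the string '0' in its partner column.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'enummixed.txt': a leading label column followed
# by 4 data columns (the real file has 32 after dropping the first).
sample = io.StringIO(
    "r0,1,0,1,0\n"
    "r1,106/243,137/243,106/243,137/243\n"
    "r2,0,395/543,395/543,0\n"
)

# dtype=str keeps every cell as text, so '0' is compared with '0'.
data = pd.read_csv(sample, header=None, dtype=str)
data = data.drop(data.columns[0], axis=1)

half = data.shape[1] // 2
left = data.iloc[:, :half].to_numpy()
right = data.iloc[:, half:].to_numpy()
# Keep the rows in which every cell of the first half equals its partner.
symmetric = data[(left == right).all(axis=1)]

# Write all matching rows in one call instead of once per row.
out = io.StringIO()
symmetric.to_csv(out, header=False)
print(out.getvalue())
```

With the real file, the path 'enummixed.txt' would replace sample and a call such as symmetric.to_csv('symmetric_ne.txt', header=False) would replace the in-memory buffer. Note that text comparison treats 1/2 and 2/4 as different values; if the fractions are not already in lowest terms, they would have to be parsed first.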