I have a DataFrame kinda like this:
| index | col_1 | col_2 |
| --- | --- | --- |
| 0 | A | 11 |
| 1 | B | 12 |
| 2 | B | 12 |
| 3 | C | 13 |
| 4 | C | 13 |
| 5 | C | 14 |
where col_1 and col_2 may not always be one-to-one due to corrupt data. How can I use Pandas to determine which rows have col_1 and col_2 entries that violate this one-to-one relationship? In this case it would be the last three rows, since C can map to either 13 or 14.
You could use a transform, counting the number of unique values in each group. First take the subset of just these columns, and then group by a single column:
In [11]: g = df[['col_1', 'col_2']].groupby('col_1')
In [12]: counts = g.transform(lambda x: len(x.unique()))
In [13]: counts
Out[13]:
col_2
0 1
1 1
2 1
3 2
4 2
5 2
Rows where every count equals 1 respect the one-to-one relationship; the rest violate it:
In [14]: (counts == 1).all(axis=1)
Out[14]:
0 True
1 True
2 True
3 False
4 False
5 False
dtype: bool
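Putting the steps above together, a runnable sketch that selects the violating rows themselves (using the question's col_1/col_2 names) inverts the boolean mask and indexes back into the frame:

```python
import pandas as pd

df = pd.DataFrame({'col_1': ['A', 'B', 'B', 'C', 'C', 'C'],
                   'col_2': [11, 12, 12, 13, 13, 14]})

# Count distinct col_2 values within each row's col_1 group.
counts = df[['col_1', 'col_2']].groupby('col_1').transform(lambda x: len(x.unique()))

# Rows where any per-group count exceeds 1 violate the one-to-one relationship.
violations = df[~(counts == 1).all(axis=1)]
print(violations)
```

As expected for this data, the mask picks out the last three rows, because the group C maps to two distinct values.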
I tested g.transform(lambda x: len(x.unique())); it works, but it is slow, especially when there are a lot of groups. The code below is much faster, so I am posting it here.
# Count each distinct (col_1, col_2) pair.
df2 = df[['col_1', 'col_2']].groupby(['col_1', 'col_2']).size().to_frame('count')
df2.reset_index(inplace=True)
# Count how many distinct pairs each col_1 value appears in.
df3 = df2.groupby('col_1').size().to_frame('count')
# col_1 values that map to more than one col_2 break the relationship.
df4 = df3[df3['count'] > 1]
df_copy = df.copy()
df_copy.set_index('col_1', inplace=True)
df_outlier = df_copy.loc[df4.index]  # .ix is removed in modern pandas; use .loc
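The same pair-counting idea can be collapsed into a single groupby transform using the built-in 'nunique' aggregation, which avoids the Python-level lambda entirely. A sketch, assuming the question's column names:

```python
import pandas as pd

df = pd.DataFrame({'col_1': ['A', 'B', 'B', 'C', 'C', 'C'],
                   'col_2': [11, 12, 12, 13, 13, 14]})

# Number of distinct col_2 values in each row's col_1 group.
n_unique = df.groupby('col_1')['col_2'].transform('nunique')

# Groups with more than one distinct col_2 violate the mapping.
df_outlier = df[n_unique > 1]
print(df_outlier)
```

Passing the string 'nunique' lets pandas use its optimized aggregation path rather than calling back into Python once per group.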
I would use a collections.Counter, because more than one instance of an item in a column violates the one-to-one mapping:
>>> import pandas
>>> import numpy
>>> import collections
>>> df = pandas.DataFrame(numpy.array([['a', 1],['b', 2], ['b', 3], ['c', 3]]))
>>> df
0 1
0 a 1
1 b 2
2 b 3
3 c 3
>>> collections.Counter(df[0])
Counter({'b': 2, 'a': 1, 'c': 1})
>>> violations1 = [k for k, v in collections.Counter(df[0]).items() if v > 1]
>>> violations2 = [k for k, v in collections.Counter(df[1]).items() if v > 1]
>>> violations1
['b']
>>> violations2
['3']
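One caveat with counting raw column values: exact duplicate rows (such as the question's two B/12 pairs) do not actually break the mapping, so it may be safer to de-duplicate the pairs before counting. A sketch along the same lines:

```python
import collections
import pandas

df = pandas.DataFrame([['a', 1], ['b', 2], ['b', 2], ['b', 3], ['c', 3]])

# Drop exact duplicate pairs first: repeated identical rows are not violations.
pairs = df.drop_duplicates()

# Any key or value still appearing more than once breaks the one-to-one mapping.
violations1 = [k for k, v in collections.Counter(pairs[0]).items() if v > 1]
violations2 = [k for k, v in collections.Counter(pairs[1]).items() if v > 1]
```

Here the repeated ('b', 2) pair is collapsed before counting, so only 'b' (mapping to both 2 and 3) and 3 (mapped from both 'b' and 'c') are flagged.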
I'm super new to Python, but I found a way to do it by gathering all the unique pairings into a list and filtering out the ones that were not uniquely mapped:
data = pd.DataFrame({'Col_1': ['A', 'B', 'B', 'C', 'C', 'C'], 'Col_2': [11, 12, 12, 13, 13, 14]})
combos = []
for x in range(len(data['Col_1'])):
    combo = '%s_%s' % (data['Col_1'][x], data['Col_2'][x])
    combos.append(combo)
data.index = data['Col_1']
for item in combos:
    if len([comb for comb in combos if item[2:] in comb[2:]]) != len([comb for comb in combos if item[0] in comb[0]]):
        # errors='ignore' so a label already dropped on an earlier pass is skipped
        data = data.drop(item[0], errors='ignore')
data = data.reset_index(drop=True)
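The same idea, keeping only keys whose unique pairings map one-to-one, can be written without string keys by combining drop_duplicates with duplicated. A sketch, not part of the original answer:

```python
import pandas as pd

data = pd.DataFrame({'Col_1': ['A', 'B', 'B', 'C', 'C', 'C'],
                     'Col_2': [11, 12, 12, 13, 13, 14]})

# Unique (Col_1, Col_2) combinations.
pairs = data.drop_duplicates()

# A key appearing in more than one distinct pair breaks the one-to-one mapping.
bad_keys = pairs.loc[pairs.duplicated('Col_1', keep=False), 'Col_1']

# Keep only rows whose Col_1 maps uniquely.
clean = data[~data['Col_1'].isin(bad_keys)].reset_index(drop=True)
print(clean)
```

For the question's data this drops the three C rows and keeps A and the two (identical, hence valid) B rows.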