Say I have a data frame with the following columns:
df.head()
ref_loc  ref_chr    REF  ALT  coverage  base
9532728       21      G  [A]         1     A
9540473       21      C  [G]         2     G
9540473       21  CTATT  [C]         2     G
9540794       21      C  [T]         1     A
9542965       21      C  [A]         1     T
I want to compare the column ALT with the column base and see where they match and where they differ. Based on that, I want to generate a new column called cate.
For that, I tried to use the following function,
def grouping(row):
    if row['ALT'] == row['base']:
        val = "same_variants"
    elif row['ALT'] != row['base']:
        val = "diff_variants"
    return val
df["cate"] = df.apply(grouping,axis=0)
However, applying the function to the data frame throws this error:
KeyError Traceback (most recent call last)
<ipython-input-13-a265dee72ec1> in <module>
----> 1 df["group"] =df.apply(grouping,axis=0)
~/software/anaconda/lib/python3.7/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
6911 kwds=kwds,
6912 )
-> 6913 return op.get_result()
6914
6915 def applymap(self, func):
~/software/anaconda/lib/python3.7/site-packages/pandas/core/apply.py in get_result(self)
184 return self.apply_raw()
185
--> 186 return self.apply_standard()
187
188 def apply_empty_result(self):
~/software/anaconda/lib/python3.7/site-packages/pandas/core/apply.py in apply_standard(self)
290
291 # compute the result using the series generator
--> 292 self.apply_series_generator()
293
294 # wrap results
~/software/anaconda/lib/python3.7/site-packages/pandas/core/apply.py in apply_series_generator(self)
319 try:
320 for i, v in enumerate(series_gen):
--> 321 results[i] = self.f(v)
322 keys.append(v.name)
323 except Exception as e:
<ipython-input-11-098066170c2f> in grouping(row)
1 def grouping(row):
----> 2 if row['ALT'] == row['base']:
3 val = "same_variants"
4 elif row['ALT'] != row['base']:
5 val= "diff_variants"
~/software/anaconda/lib/python3.7/site-packages/pandas/core/series.py in __getitem__(self, key)
1066 key = com.apply_if_callable(key, self)
1067 try:
-> 1068 result = self.index.get_value(self, key)
1069
1070 if not is_scalar(result):
~/software/anaconda/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
4728 k = self._convert_scalar_indexer(k, kind="getitem")
4729 try:
-> 4730 return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
4731 except KeyError as e1:
4732 if len(self) > 0 and (self.holds_integer() or self.is_boolean()):
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index_class_helper.pxi in pandas._libs.index.Int64Engine._check_type()
KeyError: ('ALT', 'occurred at index ref_loc')
I would appreciate suggestions on how to move forward.
In the end, the output should look like the following:
ref_loc  ref_chr    REF  ALT  coverage  base           cate
9532728       21      G  [A]         1     A  same_variants
9540473       21      C  [G]         2     G  same_variants
9540473       21  CTATT  [C]         2     G  diff_variants
9540794       21      C  [T]         1     A  diff_variants
9542965       21      C  [A]         1     T  diff_variants
Note that because the values in the ALT column are wrapped in square brackets, the comparison with base will always come out as different. You can first extract what is inside the brackets:
df["ALT"] = df.ALT.apply(lambda l: l[0])
You need to use axis=1 so that the function is applied to each row; axis=0 applies it to each column.
df["cate"] = df.apply(grouping,axis=1)
print(df)
   ref_loc  ref_chr    REF  ALT  coverage  base           cate
0  9532728       21      G    A         1     A  same_variants
1  9540473       21      C    G         2     G  same_variants
2  9540473       21  CTATT    C         2     G  diff_variants
3  9540794       21      C    T         1     A  diff_variants
4  9542965       21      C    A         1     T  diff_variants
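To make the axis distinction concrete, here is a minimal sketch on a made-up two-row frame. The data, and the assumption that ALT holds one-element lists, are mine rather than taken from the original data set. It checks whether the label 'ALT' is visible to the applied function under each axis setting.

```python
import pandas as pd

# Hypothetical stand-in for the frame in the question;
# ALT is assumed to hold one-element lists.
df = pd.DataFrame({
    "ref_loc": [9532728, 9540473],
    "ALT": [["A"], ["G"]],
    "base": ["A", "G"],
})

# axis=0: the function receives one column at a time, so the Series
# it gets is indexed by row labels (0, 1), not by column names.
has_alt_axis0 = df.apply(lambda s: "ALT" in s.index, axis=0)

# axis=1: the function receives one row at a time, so the Series
# is indexed by column names and s["ALT"] can be looked up.
has_alt_axis1 = df.apply(lambda s: "ALT" in s.index, axis=1)

print(has_alt_axis0)
print(has_alt_axis1)
```

With axis=0 the first Series handed to grouping is the ref_loc column, which has no 'ALT' label, and that is consistent with the "occurred at index ref_loc" part of the KeyError.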
You need to apply your function to each row:
df["cate"] = df.apply(grouping, axis=1)
If I understand correctly, the ALT column contains lists, so you need to access the first element of each list:
def grouping(row):
    if row['ALT'][0] == row['base']:
        return "same_variants"
    else:
        return "diff_variants"
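Since the question does not show whether ALT holds real lists (['A']) or strings that merely look like lists ('[A]'), here is a hedged sketch that copes with both. The helper first_alt and the sample frame are hypothetical, introduced only for illustration, and the string branch assumes a bracketed, comma-separated format.

```python
import pandas as pd

def first_alt(alt):
    """Return the first ALT allele from a list like ['A'] or a string like '[A]'."""
    if isinstance(alt, str):
        # Assumption: the string form is a bracketed, comma-separated list.
        return alt.strip("[]").split(",")[0].strip()
    return alt[0]

def grouping(row):
    # Compare the first ALT allele with the observed base for this row.
    if first_alt(row["ALT"]) == row["base"]:
        return "same_variants"
    return "diff_variants"

# Made-up data mixing both representations of ALT.
df = pd.DataFrame({
    "ALT": [["A"], ["G"], ["C"], "[T]", "[A]"],
    "base": ["A", "G", "G", "A", "T"],
})

# axis=1 passes one row at a time to grouping.
df["cate"] = df.apply(grouping, axis=1)
print(df)
```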
Alternatively, you can use the numpy function where:
import numpy as np

df['cate'] = np.where(df['ALT'].str[0] == df['base'], 'same_variants', 'diff_variants')
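As a runnable sketch of the vectorized version, assuming again that ALT holds one-element lists (the sample frame below is made up): .str[0] indexes into each element, so on lists it returns the first item, whereas on strings such as '[A]' it would return the '[' character instead.

```python
import numpy as np
import pandas as pd

# Made-up data; ALT is assumed to hold one-element lists.
df = pd.DataFrame({
    "ALT": [["A"], ["G"], ["C"], ["T"], ["A"]],
    "base": ["A", "G", "G", "A", "T"],
})

# .str[0] takes the first element of each list in the column.
first = df["ALT"].str[0]

# np.where picks a label per row from the boolean comparison, without a Python-level loop over rows.
df["cate"] = np.where(first == df["base"], "same_variants", "diff_variants")
print(df)
```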
Even though this is a different approach, I think it's worth saying: you can implement that with this one-liner:
df['cate'] = np.where(df['ALT'] == '['+df['base']+']', 'same_variants', 'diff_variants')
I tried to use format on the right-hand side of the comparison, but to no avail.
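The attempt with format is not shown, so the following is only a guess at what went wrong: str.format works on a single string, so calling it with a whole Series formats the Series as one value rather than element by element. One way around that, sketched here on made-up data where ALT is assumed to be stored as strings like '[A]', is to map the bound format method over the column.

```python
import numpy as np
import pandas as pd

# Made-up data; ALT is assumed to be stored as bracketed strings.
df = pd.DataFrame({
    "ALT": ["[A]", "[G]", "[C]", "[T]", "[A]"],
    "base": ["A", "G", "G", "A", "T"],
})

# Series.map calls "[{}]".format once per element, giving "[A]", "[G]", ...
bracketed = df["base"].map("[{}]".format)

df["cate"] = np.where(df["ALT"] == bracketed, "same_variants", "diff_variants")
print(df)
```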