How to query between the values of two columns of a data frame
Suppose I have a data frame with the following columns:
df.head()
ref_loc ref_chr REF ALT coverage base
9532728 21 G [A] 1 A
9540473 21 C [G] 2 G
9540473 21 CTATT [C] 2 G
9540794 21 C [T] 1 A
9542965 21 C [A] 1 T
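I want to compare the ALT column with the base column to find matches and mismatches. Based on the result, I want to generate a new column named cate.
To do this, I tried the following function: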
def grouping(row):
    if row['ALT'] == row['base']:
        val = "same_variants"
    elif row['ALT'] != row['base']:
        val = "diff_variants"
    return val

df["cate"] = df.apply(grouping, axis=0)
However, applying the function to the dataframe throws this error:
KeyError Traceback (most recent call last)
<ipython-input-13-a265dee72ec1> in <module>
----> 1 df["group"] =df.apply(grouping,axis=0)
~/software/anaconda/lib/python3.7/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
6911 kwds=kwds,
6912 )
-> 6913 return op.get_result()
6914
6915 def applymap(self, func):
~/software/anaconda/lib/python3.7/site-packages/pandas/core/apply.py in get_result(self)
184 return self.apply_raw()
185
--> 186 return self.apply_standard()
187
188 def apply_empty_result(self):
~/software/anaconda/lib/python3.7/site-packages/pandas/core/apply.py in apply_standard(self)
290
291 # compute the result using the series generator
--> 292 self.apply_series_generator()
293
294 # wrap results
~/software/anaconda/lib/python3.7/site-packages/pandas/core/apply.py in apply_series_generator(self)
319 try:
320 for i, v in enumerate(series_gen):
--> 321 results[i] = self.f(v)
322 keys.append(v.name)
323 except Exception as e:
<ipython-input-11-098066170c2f> in grouping(row)
1 def grouping(row):
----> 2 if row['ALT'] == row['base']:
3 val = "same_variants"
4 elif row['ALT'] != row['base']:
5 val= "diff_variants"
~/software/anaconda/lib/python3.7/site-packages/pandas/core/series.py in __getitem__(self, key)
1066 key = com.apply_if_callable(key, self)
1067 try:
-> 1068 result = self.index.get_value(self, key)
1069
1070 if not is_scalar(result):
~/software/anaconda/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
4728 k = self._convert_scalar_indexer(k, kind="getitem")
4729 try:
-> 4730 return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
4731 except KeyError as e1:
4732 if len(self) > 0 and (self.holds_integer() or self.is_boolean()):
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index_class_helper.pxi in pandas._libs.index.Int64Engine._check_type()
KeyError: ('ALT', 'occurred at index ref_loc')
I would appreciate any suggestions on how to proceed.
In the end, the output should look like this:
ref_loc ref_chr REF ALT coverage base cate
9532728 21 G [A] 1 A same_variants
9540473 21 C [G] 2 G same_variants
9540473 21 CTATT [C] 2 G diff_variants
9540794 21 C [T] 1 A diff_variants
9542965 21 C [A] 1 T diff_variants
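Note that because the values in the ALT column are wrapped in square brackets, the comparison with base will always come out as different. You can extract the content inside the brackets first: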
df["ALT"] = df.ALT.apply(lambda l: l[0])
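The line above assumes each ALT value is a Python list such as ['A']. If the column was read from a file, the values may instead be strings such as "[A]", in which case l[0] would return "[" rather than the allele. A minimal sketch that handles both cases follows; the helper name first_alt, the ALT_first column, and the sample frame are hypothetical, and quoted string forms such as "['A']" are not handled.

```python
import pandas as pd

def first_alt(value):
    """Return the first allele from a list like ['A'] or a string like '[A]'."""
    if isinstance(value, (list, tuple)):
        return value[0]
    # Assumed string form "[A]" or "[A, T]": drop the brackets, keep the first item.
    return str(value).strip("[]").split(",")[0].strip()

# Made-up sample mixing both representations.
df = pd.DataFrame({"ALT": [["A"], "[G]", ["C", "T"]], "base": ["A", "G", "G"]})
df["ALT_first"] = df["ALT"].apply(first_alt)
```

Checking type(df["ALT"].iloc[0]) on the real data shows which of the two cases applies.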
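You need to use axis=1 to iterate over rows. With axis=0, apply iterates over columns, so the function receives a whole column whose index holds the row labels, and row['ALT'] raises a KeyError.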
df["cate"] = df.apply(grouping,axis=1)
print(df)
ref_loc ref_chr REF ALT coverage base cate
0 9532728 21 G A 1 A same_variants
1 9540473 21 C G 2 G same_variants
2 9540473 21 CTATT C 2 G diff_variants
3 9540794 21 C T 1 A diff_variants
4 9542965 21 C A 1 T diff_variants
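To see why the axis argument matters, the sketch below records the index labels of the Series that apply hands to the function in each mode. The two-row frame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"ALT": ["A", "G"], "base": ["A", "T"]})

# axis=0: the function receives one column at a time, indexed by row labels.
cols_view = df.apply(lambda s: ",".join(map(str, s.index)), axis=0)

# axis=1: the function receives one row at a time, indexed by column names.
rows_view = df.apply(lambda s: ",".join(map(str, s.index)), axis=1)

print(cols_view)
print(rows_view)
```

With axis=0 the labels are the row labels (0 and 1 here), so a lookup such as row['ALT'] has nothing to find; with axis=1 the labels are ALT and base, which is what grouping expects.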
You need to apply the function to each row:
df["cate"] = df.apply(grouping, axis=1)
If I understand correctly, the ALT column contains lists, so you need to access the first element of each list:
def grouping(row):
    if row['ALT'][0] == row['base']:
        return "same_variants"
    else:
        return "diff_variants"
Alternatively, you can use numpy's where function:
df['cate'] = np.where(df['ALT'].str[0]==df['base'], 'same_variants', 'diff_variants')
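For reference, here is a self-contained version of that one-liner, assuming ALT holds one-element lists as in the question's sample (the frame below is rebuilt by hand from those five rows). The .str[0] accessor takes the first element of each list; if ALT held strings like "[A]" it would return "[" instead.

```python
import numpy as np
import pandas as pd

# Sample rows rebuilt from the question, with ALT assumed to hold lists.
df = pd.DataFrame({
    "ALT": [["A"], ["G"], ["C"], ["T"], ["A"]],
    "base": ["A", "G", "G", "A", "T"],
})

# Element-wise comparison of the first ALT entry with base, no Python-level loop.
first_alt = df["ALT"].str[0]
df["cate"] = np.where(first_alt == df["base"], "same_variants", "diff_variants")
```

Because the comparison is vectorized, this usually scales better than apply with axis=1 on large frames.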
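Although this is a different approach, I think it is worth mentioning: you can do it with the following one-liner, which assumes ALT holds strings such as "[A]":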
df['cate'] = np.where(df['ALT'] == '['+df['base']+']', 'same_variants', 'diff_variants')
I tried using format on the right-hand side of the comparison, but to no avail.
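One likely reason is that str.format builds a single string and does not work element-wise over a Series. A possible workaround, sketched here under the assumption that ALT holds strings such as "[A]", is to map the format call over the base column. The sample frame and the name bracketed are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ALT": ["[A]", "[G]", "[C]"], "base": ["A", "G", "G"]})

# Apply "[{}]".format to each base value so it matches ALT's bracketed form.
bracketed = df["base"].map("[{}]".format)
df["cate"] = np.where(df["ALT"] == bracketed, "same_variants", "diff_variants")
```

This gives the same result as concatenating '[' + df['base'] + ']'; the concatenation form is shorter, while map with format is easier to adapt to more complex patterns.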