Python pandas return value from other column
I have a file "specieslist.txt" that contains the following information:
Bacillus,genus
Borrelia,genus
Burkholderia,genus
Campylobacter,genus
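Now I want Python to look up a variable in the first column (in this example "Campylobacter") and return the value from the second column ("genus"). I wrote the following code: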
import csv
import pandas as pd
species_import = 'Campylobacter'
df = pd.read_csv('specieslist.txt', header=None, names = ['species', 'level'] )
input = df.loc[df['species'] == species_import]
print (input['level'])
However, my code returns too much, when all I want is "genus":
3 genus
Name: level, dtype: object
You can select the first value of the Series with iat:
species_import = 'Campylobacter'
out = df.loc[df['species'] == species_import, 'level'].iat[0]
#alternative
#out = df.loc[df['species'] == species_import, 'level'].values[0]
print (out)
genus
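To make the difference between the Series and its first value concrete, here is a minimal self-contained sketch. It assumes the four rows from the question and reads them from an in-memory `io.StringIO` buffer instead of `specieslist.txt`; the names `matches` and `empty` are only illustrative.

```python
import io
import pandas as pd

# In-memory stand-in for specieslist.txt (same four rows as in the question).
data = io.StringIO(
    "Bacillus,genus\nBorrelia,genus\nBurkholderia,genus\nCampylobacter,genus\n"
)
df = pd.read_csv(data, header=None, names=['species', 'level'])

# Selecting with .loc yields a Series, which prints with index, name and dtype.
matches = df.loc[df['species'] == 'Campylobacter', 'level']
print(type(matches).__name__)

# .iat[0] picks the value at position 0, so only the string itself is printed.
out = matches.iat[0]
print(out)

# An empty selection has no position 0, so .iat[0] raises IndexError.
empty = df.loc[df['species'] == 'aaa', 'level']
try:
    empty.iat[0]
except IndexError:
    print('no match')
```

Note that `.iat[0]` silently ignores any further matches, so this fits the case where the species names are expected to be unique.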
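A more robust solution also handles the case where no value matches and the selection is an empty Series; it then returns 'no match'.
As @jpp commented, this solution is only better when you have a large Series and the matching value is expected to be near the top: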
species_import = 'Campylobacter'
out = next(iter(df.loc[df['species'] == species_import, 'level']), 'no match')
print (out)
genus
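The no-match case is not shown above, so here is a small sketch of it. The helper name `first_level` and the two-row DataFrame are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'species': ['Bacillus', 'Campylobacter'],
                   'level': ['genus', 'genus']})

def first_level(df, species, default='no match'):
    """Return the first matching 'level', or `default` if nothing matches."""
    matches = df.loc[df['species'] == species, 'level']
    # next() returns the default when the iterator is exhausted immediately,
    # which is exactly what happens for an empty Series.
    return next(iter(matches), default)

hit = first_level(df, 'Campylobacter')
miss = first_level(df, 'aaa')
print(hit)
print(miss)
```

The boolean mask `df['species'] == species` is still evaluated over the whole column before `next` runs, so this form mainly buys a clean default value rather than an early exit.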
EDIT:
An idea from the comments, thanks @jpp:
def get_first_val(val):
    try:
        return df.loc[df['species'] == val, 'level'].iat[0]
    except IndexError:
        return 'no match'
print (get_first_val(species_import))
genus
print (get_first_val('aaa'))
no match
EDIT:
import numpy as np
df = pd.DataFrame({'species':['a'] * 10000 + ['b'], 'level':np.arange(10001)})
def get_first_val(val):
    try:
        return df.loc[df['species'] == val, 'level'].iat[0]
    except IndexError:
        return 'no match'
In [232]: %timeit next(iter(df.loc[df['species'] == 'a', 'level']), 'no match')
1.3 ms ± 33.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [233]: %timeit (get_first_val('a'))
1.1 ms ± 21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [235]: %timeit (get_first_val('b'))
1.48 ms ± 206 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [236]: %timeit next(iter(df.loc[df['species'] == 'b', 'level']), 'no match')
1.24 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Performance of various methods, to demonstrate when it is useful to use next(...):
import numpy as np
import pandas as pd
n = 10**6
df = pd.DataFrame({'species': ['b']+['a']*n, 'level': np.arange(n+1)})
def get_first_val(val):
    try:
        return df.loc[df['species'] == val, 'level'].iat[0]
    except IndexError:
        return 'no match'
%timeit next(iter(df.loc[df['species'] == 'b', 'level']), 'no match') # 123 ms per loop
%timeit get_first_val('b') # 125 ms per loop
%timeit next(idx for idx, val in enumerate(df['species']) if val == 'b') # 20.3 µs per loop
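The generator expression in the last timing yields the position of the first match, not the value in 'level'. A sketch of how that position could be turned into the wanted value follows; the helper name `first_level_lazy` and the smaller `n` are assumptions chosen so the example runs quickly.

```python
import numpy as np
import pandas as pd

n = 1000  # much smaller than the benchmark above, just for illustration
df = pd.DataFrame({'species': ['b'] + ['a'] * n, 'level': np.arange(n + 1)})

def first_level_lazy(df, species, default='no match'):
    """Scan 'species' row by row and stop at the first match."""
    pos = next(
        (i for i, value in enumerate(df['species']) if value == species),
        None,
    )
    if pos is None:
        return default
    # Use the position to read the corresponding 'level' value.
    return df['level'].iat[pos]

print(first_level_lazy(df, 'b'))    # match on the very first row
print(first_level_lazy(df, 'zzz'))  # no match: scans the whole column
```

The early exit only pays off when the match sits near the top. For a late or missing match the Python-level loop visits every row and is likely to be slower than the vectorized mask.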
get
Using pandas.Series.get, you get a scalar value back if 'species' is unique, or a pandas.Series if it is not.
f = df.set_index('species').level.get
f('Campylobacter')
'genus'
If the key is not in the data, you can provide a default value:
f('X', 'Not In Data')
'Not In Data'
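The unique versus non-unique behaviour can be checked with a short sketch. The two small DataFrames below, `unique` and `dupes`, are made-up data for illustration only.

```python
import pandas as pd

unique = pd.DataFrame({'species': ['Bacillus', 'Campylobacter'],
                       'level': ['genus', 'genus']})
dupes = pd.DataFrame({'species': ['Bacillus', 'Bacillus'],
                      'level': ['genus', 'family']})

# Unique index: .get returns the scalar stored under that label.
scalar = unique.set_index('species')['level'].get('Campylobacter')

# Duplicated index: .get returns a Series holding every matching row.
series = dupes.set_index('species')['level'].get('Bacillus')

# Missing label: .get falls back to the supplied default.
missing = unique.set_index('species')['level'].get('X', 'Not In Data')

print(scalar)
print(list(series))
print(missing)
```

Because the return type depends on the data, code that needs a scalar in every case may want to check for uniqueness first, for example with `df['species'].is_unique`.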
We can also use dict.get, which only ever returns scalars. If the keys are not unique, this returns the last one.
f = dict(zip(df.species, df.level)).get
If you want to return the first one instead, you can do this in a couple of ways:
f = dict(zip(df.species[::-1], df.level[::-1])).get
or
f = df.drop_duplicates('species').pipe(
    lambda d: dict(zip(d.species, d.level)).get
)
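The three dict-based lookups can be compared side by side on a tiny frame with a duplicated key. The data and the names `last`, `first_rev` and `first_dedup` are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'species': ['a', 'a', 'b'], 'level': [1, 2, 3]})

# Later pairs overwrite earlier ones, so the last 'a' wins.
last = dict(zip(df.species, df.level)).get

# Reversing both columns makes the first 'a' the one written last.
first_rev = dict(zip(df.species[::-1], df.level[::-1])).get

# drop_duplicates keeps the first occurrence of each species by default.
first_dedup = df.drop_duplicates('species').pipe(
    lambda d: dict(zip(d.species, d.level)).get
)

print(last('a'), first_rev('a'), first_dedup('a'))
print(last('x', 'Not In Data'))
```

Building the dict costs one pass over the data, after which every lookup is a plain hash lookup, so this approach suits repeated queries against the same table.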
# Change the last line of your code to
print(input['level'].values)
# For an explanation, see the code below
import csv
import pandas as pd
species_import = 'Campylobacter'
df = pd.read_csv('specieslist.txt', header=None, names = ['species', 'level'] )
input = df['species'] == species_import # a boolean pandas Series (a mask), not a DataFrame
print(type(df[input])) # indexing with the mask gives a pandas DataFrame
print(type(df[input]['level'])) # selecting one column gives a pandas Series
# To obtain the values from this Series:
print(df[input]['level'].values) # prints an array containing 'genus'
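Since `.values` gives back an array rather than a bare string, one more indexing step is needed to get the scalar itself. A minimal sketch, assuming a two-row in-memory stand-in for `specieslist.txt` and using `mask` as a name that does not shadow the built-in `input`:

```python
import io
import pandas as pd

# In-memory stand-in for specieslist.txt, shortened to two rows.
data = io.StringIO("Bacillus,genus\nCampylobacter,genus\n")
df = pd.read_csv(data, header=None, names=['species', 'level'])

mask = df['species'] == 'Campylobacter'  # boolean Series
values = df.loc[mask, 'level'].values    # one-element array

print(values)     # still wrapped in an array
print(values[0])  # the scalar on its own
```

As with `.iat[0]`, `values[0]` raises an IndexError when nothing matches, so the lookup may need a guard for unknown species.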