Selecting data using pandas
I have a large catalog that I am selecting data from according to the following criteria:
import pandas as pd

columns = ["System", "rp", "mp", "logg"]
catalog = pd.read_csv('data.txt', skiprows=1, sep=r'\s+', names=columns)
# CUTS
i = (catalog.rp != -1) & (catalog.mp != -1)
new_catalog = pd.DataFrame(catalog[i])
print("{0} targets after cuts".format(len(new_catalog)))
When I perform the above cuts the code works fine. Next, I want to add one more cut: I want to select all the targets that have 4.0 < logg < 5.0. However, some of the targets have logg = -1 (which means the value is not available). Luckily, I can calculate logg from the other available parameters. So here are my updated cuts:
# CUTS
i = (catalog.rp != -1) & (catalog.mp != -1)
if catalog.logg[i] == -1:
    catalog.logg[i] = catalog.mp[i] / catalog.rp[i]
i &= (4 <= catalog.logg) & (catalog.logg <= 5)
However, I am receiving an error:
    if catalog.logg[i] == -1:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Can someone please explain what I am doing wrong and how I can fix it? Thank you.
My dataframe looks like the following:
Data columns:
System 477 non-null values
rp 477 non-null values
mp 477 non-null values
logg 477 non-null values
dtypes: float64(37), int64(3), object(3)None
System rp mp logg FeH FeHu FeHl Mstar Mstaru Mstarl
0 target-01 5196 24 24 0.31 0.04 0.04 0.905 0.015 0.015
1 target-02 5950 150 150 -0.30 0.25 0.25 0.950 0.110 0.110
2 target-03 5598 50 50 0.04 0.05 0.05 0.997 0.049 0.049
3 target-04 6558 44 -1 0.14 0.04 0.04 1.403 0.061 0.061
4 target-05 6190 60 60 0.05 0.07 0.07 1.194 0.049 0.050
....
[5 rows x 43 columns]
My code, in a form that I understand, should be:
selected = []
for row in range(len(catalog)):
    parameter = catalog['logg'][row]
    if parameter == -1:
        parameter = catalog['mp'][row] / catalog['rp'][row]
    if parameter > 4.0 and parameter < 5.0:
        # select this row for further analysis
        selected.append(row)
However, I am trying to write my code in a simpler and more professional way. I don't want to use the for loop. How can I do it?
Consider the following small example:
System rp mp logg
target-01 2 -1 2 # will NOT be selected since mp = -1
target-02 -1 3 4 # will NOT be selected since rp = -1
target-03 7 6 4.3 # will be selected since mp != -1, rp != -1, and 4 < logg < 5
target-04 3.2 15 -1 # will be selected since mp != -1, rp != -1, logg = mp / rp = 15/3.2 = 4.68 (which is between 4 and 5)
You get the error because catalog.logg[i] is not a scalar but a Series, so an if statement cannot evaluate it as a single True or False. Use a vectorized operation instead:
catalog.loc[i,'logg'] = catalog.loc[i,'mp']/catalog.loc[i,'rp']
which modifies the logg column in place.
As for edit 3:
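To make the difference concrete, here is a minimal sketch on a made-up three-row catalog (the values and the names valid, missing and raised are my own, not from the question). It first reproduces the ambiguity error, then does the replacement with a boolean mask that is narrowed to the rows where logg is actually missing:

```python
import pandas as pd

# Hypothetical three-row catalog, invented for illustration only.
catalog = pd.DataFrame({
    "rp": [7.0, 3.2, -1.0],
    "mp": [6.0, 15.0, 3.0],
    "logg": [4.3, -1.0, 4.0],
})

valid = (catalog.rp != -1) & (catalog.mp != -1)

# Comparing a Series gives one boolean per row, so `if` cannot reduce it
# to a single truth value and pandas raises ValueError.
try:
    if catalog.logg[valid] == -1:
        pass
    raised = False
except ValueError:
    raised = True

# Vectorized replacement: only rows that pass the rp/mp cut AND have
# logg == -1 are overwritten with mp / rp.
missing = valid & (catalog.logg == -1)
catalog.loc[missing, "logg"] = catalog.loc[missing, "mp"] / catalog.loc[missing, "rp"]
```

Rows that already have a logg value are left alone, because they are not part of the missing mask.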
rows=catalog.loc[(catalog.logg > 4) & (catalog.logg < 5)]
which will select the rows that satisfy the condition.
Instead of this code:
if catalog.logg[i] == -1:
    catalog.logg[i] = catalog.mp[i] / catalog.rp[i]
You could use the following:
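i &= catalog.logg == -1
catalog.loc[i, 'logg'] = catalog.loc[i, 'mp'] / catalog.loc[i, 'rp']
# or, on pandas older than 1.0 only (.ix was removed in pandas 1.0):
# catalog.ix[i, 'logg'] = catalog.ix[i, 'mp'] / catalog.ix[i, 'rp']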
For your edit 3 you need to add this line:
your_rows = catalog[(catalog.logg > 4) & (catalog.logg < 5)]
Full code:
i = (catalog.rp != -1) & (catalog.mp != -1)
i &= catalog.logg == -1
catalog.loc[i, 'logg'] = catalog.loc[i, 'mp'] / catalog.loc[i, 'rp']
your_rows = catalog[(catalog.logg > 4) & (catalog.logg < 5)]
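Note that the final selection in the full code above filters on logg only, so a row with rp = -1 or mp = -1 and a logg between 4 and 5 would still get through. If that matters, one possible variant (a sketch, not the only way) is to apply all three cuts in a single mask and leave the original catalog unmodified. It uses Series.where to fill in the missing values; the names valid, logg_filled, keep and selected are my own, and the data is the small example from the question:

```python
import pandas as pd
from io import StringIO

# The four-row example from the question.
data = """
System rp mp logg
target-01 2 -1 2
target-02 -1 3 4
target-03 7 6 4.3
target-04 3.2 15 -1
"""
catalog = pd.read_csv(StringIO(data), sep=r"\s+")

# Cut 1 and 2: rp and mp must both be available.
valid = (catalog.rp != -1) & (catalog.mp != -1)

# Keep logg where it is available, otherwise fall back to mp / rp.
# This builds a new Series; catalog itself is not changed.
logg_filled = catalog.logg.where(catalog.logg != -1, catalog.mp / catalog.rp)

# Cut 3: 4 < logg < 5, combined with the first two cuts.
keep = valid & (logg_filled > 4) & (logg_filled < 5)

# Show the filled-in logg values in the selected rows.
selected = catalog[keep].assign(logg=logg_filled[keep])
print(selected)
```

On the example data this keeps target-03 and target-04, the same rows the question expects.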
EDIT
I may still not understand what you want, but this gives your desired output:
import pandas as pd
from io import StringIO
data = """
System rp mp logg
target-01 2 -1 2
target-02 -1 3 4
target-03 7 6 4.3
target-04 3.2 15 -1
"""
catalog = pd.read_csv(StringIO(data), sep=r'\s+')
i = (catalog.rp != -1) & (catalog.mp != -1)
i &= catalog.logg == -1
catalog.loc[i, 'logg'] = catalog.loc[i, 'mp'] / catalog.loc[i, 'rp']
your_rows = catalog[(catalog.logg > 4) & (catalog.logg < 5)]
In [7]: your_rows
Out[7]:
System rp mp logg
2 target-03 7.0 6 4.3000
3 target-04 3.2 15 4.6875
Am I still wrong?