Pandas TypeError when trying to count NaNs in subset of dataframe column

I'm writing a script to perform LLoD analysis for qPCR assays for my lab. I import the relevant columns from the instrument's .csv data file using pandas.read_csv() with the usecols parameter, make a list of the unique values in the RNA quantity/concentration column, and then need to determine the detection rate (hit rate) at each concentration. If the target is detected, the result will be a number; if not, it'll be listed as "TND" or "Undetermined" or some other non-numeric string (depending on the instrument). So I wrote a function that (should) take a quantity and the dataframe of results and return the probability of detection for that quantity. However, on running the script, I get the following error:

Traceback (most recent call last):
  File "C:\Python\llod_custom.py", line 34, in <module>
    prop[idx] = hitrate(val, data)
  File "C:\Python\llod_custom.py", line 29, in hitrate
    df = pd.to_numeric(list[:,1], errors='coerce').isna()
  File "C:\Users\wmacturk\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "C:\Users\wmacturk\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\indexes\base.py", line 3080, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 75, in pandas._libs.index.IndexEngine.get_loc
TypeError: '(slice(None, None, None), 1)' is an invalid key

The idea in the line that's throwing the error ( df = pd.to_numeric(list[:,1], errors='coerce').isna() ) is to change any non-numeric values in the column to NaN, then get a boolean array telling me whether a given row's entry is NaN, so I can count the number of numeric entries with df.sum() later. I'm sure it's something that should be obvious to anyone who's worked with pandas / dataframes, but I haven't used dataframes in Python before, so I'm at a loss. I'm also much more familiar with C and JavaScript, so something like Python that isn't as rigid can actually be a bit confusing since it's so flexible. Any help would be greatly appreciated.
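For readers following along, here is a minimal sketch of what seems to go wrong in that line, using a small made-up dataframe (the names ex, subset and not_detected are illustrative only, not from the script). Plain square brackets on a DataFrame look up column labels, so the tuple (slice(None), 1) is treated as a label rather than a row/column position; positional access goes through .iloc instead. The exact exception class raised may differ between pandas versions, so the sketch catches Exception broadly.

```python
import pandas as pd

# Hypothetical stand-in for the instrument data (not the real file)
ex = pd.DataFrame({"qty": [1, 1, 10, 10], "result": ["TND", 5, 5, "TND"]})
subset = ex[ex.iloc[:, 0] == 1]

# subset[:, 1] fails: [] on a DataFrame expects column labels, not positions.
# The exception type is version-dependent, hence the broad except.
try:
    subset[:, 1]
    raised = False
except Exception:
    raised = True

# Positional selection of the second column uses .iloc
not_detected = pd.to_numeric(subset.iloc[:, 1], errors="coerce").isna()

# isna() marks the NON-numeric rows, so its mean is the miss rate
hit_rate = 1 - not_detected.mean()
print(raised, hit_rate)
```

Note that summing the isna() mask counts the replicates that were not detected, so the hit rate is one minus that fraction.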

NB: the conc column will consist of 5 to 10 different values, each repeated 5-10 times (i.e. 5-10 replicates at each of the 5-10 concentrations); the detect column will contain either a number or a character string in each row. Numbers mean success, strings mean failure. For my purposes the value of the numbers is irrelevant; I only need to know whether the target was detected for a given replicate. My script (up to this point) follows:

import os
import pandas as pd
import numpy as np
import statsmodels as sm
from scipy.stats import norm
from tkinter import filedialog
from tkinter import *

# initialize tkinter
root = Tk()
root.withdraw()


# prompt for data file and column headers, then read those columns into a dataframe
print("In the directory prompt, select the .csv file containing data for analysis")
path = filedialog.askopenfilename()

conc = input("Enter the column header for concentration/number of copies: ")
detect = input("Enter the column header for target detection: ")
tnd = input("Enter the value listed when a target is not detected (e.g. \"TND\", \"Undetected\", etc.): ")

data = pd.read_csv(path, usecols=[conc, detect])

# create list of unique values for quantity of RNA, initialize vectors of same length
# to store probabilies and probit scores for each
qtys = data[conc].unique()
prop = probit = [0] * len(qtys)

# Function to get the hitrate/probability of detection for a given quantity
def hitrate(qty, dataFrame):
    list = dataFrame[dataFrame.iloc[:,0] == qty]
    df = pd.to_numeric(list[:,1], errors='coerce').isna()
    return (len(df) - (len(df)-df.sum()))/len(df)

# iterate over quantities to calculate the corresponding probability of Detection
# and its associate probit score
for idx, val in enumerate(qtys):
    prop[idx] = hitrate(val, data)
    probit[idx] = norm.ppf(hitrate(val, data))

# create an array of the quantities with their associated probabilities & Probit scores
hitTable = vstack([qtys,prop,probit])

A sample dataframe can be created with:

d = {'qty':[1,1,1,1,1, 10,10,10,10,10, 20,20,20,20,20, 50,50,50,50,50, 100,100,100,100,100], 'result':['TND','TND','TND',5,'TND', 'TND',5,'TND',5,'TND', 5,'TND',5,'TND',5, 5,6,5,5,'TND', 5,5,5,5,5]}
exData = pd.DataFrame(data=d)

Then just use exData as the dataframe data in the original code.

EDIT: I've fixed the problem by tweaking Loic RW's answer slightly. The function hitrate should be:

def hitrate(qty, df):
    t_s = df[df.qty == qty].result
    t_s = t_s.apply(pd.to_numeric, args=('coerce',)).isna()
    return (len(t_s)-t_s.sum())/len(t_s)

Does the following achieve what you want? I made some assumptions about the structure of your data.

def hitrate(qty, df):
    target_subset = df[df.qty == qty].target
    target_subset = target_subset.apply(pd.to_numeric, args=('coerce',)).isna()
    return 1-((target_subset.sum())/len(target_subset))

If I run the following:

data = pd.DataFrame({'qty': [1,2,2,2,3],
                     'target': [.5, .8, 'TND', 'Undetermined', .99]})
hitrate(2, data)

I get: 0.33333333333333337
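As a possible alternative to calling hitrate once per concentration, the sketch below computes the hit rate for every quantity in one pass with groupby. It reuses the sample data from the question; the names ex, detected and hit_rate are illustrative, and the column names qty and result are assumed to match that sample rather than any real instrument export.

```python
import pandas as pd

# Sample data copied from the question
ex = pd.DataFrame({
    "qty": [1] * 5 + [10] * 5 + [20] * 5 + [50] * 5 + [100] * 5,
    "result": ["TND", "TND", "TND", 5, "TND",
               "TND", 5, "TND", 5, "TND",
               5, "TND", 5, "TND", 5,
               5, 6, 5, 5, "TND",
               5, 5, 5, 5, 5],
})

# True where the result parses as a number, i.e. the target was detected
detected = pd.to_numeric(ex["result"], errors="coerce").notna()

# The mean of a boolean mask within each group is the fraction detected
hit_rate = detected.groupby(ex["qty"]).mean()
print(hit_rate)
```

The resulting Series is indexed by quantity, so it can be passed on to a later step such as the probit transform without building the lists by hand.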
