How to check missing values from particular column of python pandas dataframe?

I have a dataframe like the one below:

               time              machine        message
6      2017-08-07 05:24:31,955    solid    qa  : 6502444
20     2017-08-07 05:24:32,024    solid    prod : 6502445
32     2017-08-07 05:24:32,033    solid    qa : 6502445
44     2017-08-07 05:24:32,041    solid    prod : 6502446
56     2017-08-07 05:24:32,055    solid    qa : 6502446
68     2017-08-07 05:24:32,063    solid    prod : 6502447
80     2017-08-07 05:24:32,071    solid    qa : 6502450
92     2017-08-07 05:24:32,079    solid    prod : 6502451
92     2017-08-07 05:24:32,079    solid    qa : 6502452
104    2017-08-07 05:24:32,086    solid    prod : 6502453
116    2017-08-07 05:24:32,094    liquid    qa : 700001
128    2017-08-07 05:24:32,101    liquid    prod : 700004
140    2017-08-07 05:24:32,108    liquid    qa : 700002
152    2017-08-07 05:24:32,115    liquid    prod : 700005
164    2017-08-07 05:24:32,126    liquid    qa : 700007
176    2017-08-07 05:24:32,133    liquid    prod : 700010
188    2017-08-07 05:24:32,140    liquid    qa : 700008
200    2017-08-07 05:24:32,147    liquid    prod : 700011

I want to know which numbers are missing from the message column, for qa and for prod.

Example:

For machine solid, the message column contains qa : 6502446. After this row I expect a row for machine solid with qa : 6502447, but it is not there, so that number is missing; 6502448 is missing in the same way. The same applies to prod.

I want to generate an output file in CSV format with the header below:

machine,message_header,missing_number_size,start,end,start_time,end_time

Details:

  • machine is solid or liquid in this case; if a third machine appears, it needs to be taken into account as well.
  • message_header comes from the message column values. It is always either qa or prod.
  • missing_number_size is the count of numbers missing in between, for qa and for prod.
  • start is (last available number + 1) before the gap in the message column.
  • end is (first available number - 1) after the gap in the message column.
  • start_time is the last available time before the gap, taken from the time column.
  • end_time is the first available time after the gap, taken from the time column.
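
To make the start, end and missing_number_size definitions concrete, here is a minimal sketch for a single pair of consecutive available numbers. The function name `describe_gap` is hypothetical and used only for illustration.

```python
def describe_gap(prev_number, next_number):
    """Return (missing_number_size, start, end) for two consecutive
    available numbers, or None when nothing is missing between them."""
    start = prev_number + 1   # last available number + 1
    end = next_number - 1     # first available number - 1
    size = end - start + 1    # how many numbers are missing
    if size <= 0:
        return None
    return size, start, end


print(describe_gap(6502446, 6502450))  # (3, 6502447, 6502449)
print(describe_gap(6502450, 6502452))  # (1, 6502451, 6502451)
print(describe_gap(6502444, 6502445))  # None
```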

Output.csv will look like this:

machine,message_header,missing_number_size,start,end,start_time,end_time
solid,qa,3,6502447,6502449,2017-08-07 05:24:32,055,2017-08-07 05:24:32,071
solid,qa,1,6502451,6502451,2017-08-07 05:24:32,071,2017-08-07 05:24:32,079
solid,prod,3,6502448,6502450,2017-08-07 05:24:32,063,2017-08-07 05:24:32,079
solid,prod,1,6502452,6502452,2017-08-07 05:24:32,079,2017-08-07 05:24:32,086
liquid,qa,4,700003,700006,2017-08-07 05:24:32,108,2017-08-07 05:24:32,126
liquid,prod,4,700006,700009,2017-08-07 05:24:32,115,2017-08-07 05:24:32,133

Edit 1: in response to a comment asking what I have tried.

I tried the experiment below, but it is not the pandas way, and it takes a long time to print the output on the console. I want a (fast) pandas way.

Assumption in the code: data is a dictionary whose values are whole dataframes like the one above.

# remove_name is my own helper (not shown) that strips the "qa : " / "prod : " prefix.
for key, value in data.items():
    prev_qa_no = 0
    prev_prod_no = 0
    prev_time = ""
    flag = False  # becomes True once the first row has been seen
    # .copy() so that adding columns below does not modify a view of value
    qa = value[value['message'].str.contains("qa")].copy()
    prod = value[value['message'].str.contains("prod")].copy()
    qa['qa'] = qa['message'].apply(remove_name)
    del qa['message']
    qa.sort_values('qa', inplace=True)
    for index, row in qa.iterrows():
        time = row['time']
        machine = row['machine']
        qa_no = int(row['qa'])
        if flag:
            if (qa_no - prev_qa_no) > 1:
                # count of numbers missing between the two available ones
                diff = qa_no - prev_qa_no - 1
                print(",".join(map(str, [machine.strip(), "qa", diff, prev_qa_no + 1, qa_no - 1, prev_time.strip(), time])))

        flag = True
        prev_time = time
        prev_qa_no = qa_no

    # same scan for the prod rows
    prev_time = ""
    prod['prod'] = prod['message'].apply(remove_name)
    del prod['message']
    prod.sort_values('prod', inplace=True)
    flag = False
    for index, row in prod.iterrows():
        time = row['time']
        machine = row['machine']
        prod_no = int(row['prod'])
        if flag:
            if (prod_no - prev_prod_no) > 1:
                diff = prod_no - prev_prod_no - 1
                print(",".join(map(str, [machine.strip(), "prod", diff, prev_prod_no + 1, prod_no - 1, prev_time.strip(), time])))

        flag = True
        prev_time = time
        prev_prod_no = prod_no

Please feel free to ask if anything in this question is unclear.

Let's call your dataframe df (I reduced it to only the message column because I'm lazy):

# Stuff to get a similar dataframe####################
from io import StringIO
import pandas as pd
frame = """message
qa  : 6502444
prod : 6502445
qa : 6502445
prod : 6502446
qa : 6502446
prod : 6502447"""
df = pd.read_csv(StringIO(frame), header=0)
# Stuff to get a similar dataframe END################
# split the message on ':' and strip the uneven spacing around it
df2 = df['message'].apply(lambda m: [part.strip() for part in m.split(':')])
# group by message type (qa or prod)
df = pd.DataFrame(df2.tolist(), columns=['prodqa', 'message'])
gb = df.groupby('prodqa')
qa = gb.get_group('qa')
prod = gb.get_group('prod')
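
The split above works on the message column alone. If the time and machine columns need to stay attached to each row, one possible alternative is `Series.str.extract` with named groups, sketched here on a three-row sample in the question's layout. The column names `message_header` and `number` are my own choice, not something the question fixes.

```python
import pandas as pd

# Three-row sample in the question's layout (including the uneven spacing).
df = pd.DataFrame({
    "time": ["2017-08-07 05:24:31,955", "2017-08-07 05:24:32,024", "2017-08-07 05:24:32,033"],
    "machine": ["solid", "solid", "solid"],
    "message": ["qa  : 6502444", "prod : 6502445", "qa : 6502445"],
})

# One named capture group per new column; \s* absorbs the spacing around ':'.
parts = df["message"].str.extract(r"^\s*(?P<message_header>\w+)\s*:\s*(?P<number>\d+)\s*$")
df["message_header"] = parts["message_header"]
df["number"] = parts["number"].astype(int)  # assumes every row matched the pattern

print(df[["machine", "message_header", "number"]])
```

Rows that do not match the pattern would come back as NaN and make `astype(int)` fail, so real log data may need a `dropna()` or a check first.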

You can now check what is missing in qa and in prod.

I haven't fully understood how you want to do that, but if the goal is to see whether each "qa" number matches a "prod" number, you can apply set-style operations. For example, to see what is in prod and not in qa:

missing_in_qa=prod['message'][~prod['message'].isin(qa['message'])]
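
If the goal is instead the gap report described in the question (consecutive numbers missing within each machine and each of qa/prod), a vectorized sketch could sort within each group and compare every number with the previous one using `groupby(...).shift()`. This is one possible approach under my reading of the question, not a tested solution for the full data; the sample rows are the solid rows from the question, and the intermediate column names (`number`, `prev_number`, `prev_time`) are my own.

```python
import pandas as pd

rows = [
    ("2017-08-07 05:24:32,055", "solid", "qa : 6502446"),
    ("2017-08-07 05:24:32,063", "solid", "prod : 6502447"),
    ("2017-08-07 05:24:32,071", "solid", "qa : 6502450"),
    ("2017-08-07 05:24:32,079", "solid", "prod : 6502451"),
    ("2017-08-07 05:24:32,079", "solid", "qa : 6502452"),
    ("2017-08-07 05:24:32,086", "solid", "prod : 6502453"),
]
df = pd.DataFrame(rows, columns=["time", "machine", "message"])

# Split "qa : 6502446" into a header and an integer number.
parts = df["message"].str.extract(r"^\s*(?P<message_header>\w+)\s*:\s*(?P<number>\d+)\s*$")
df = df.assign(message_header=parts["message_header"],
               number=parts["number"].astype(int))

# Sort inside each (machine, header) group; reset the index so that
# duplicate index labels (as in the question's data) cannot break alignment.
df = df.sort_values(["machine", "message_header", "number"]).reset_index(drop=True)

# Previous number and previous time within the same group (NaN for the first row).
grouped = df.groupby(["machine", "message_header"], sort=False)
df["prev_number"] = grouped["number"].shift()
df["prev_time"] = grouped["time"].shift()

# A gap exists wherever the step from the previous number is larger than 1.
gaps = df[df["number"] - df["prev_number"] > 1]
result = pd.DataFrame({
    "machine": gaps["machine"],
    "message_header": gaps["message_header"],
    "missing_number_size": (gaps["number"] - gaps["prev_number"] - 1).astype(int),
    "start": (gaps["prev_number"] + 1).astype(int),
    "end": gaps["number"] - 1,
    "start_time": gaps["prev_time"],
    "end_time": gaps["time"],
})
print(result.to_string(index=False))
```

Calling `result.to_csv("output.csv", index=False)` would then write the file. Because the timestamps themselves contain commas, pandas will quote those fields in the CSV. The rows come out ordered by machine and header (prod before qa), which differs from the ordering shown in the question.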
