How to check for missing values in a specific column of a Python pandas dataframe?

Question

I have a dataframe like the one below:

               time              machine        message
6      2017-08-07 05:24:31,955    solid    qa  : 6502444
20     2017-08-07 05:24:32,024    solid    prod : 6502445
32     2017-08-07 05:24:32,033    solid    qa : 6502445
44     2017-08-07 05:24:32,041    solid    prod : 6502446
56     2017-08-07 05:24:32,055    solid    qa : 6502446
68     2017-08-07 05:24:32,063    solid    prod : 6502447
80     2017-08-07 05:24:32,071    solid    qa : 6502450
92     2017-08-07 05:24:32,079    solid    prod : 6502451
92     2017-08-07 05:24:32,079    solid    qa : 6502452
104    2017-08-07 05:24:32,086    solid    prod : 6502453
116    2017-08-07 05:24:32,094    liquid    qa : 700001
128    2017-08-07 05:24:32,101    liquid    prod : 700004
140    2017-08-07 05:24:32,108    liquid    qa : 700002
152    2017-08-07 05:24:32,115    liquid    prod : 700005
164    2017-08-07 05:24:32,126    liquid    qa : 700007
176    2017-08-07 05:24:32,133    liquid    prod : 700010
188    2017-08-07 05:24:32,140    liquid    qa : 700008
200    2017-08-07 05:24:32,147    liquid    prod : 700011

I want to know which numbers are missing from the message column for qa and for prod.

Example:

For machine solid, the message column contains qa : 6502446. After this row I expect a solid row with qa : 6502447, but it is not there, so that number is missing. Likewise, 6502448 is also missing. The same applies to prod.

I want to generate an output file in CSV format with a header like the one below:

machine,message_header,missing_number_size,start,end,start_time,end_time

Detailed information:

machine is solid or liquid in this case (if a third machine appears, it needs to be taken into account as well).
message_header comes from the message column values. It is always either qa or prod.
missing_number_size is the count of consecutive numbers missing between two available numbers, computed for qa and for prod.
start is the last available number before the gap, plus 1.
end is the first available number after the gap, minus 1.
start_time is the value of the time column for the last available number before the gap.
end_time is the value of the time column for the first available number after the gap.
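
To make these definitions concrete, here is a tiny illustration using two consecutive qa numbers from the solid rows above. The variable names are only for illustration.

```python
# Two consecutive qa numbers (and their times) for machine "solid"
prev_number, prev_time = 6502446, "2017-08-07 05:24:32,055"
next_number, next_time = 6502450, "2017-08-07 05:24:32,071"

start = prev_number + 1                 # first missing number
end = next_number - 1                   # last missing number
missing_number_size = end - start + 1   # how many numbers are missing
start_time, end_time = prev_time, next_time

print(start, end, missing_number_size)  # 6502447 6502449 3
```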

Output.csv will look like this:

machine,message_header,missing_number_size,start,end,start_time,end_time
solid,qa,3,6502447,6502449,2017-08-07 05:24:32,055,2017-08-07 05:24:32,071
solid,qa,1,6502451,6502451,2017-08-07 05:24:32,071,2017-08-07 05:24:32,079
solid,prod,3,6502448,6502450,2017-08-07 05:24:32,063,2017-08-07 05:24:32,079
solid,prod,1,6502452,6502452,2017-08-07 05:24:32,079,2017-08-07 05:24:32,086
liquid,qa,4,700003,700006,2017-08-07 05:24:32,108,2017-08-07 05:24:32,126
liquid,prod,4,700006,700009,2017-08-07 05:24:32,115,2017-08-07 05:24:32,133

Edit 1: in response to a comment asking what I have tried.

I tried the experiment below, but it is not the pandas way. It takes a long time to print the results to the console. I want a pandas (fast) way:

Assumption in the code: data is a dictionary whose values are whole dataframes like the one above.

# remove_name is a helper not shown here; it presumably extracts the number from the message
for key, value in data.items():
    prev_qa_no = 0
    prev_prod_no = 0
    prev_time = ""
    flag = False
    # .copy() avoids writing into a slice of the original dataframe
    qa = value[value['message'].str.contains("qa")].copy()
    prod = value[value['message'].str.contains("prod")].copy()
    qa['qa'] = qa['message'].apply(remove_name)
    del qa['message']
    qa.sort_values('qa', inplace=True)
    for index, row in qa.iterrows():
        time = row['time']
        machine = row['machine']  # column is named 'machine' in the sample above
        qa_no = int(row['qa'])
        if flag:
            if (qa_no - prev_qa_no) > 1:
                # count of numbers strictly between the two available ones
                diff = qa_no - prev_qa_no - 1
                print(",".join([str(machine).strip(), "qa", str(diff), str(prev_qa_no + 1), str(qa_no - 1), str(prev_time).strip(), str(time)]))

        flag = True
        prev_time = time
        prev_qa_no = qa_no

    prev_time = ""
    prod['prod'] = prod['message'].apply(remove_name)
    del prod['message']
    prod.sort_values('prod', inplace=True)
    flag = False
    for index, row in prod.iterrows():
        time = row['time']
        machine = row['machine']
        prod_no = int(row['prod'])
        if flag:
            if (prod_no - prev_prod_no) > 1:
                # count of numbers strictly between the two available ones
                diff = prod_no - prev_prod_no - 1
                print(",".join([str(machine).strip(), "prod", str(diff), str(prev_prod_no + 1), str(prod_no - 1), str(prev_time).strip(), str(time)]))

        flag = True
        prev_time = time
        prev_prod_no = prod_no

Please feel free to ask if anything about this question is unclear.

Answer 1

Let's call your dataframe df (I reduced it to only the message column because I'm lazy):

# Stuff to get a similar dataframe####################
from io import StringIO
import pandas as pd
frame = """message
qa  : 6502444
prod : 6502445
qa : 6502445
prod : 6502446
qa : 6502446
prod : 6502447"""
df = pd.read_csv(StringIO(frame), header=0)
# Stuff to get a similar dataframe END################
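# split the message into its type and its number, stripping stray spaces
# (the first row is "qa  : 6502444", so a plain split(' : ') would leave "qa ")
df2 = df['message'].apply(lambda m: [part.strip() for part in m.split(':')])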
#group by message type (qa or prod)
df = pd.DataFrame(df2.tolist(),columns=['prodqa','message'])
gb = df.groupby('prodqa')
qa = gb.get_group('qa')
prod = gb.get_group('prod')

You can now check what is missing in qa and in prod.

I haven't fully understood how you want to do that, but if the goal is to see whether each "qa" matches a "prod", you can apply set-like operations. For example, to see what is in prod and not in qa:

missing_in_qa=prod['message'][~prod['message'].isin(qa['message'])]
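
If the goal is instead to list the gaps in the numbering for each machine and message type, as the expected Output.csv in the question suggests, a vectorized approach might look like the following. This is a minimal sketch under stated assumptions, not a tested solution for the full data: the function name find_gaps, the intermediate column names, and the regular expression used to split the message are my own choices, and the sample contains only the solid rows from the question.

```python
import pandas as pd

# Sample: the "solid" rows from the question
rows = [
    ("2017-08-07 05:24:31,955", "solid", "qa  : 6502444"),
    ("2017-08-07 05:24:32,024", "solid", "prod : 6502445"),
    ("2017-08-07 05:24:32,033", "solid", "qa : 6502445"),
    ("2017-08-07 05:24:32,041", "solid", "prod : 6502446"),
    ("2017-08-07 05:24:32,055", "solid", "qa : 6502446"),
    ("2017-08-07 05:24:32,063", "solid", "prod : 6502447"),
    ("2017-08-07 05:24:32,071", "solid", "qa : 6502450"),
    ("2017-08-07 05:24:32,079", "solid", "prod : 6502451"),
    ("2017-08-07 05:24:32,079", "solid", "qa : 6502452"),
    ("2017-08-07 05:24:32,086", "solid", "prod : 6502453"),
]
df = pd.DataFrame(rows, columns=["time", "machine", "message"])


def find_gaps(df):
    """Return one row per run of missing numbers (hypothetical helper)."""
    # Use a clean 0..n-1 index so duplicate index labels cannot cause trouble
    df = df.reset_index(drop=True)

    # Split "qa : 6502446" into a header ("qa") and a number
    parts = df["message"].str.extract(
        r"^\s*(?P<message_header>\w+)\s*:\s*(?P<number>\d+)\s*$"
    )
    work = pd.concat([df[["time", "machine"]], parts], axis=1)
    work["number"] = work["number"].astype(int)
    work = work.sort_values(["machine", "message_header", "number"])

    # Previous number and time within each (machine, message_header) group
    grouped = work.groupby(["machine", "message_header"], sort=False)
    work["prev_number"] = grouped["number"].shift()
    work["prev_time"] = grouped["time"].shift()

    # A gap exists where consecutive numbers differ by more than 1
    gaps = work[work["number"] - work["prev_number"] > 1].copy()
    gaps["start"] = gaps["prev_number"].astype(int) + 1
    gaps["end"] = gaps["number"] - 1
    gaps["missing_number_size"] = gaps["end"] - gaps["start"] + 1
    gaps = gaps.rename(columns={"prev_time": "start_time", "time": "end_time"})

    columns = ["machine", "message_header", "missing_number_size",
               "start", "end", "start_time", "end_time"]
    return gaps[columns].reset_index(drop=True)


result = find_gaps(df)
print(result.to_csv(index=False))
```

The shift within each group replaces the row-by-row loop: each row is compared with the previous number of the same machine and message type. Writing to a file would be result.to_csv("Output.csv", index=False). Because the timestamps contain a comma, to_csv quotes those fields by default.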

Solution 1
0 2017-08-09 07:58:57
