How to check missing values from a particular column of a Python pandas dataframe?
I have a dataframe like the one below:
time machine message
6 2017-08-07 05:24:31,955 solid qa : 6502444
20 2017-08-07 05:24:32,024 solid prod : 6502445
32 2017-08-07 05:24:32,033 solid qa : 6502445
44 2017-08-07 05:24:32,041 solid prod : 6502446
56 2017-08-07 05:24:32,055 solid qa : 6502446
68 2017-08-07 05:24:32,063 solid prod : 6502447
80 2017-08-07 05:24:32,071 solid qa : 6502450
92 2017-08-07 05:24:32,079 solid prod : 6502451
92 2017-08-07 05:24:32,079 solid qa : 6502452
104 2017-08-07 05:24:32,086 solid prod : 6502453
116 2017-08-07 05:24:32,094 liquid qa : 700001
128 2017-08-07 05:24:32,101 liquid prod : 700004
140 2017-08-07 05:24:32,108 liquid qa : 700002
152 2017-08-07 05:24:32,115 liquid prod : 700005
164 2017-08-07 05:24:32,126 liquid qa : 700007
176 2017-08-07 05:24:32,133 liquid prod : 700010
188 2017-08-07 05:24:32,140 liquid qa : 700008
200 2017-08-07 05:24:32,147 liquid prod : 700011
I want to know which numbers are missing in the message column, separately for qa and prod.
Example:
For machine solid, the message column contains qa : 6502446. After that row I expect qa : 6502447 for the same machine, but it is not there, so it counts as missing; 6502448 is missing in the same way. The same applies to prod.
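To make the definition of "missing" concrete, here is a minimal sketch in plain Python. The helper name missing_runs is made up for illustration; the input is the list of qa numbers for the solid machine taken from the table above.

```python
def missing_runs(numbers):
    """Return (start, end, size) for every run of integers absent from numbers."""
    ordered = sorted(set(numbers))
    runs = []
    # Compare each number with its predecessor; a step larger than 1 is a gap.
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > 1:
            runs.append((prev + 1, cur - 1, cur - prev - 1))
    return runs


# qa numbers seen for machine "solid" in the sample dataframe
solid_qa = [6502444, 6502445, 6502446, 6502450, 6502452]
print(missing_runs(solid_qa))
# [(6502447, 6502449, 3), (6502451, 6502451, 1)]
```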
I want to generate one output file in CSV format with a header like this:
machine,message_header,missing_number_size,start,end,start_time,end_time
Detailed information:
Output.csv will look like this:
machine,message_info,missing_number_size,start,end,start_time,end_time
solid,qa,3,6502447,6502449,2017-08-07 05:24:32,055,2017-08-07 05:24:32,071
solid,qa,1,6502451,6502451,2017-08-07 05:24:32,071,2017-08-07 05:24:32,079
solid,prod,3,6502448,6502450,2017-08-07 05:24:32,063,2017-08-07 05:24:32,079
solid,prod,1,6502452,6502452,2017-08-07 05:24:32,079,2017-08-07 05:24:32,086
liquid,qa,4,700003,700006,2017-08-07 05:24:32,108,2017-08-07 05:24:32,126
liquid,prod,4,700006,700009,2017-08-07 05:24:32,115,2017-08-07 05:24:32,133
Edit 1: in response to a comment asking what I have tried.
I tried the experiment below, but it is not the pandas way, and it takes a long time to print the result on the console. I want a (fast) pandas way.
Assumption in the code: data is a dictionary whose values are whole dataframes like the one above.
# NOTE: remove_name is a helper that is not shown here; it is expected to
# strip the "qa : " / "prod : " prefix and return the number.
for key, value in data.items():
    prev_qa_no = 0
    prev_prod_no = 0
    prev_time = ""
    total_count = 0
    flag = False  # becomes True once the first row has been seen
    # .copy() so that adding a column does not modify a view of value
    qa = value[value['message'].str.contains("qa")].copy()
    prod = value[value['message'].str.contains("prod")].copy()
    qa['qa'] = qa['message'].apply(remove_name)
    del qa['message']
    qa.sort_values('qa', inplace=True)
    for index, row in qa.iterrows():
        time = row['time']
        machine = row['machine']  # column name as in the dataframe shown above
        qa_no = int(row['qa'])
        if flag:
            if (qa_no - prev_qa_no) > 1:
                # count of the numbers strictly between prev_qa_no and qa_no
                diff = qa_no - prev_qa_no - 1
                print(str(machine.strip()) + ",qa," + str(diff) + "," + str(prev_qa_no + 1) + "," + str(qa_no - 1) + "," + str(prev_time.strip()) + "," + str(time))
        flag = True
        prev_time = time
        prev_qa_no = qa_no
    # same pass again for the prod messages
    prev_time = ""
    total_count = 0
    prod['prod'] = prod['message'].apply(remove_name)
    del prod['message']
    prod.sort_values('prod', inplace=True)
    flag = False
    for index, row in prod.iterrows():
        time = row['time']
        machine = row['machine']
        prod_no = int(row['prod'])
        if flag:
            if (prod_no - prev_prod_no) > 1:
                diff = prod_no - prev_prod_no - 1
                print(str(machine.strip()) + ",prod," + str(diff) + "," + str(prev_prod_no + 1) + "," + str(prod_no - 1) + "," + str(prev_time.strip()) + "," + str(time))
        flag = True
        prev_time = time
        prev_prod_no = prod_no
Please feel free to ask if anything in this question is unclear.
Let's call your dataframe df (I reduced it to only the message column because I'm lazy):
# Stuff to get a similar dataframe ####################
from io import StringIO
import pandas as pd
frame = """message
qa : 6502444
prod : 6502445
qa : 6502445
prod : 6502446
qa : 6502446
prod : 6502447"""
df = pd.read_csv(StringIO(frame), header=0)
# Stuff to get a similar dataframe END ################
# split each message into its type (qa or prod) and its number
df2 = df['message'].apply(lambda m: m.split(' : '))
df = pd.DataFrame(df2.tolist(), columns=['prodqa', 'message'])
# group by message type (qa or prod)
gb = df.groupby('prodqa')
qa = gb.get_group('qa')
prod = gb.get_group('prod')
You can now check what is missing in qa and in prod.
I haven't fully understood how you want to do that, but if the goal is to see whether each "qa" matches a "prod", you can apply set operations, e.g. to see what is in prod but not in qa:
missing_in_qa=prod['message'][~prod['message'].isin(qa['message'])]
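The set comparison above checks qa against prod. For the gap report described in the question, one possible vectorized approach is to sort within each (machine, message type) group and compare every number with the previous one in that group. The following is a minimal sketch under assumptions, not a definitive solution: the function name find_gaps and the intermediate column names (number, prev_number, prev_time) are made up for illustration, and every message is assumed to have the form "<type> : <integer>".

```python
import pandas as pd

# Rows copied from the sample dataframe in the question.
rows = [
    ("2017-08-07 05:24:31,955", "solid", "qa : 6502444"),
    ("2017-08-07 05:24:32,024", "solid", "prod : 6502445"),
    ("2017-08-07 05:24:32,033", "solid", "qa : 6502445"),
    ("2017-08-07 05:24:32,041", "solid", "prod : 6502446"),
    ("2017-08-07 05:24:32,055", "solid", "qa : 6502446"),
    ("2017-08-07 05:24:32,063", "solid", "prod : 6502447"),
    ("2017-08-07 05:24:32,071", "solid", "qa : 6502450"),
    ("2017-08-07 05:24:32,079", "solid", "prod : 6502451"),
    ("2017-08-07 05:24:32,079", "solid", "qa : 6502452"),
    ("2017-08-07 05:24:32,086", "solid", "prod : 6502453"),
    ("2017-08-07 05:24:32,094", "liquid", "qa : 700001"),
    ("2017-08-07 05:24:32,101", "liquid", "prod : 700004"),
    ("2017-08-07 05:24:32,108", "liquid", "qa : 700002"),
    ("2017-08-07 05:24:32,115", "liquid", "prod : 700005"),
    ("2017-08-07 05:24:32,126", "liquid", "qa : 700007"),
    ("2017-08-07 05:24:32,133", "liquid", "prod : 700010"),
    ("2017-08-07 05:24:32,140", "liquid", "qa : 700008"),
    ("2017-08-07 05:24:32,147", "liquid", "prod : 700011"),
]
df = pd.DataFrame(rows, columns=["time", "machine", "message"])


def find_gaps(frame):
    """Return one row per run of missing numbers, per machine and message type."""
    out = frame.copy()
    # Split "qa : 6502444" into the type ("qa") and the number (6502444).
    parts = out["message"].str.split(":", n=1, expand=True)
    out["message_info"] = parts[0].str.strip()
    out["number"] = parts[1].str.strip().astype(int)

    out = out.sort_values(["machine", "message_info", "number"])
    # Previous number and previous time within the same (machine, type) group.
    prev = out.groupby(["machine", "message_info"])[["number", "time"]].shift()
    out["prev_number"] = prev["number"]
    out["prev_time"] = prev["time"]

    # A step larger than 1 means at least one number is missing in between.
    gaps = out[out["number"] - out["prev_number"] > 1]
    return pd.DataFrame(
        {
            "machine": gaps["machine"],
            "message_info": gaps["message_info"],
            "missing_number_size": (gaps["number"] - gaps["prev_number"] - 1).astype(int),
            "start": (gaps["prev_number"] + 1).astype(int),
            "end": gaps["number"] - 1,
            "start_time": gaps["prev_time"],
            "end_time": gaps["time"],
        }
    ).reset_index(drop=True)


gaps = find_gaps(df)
print(gaps)
```

Calling gaps.to_csv("Output.csv", index=False) would write the report. Because the timestamps contain a comma, pandas quotes those fields by default, so the file will not look byte-for-byte like the hand-written sample in the question.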