How to keep only the last entry based on certain columns in a Pandas dataframe?

I have a dataframe with the following columns: TEST_NUM, HEAD_NUM, SITE_NUM, RESULT.

Here is some sample data from it:

________________________________________
TEST_NUM | HEAD_NUM | SITE_NUM | RESULT
________________________________________
10000    | 1        | 0        | P
________________________________________
10000    | 1        | 1        | F          ----> Should be retested, as the result is F
________________________________________
10000    | 1        | 2        | F          ----> Should be retested, as the result is F
________________________________________
10000    | 1        | 3        | P
________________________________________
10000    | 1        | 1        | P          ----> Retest done, finally a pass
________________________________________
10000    | 1        | 2        | P          ----> Retest done, finally a pass

The data above comes from a testing device that works on 4 sites {0,1,2,3}. As the dataframe shows, if a site fails, it is retested, and the retest can either pass or fail again. If it fails again, the site is retested once more.

I want to keep only the last test result in the dataframe for each particular test_num and site_num. So, if a given combination of test_num and site_num appears again in later rows, the final dataframe should contain only the last record for it.

The dataframe above should therefore end up looking like this:

== Desired result ==

TEST_NUM | HEAD_NUM | SITE_NUM | RESULT
________________________________________
10000    | 1        | 0        | P
________________________________________
10000    | 1        | 1        | P          ----> Replaced the row
________________________________________
10000    | 1        | 2        | P          ----> Replaced the row
________________________________________
10000    | 1        | 3        | P
________________________________________

Ideally, the rows should be in their correct order: for any test_num, site 0 first, then 1, then 2, then 3.

If the last records for a particular site cannot be kept in the original order (or if doing so would be too messy), the following result would also be acceptable.

== Result that would also be acceptable ==

TEST_NUM | HEAD_NUM | SITE_NUM | RESULT
________________________________________
10000    | 1        | 0        | P
________________________________________
10000    | 1        | 3        | P
________________________________________
10000    | 1        | 1        | P          ----> Not in the correct order, but OK
________________________________________
10000    | 1        | 2        | P          ----> Kept the last, not in the original order, but OK

What I have tried:

I tried to maintain 3 variables (old_site, old_site, old_testnum) while parsing the above dataframe from the text file. While creating each row from the text file, I check whether the current site_num equals the old_site value and whether the current testnum equals the old_testnum value. If so, I pop the last inserted value from the list (the list is used to create the dataframe once parsing is finished) and then insert the current value, so that only the last value remains. However, this makes a big assumption: that the duplicated value appears right after the original record. That is not the case here [the repeated row for SITE_NUM = 1 comes after two other sites (2 and 3)].

Can anyone suggest a way to obtain the desired result, or the other acceptable format? It would be great if there is an API that makes this elegant.

Here is a working example of what you are looking for in the question.

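import pandas as pd

# Reproduce the dataframe from the question
df = pd.DataFrame()
df['TEST_NUM'] = [10000, 10000, 10000, 10000, 10000, 10000]
df['HEAD_NUM'] = [1, 1, 1, 1, 1, 1]
df['SITE_NUM'] = [0, 1, 2, 3, 1, 2]
df['RESULT'] = ['P', 'F', 'F', 'P', 'P', 'P']

# Keep only the last row for each (TEST_NUM, SITE_NUM) pair
df = df.drop_duplicates(['TEST_NUM', 'SITE_NUM'], keep='last')
# Put the remaining rows back in site order
df = df.sort_values('SITE_NUM')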
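To make the effect of each step visible, here is a small sketch on the same sample data. It shows the row order right after `drop_duplicates` (which corresponds to the "also acceptable" result in the question) and then the sorted order. Sorting on both `TEST_NUM` and `SITE_NUM` and calling `reset_index` are additions of this sketch, on the assumption that the real data holds more than one `TEST_NUM`; the names `last_only` and `ordered` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "TEST_NUM": [10000] * 6,
    "HEAD_NUM": [1] * 6,
    "SITE_NUM": [0, 1, 2, 3, 1, 2],
    "RESULT": ["P", "F", "F", "P", "P", "P"],
})

# keep='last' keeps the final occurrence of each (TEST_NUM, SITE_NUM) pair;
# the surviving rows stay in their original relative order.
last_only = df.drop_duplicates(["TEST_NUM", "SITE_NUM"], keep="last")
print(last_only["SITE_NUM"].tolist())  # [0, 3, 1, 2]

# Assumption: with several TEST_NUM values, sorting on both keys keeps
# the sites grouped per test instead of interleaving them.
ordered = last_only.sort_values(["TEST_NUM", "SITE_NUM"]).reset_index(drop=True)
print(ordered)
```

If the original row labels matter, `reset_index(drop=True)` can simply be left out.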

I have just read your comment. From what I understand, you have an extra column, 'test_txt', and you want to remove duplicates and then sort, primarily by the 'test_txt' column.

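# Keep only the last row for each (test_txt, TEST_NUM, SITE_NUM) combination
df = df.drop_duplicates(['test_txt', 'TEST_NUM', 'SITE_NUM'], keep='last')
# Sort by test_txt first, then by site (a separate sort on SITE_NUM alone is not needed)
df = df.sort_values(['test_txt', 'SITE_NUM'])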
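As a quick check of the three-column variant, here is a self-contained sketch. The question does not show the 'test_txt' column, so the values used below ("vdd", "leak") and the second TEST_NUM are invented purely for illustration.

```python
import pandas as pd

# Hypothetical data: the 'test_txt' values and TEST_NUM 20000 are made up.
df = pd.DataFrame({
    "test_txt": ["vdd", "vdd", "leak", "vdd", "leak", "leak"],
    "TEST_NUM": [10000, 10000, 20000, 10000, 20000, 20000],
    "SITE_NUM": [0, 1, 0, 1, 1, 0],
    "RESULT": ["P", "F", "F", "P", "P", "P"],
})

keys = ["test_txt", "TEST_NUM", "SITE_NUM"]
result = (
    df.drop_duplicates(keys, keep="last")      # last row per key combination
      .sort_values(["test_txt", "SITE_NUM"])   # test_txt first, then site
      .reset_index(drop=True)
)
print(result)
```

Each (test_txt, TEST_NUM, SITE_NUM) combination appears once in `result`, carrying the outcome of its latest retest.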

If this is not what you are looking for, please update your question with further detail.
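On the parse-time approach described in the question: one way to drop the "duplicate comes right after the original" assumption is to key each record by (TEST_NUM, SITE_NUM) in a dict, so a later record overwrites the earlier one wherever it appears. This is only a sketch. The `records` list stands in for whatever the text-file parser produces, and its tuple layout is an assumption.

```python
import pandas as pd

# Assumed parser output, in file order: (TEST_NUM, HEAD_NUM, SITE_NUM, RESULT).
records = [
    (10000, 1, 0, "P"),
    (10000, 1, 1, "F"),
    (10000, 1, 2, "F"),
    (10000, 1, 3, "P"),
    (10000, 1, 1, "P"),
    (10000, 1, 2, "P"),
]

latest = {}
for test_num, head_num, site_num, result in records:
    # A later record with the same key overwrites the earlier one, so the
    # retest does not have to follow the original row immediately.
    latest[(test_num, site_num)] = (test_num, head_num, site_num, result)

df = pd.DataFrame(
    list(latest.values()),
    columns=["TEST_NUM", "HEAD_NUM", "SITE_NUM", "RESULT"],
)
print(df)
```

Because a Python dict (3.7+) keeps a key at the position where it was first inserted, the overwritten rows stay in their original slots, which gives the site order 0, 1, 2, 3 for this sample without a separate sort.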
