
Need help sorting through data

I'm trying to clean some data for a computational biology research project. However, an issue has come up: some dogs born in the same litter on the same day have the same mother but multiple fathers. I need to find these data points and return them in some form so that I can manually go back to the documents and check them. Does anyone know a better way, so that each set doesn't take 30+ minutes to finish?

I have been using pandas to go through the data so far, and I'm no CS wizard. I basically used a nested for loop to compare the records one pair at a time, and even the smaller sets have around 10k rows.

data = raw_data.loc[:,['Order', 'Name', 'Sire', 'Dam', 'Registration', 'DOB']]
length = len(data.index)

for i in range(0,length,1):
    for j in range(i+1,length,1):
        if (data.iat[i,5]==data.iat[j,5]): #Same date of birth
            if (data.iat[i,3]==data.iat[j,3]): #Same mother
                if (data.iat[i,2]!= data.iat[j,2]): #Different father
                    print(data.iat[i,0]+data.iat[j,0])

You can group your data by date of birth and mother, then count the number of distinct values in the father column. The count is calculated for every (DOB, Dam) group, and the groups you are interested in are those with a result greater than 1.

import pandas as pd

# Parentheses allow a comment on every line; a backslash continuation followed by a comment is a syntax error.
(
    data.groupby(by=['DOB', 'Dam'])              # Group your data by 'DOB' and 'Dam'
    .aggregate({'Sire': pd.Series.nunique})      # Count distinct values for 'Sire' in each group
    .sort_values(by="Sire", ascending=False)     # Descending order of the results
    .query("Sire > 1")                           # Take the 'DOB' and 'Dam' pairs with more than 1 'Sire'
    .to_excel("File_with_results.xlsx")          # Write the results to an Excel file
)
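The grouped counts tell you which DOB/Dam pairs are suspicious, but since you want to go back to the documents, you may prefer to see the individual dog records. Here is a minimal sketch of one way to do that with `transform`. The sample frame and the names `sires_per_litter` and `suspect_rows` are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample data; column names follow the question.
data = pd.DataFrame({
    "Order": [1, 2, 3, 4, 5],
    "Name": ["Ace", "Bea", "Cy", "Dot", "Eli"],
    "Sire": ["Rex", "Max", "Rex", "Rex", "Max"],
    "Dam": ["Luna", "Luna", "Luna", "Mia", "Mia"],
    "Registration": ["R1", "R2", "R3", "R4", "R5"],
    "DOB": ["2020-01-01", "2020-01-01", "2020-01-01", "2020-01-01", "2020-02-02"],
})

# For every row, the number of distinct sires recorded for its (DOB, Dam) litter.
sires_per_litter = data.groupby(["DOB", "Dam"])["Sire"].transform("nunique")

# Keep only the rows that belong to a litter with more than one sire,
# sorted so that littermates sit next to each other.
suspect_rows = data[sires_per_litter > 1].sort_values(["DOB", "Dam", "Sire"])
print(suspect_rows)
```

Because `transform` returns one value per original row, the result can be used directly as a boolean mask, and every column (Order, Name, Registration) is still there for the manual check. Note that rows with a missing DOB or Dam are left out of the grouping by default, so they will not be flagged.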

Welcome to Stack Overflow.

One suggestion in addition to Miguel's.

For testing, I would trim your file down to a small sample that includes the issue you are working on. You don't want to spend CPU time on the full data set until you know the program is behaving correctly.
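As a rough illustration of that idea, here is a sketch that checks the grouping logic against a tiny hand-made sample where the answer is known in advance. The function name `multi_sire_litters` and the sample values are hypothetical; only the column names are taken from the question.

```python
import pandas as pd

def multi_sire_litters(df):
    """Return the (DOB, Dam) groups that list more than one distinct Sire."""
    counts = df.groupby(["DOB", "Dam"])["Sire"].nunique()
    return counts[counts > 1]

# Hand-made sample: Luna's litter has two sires, Mia's litter has one.
sample = pd.DataFrame({
    "Sire": ["Rex", "Max", "Rex", "Rex"],
    "Dam": ["Luna", "Luna", "Mia", "Mia"],
    "DOB": ["2020-01-01", "2020-01-01", "2020-01-01", "2020-01-01"],
})

flagged = multi_sire_litters(sample)

# Only Luna's litter should be reported, with a count of 2 sires.
assert list(flagged.index) == [("2020-01-01", "Luna")]
assert flagged.iloc[0] == 2
```

Once the check passes on the sample, you could run the same function on a slice of the real file (for example `raw_data.head(500)`) before running it on everything.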
