
Extracting subset of data efficiently from Pandas Dataframe

I have 6 pandas dataframes (Patients, Test1, Test2, Test3, Test4, Test5) linked by an ID key.

Each row in the Patients dataframe represents a patient with a unique ID; there are 200,000+ patients/rows.

Each row in the Test dataframes represents a test result on a given day. The columns of the Test dataframes are ID, DATE, TEST_UNIT, TEST_RESULT. Each of the Test dataframes contains between 6,000,000 and 7,000,000 rows.

I want to loop through all the IDs in the Patients dataframe and in each iteration use the ID to extract relevant test data from each of the 5 Test dataframes and do some processing on them.

If I do

for i in range(len(Patients)):
    ind_id = Patients.ID.iloc[i]
    ind_test1 = Test1[Test1['ID'] == ind_id]
    ind_test2 = Test2[Test2['ID'] == ind_id]
    ind_test3 = Test3[Test3['ID'] == ind_id]
    ind_test4 = Test4[Test4['ID'] == ind_id]
    ind_test5 = Test5[Test5['ID'] == ind_id]

It takes about 3.6 seconds per iteration.

I tried to speed it up by using the NumPy interface:

Patients_v = Patients.values
Test1_v = Test1.values
Test2_v = Test2.values
Test3_v = Test3.values
Test4_v = Test4.values
Test5_v = Test5.values

for i in range(len(Patients_v)): 
    ind_id = Patients_v[i, ID_idx]                     # ID_idx: index of the ID column in Patients
    ind_test1 = Test1_v[Test1_v[:, 0] == ind_id]
    ind_test2 = Test2_v[Test2_v[:, 0] == ind_id] 
    ind_test3 = Test3_v[Test3_v[:, 0] == ind_id] 
    ind_test4 = Test4_v[Test4_v[:, 0] == ind_id] 
    ind_test5 = Test5_v[Test5_v[:, 0] == ind_id]  

It takes about 0.9 seconds per iteration.

How can I speed this up?

Thank you

It is unclear what output you desire. We can only assume that you want patient-specific dataframes.

In any case, your current code has to hold all the dataframes in memory, which is inefficient. Look at, for example, generator functions:

1. Create a list of all IDs

ALL_IDS = Patients.ID.tolist()                         # Assuming all you need is the ID

2. Create a master dataframe

import pandas as pd

ALL_DFS = [Test1, Test2, Test3, Test4, Test5]
df_master = pd.concat(ALL_DFS)
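
One caveat: the five Test dataframes share the same columns, so after the concat you can no longer tell which test a row came from. If your processing needs the origin, the keys argument of pd.concat can preserve it as a column; a minimal sketch, where the 'TEST' column name is an assumption:

df_master = pd.concat(
    ALL_DFS,
    keys=['Test1', 'Test2', 'Test3', 'Test4', 'Test5'],  # label each source frame
    names=['TEST', None],                                # name the new outer index level
).reset_index(level='TEST')                              # move the label into a 'TEST' column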

3. Create a generator function that yields patient-specific dataframes for further processing

def patient_slices(ALL_IDS):                           # Generator
    for ID in ALL_IDS:
        df_slice = df_master[df_master.ID == ID]
        yield df_slice

df_slice = patient_slices(ALL_IDS)
for _ in range(len(ALL_IDS)):                          # Call the generator n times
    single_patient = next(df_slice)                    # Next patient on every call
    your_processing(single_patient)                    # Do your magic
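
As an aside: the generator above still performs a full boolean scan of df_master for every ID, so it mainly saves memory rather than time. If speed is the main concern, one common alternative is to group the master dataframe by ID once, which turns each per-patient lookup into roughly a hash lookup; a minimal sketch, with patient_slices_grouped as a hypothetical name:

groups = df_master.groupby('ID')                       # build the ID -> rows mapping once

def patient_slices_grouped(all_ids):                   # Generator, grouped variant
    for patient_id in all_ids:
        if patient_id in groups.groups:                # skip patients with no test rows
            yield groups.get_group(patient_id)

for single_patient in patient_slices_grouped(ALL_IDS):
    your_processing(single_patient)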
