
Filter a CSV file with pandas

I have a CSV file in which each row holds data about a particular patient, and a single patient can have multiple rows associated with him or her.

The file contains thousands of patient records. I want to randomly select 100 patients from it, get all the records associated with them, and save those records to another CSV file.

So the file could look like this, for example:

patient_id   Date          Diagnosis   Comments
001-001      23.12.2008    Normal      Normal
001-001      23.12.2009    Normal      Normal
001-002      08.11.2007    Normal      Normal
001-003
....

I can load the file and pick the patients like this:

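import numpy as np
import pandas as pd

frame = pd.read_csv('file.csv')
# Get the unique subjects
unique_subjects = frame['patient_id'].unique()
# Use numpy to randomly select some patients
# (replace=False so that the 100 selected patients are all distinct)
random_us = np.random.choice(unique_subjects, 100, replace=False)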

After that I could load the CSV again, check it row by row, and select which rows to write to the new CSV file.

I have a feeling pandas might provide something more direct, and I wonder whether there is a way to chain all these operations with it.

You can use isin to filter for the IDs you need:

random_records = frame[frame['patient_id'].isin(random_us)]
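
For reference, the whole pipeline from the question (pick patients, keep all their rows, save the result) could be sketched roughly as below. This is only one possible arrangement, not the only way to do it: it swaps np.random.choice for pandas' own Series.sample, which samples without replacement by default, and the function name sample_patient_records, the seed argument, and the output filename sampled_patients.csv are hypothetical names chosen for illustration.

```python
import pandas as pd


def sample_patient_records(frame, n_patients, id_column="patient_id", seed=None):
    """Return every row that belongs to n_patients randomly chosen patients.

    The function name and the seed argument are illustrative assumptions.
    """
    # One entry per patient, then draw n_patients of them.
    # Series.sample draws without replacement by default, so the IDs are distinct.
    chosen_ids = frame[id_column].drop_duplicates().sample(
        n=n_patients, random_state=seed
    )
    # Boolean mask: True for each row whose ID is among the chosen ones.
    return frame[frame[id_column].isin(chosen_ids)]


# Possible usage with the file from the question (output name is hypothetical):
# frame = pd.read_csv("file.csv")
# subset = sample_patient_records(frame, 100, seed=0)
# subset.to_csv("sampled_patients.csv", index=False)
```

Passing index=False to to_csv keeps the pandas row index out of the output file, so the new CSV has the same columns as the original. Note that sample raises a ValueError if the file holds fewer distinct patients than requested.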
