How to quickly add a large list of values to the corresponding rows of a Python pandas DataFrame
I have a large CSV file with the following format (example); the report_date column is currently empty:
| ids | disease_code | report_date |
| --- | ------------ | ----------- |
| 10 | I202 | |
| 11 | I232 | |
| 11 | I242 | |
I generated a list of tuples from a data source, like the following:
[(10, ['I202'], 2021-10-22), (11, ['I232', 'I242'], 2021-11-22), (11, ['I232', 'I242'], 2021-11-12),.....]
Each tuple holds patient_id, disease_code and reported_date, in that order (the dates are in the same order as the diseases). For a patient who has more than one disease, the reported dates were unfortunately split across two tuples. Now I want to fill the report_date column by matching the first two values of each tuple against the current CSV, like this:
| ids | disease_code | report_date |
| --- | ------------ | ----------- |
| 10 | I202 | 2021-10-22 |
| 11 | I232 | 2021-11-22 |
| 11 | I242 | 2021-11-12 |
I tried to use a nested loop, but it looks like it would take 480 hours to complete. I believe there is a simpler answer, but I could not figure it out. Any hint would be appreciated.
First, you can create a DataFrame from your data. You'll see that the "disease_code" column contains a list of values, just as you mentioned:
>>> import pandas as pd
>>> df = pd.DataFrame(
...     [(10, ['I202'], "2021-10-22"), (11, ['I232', 'I242'], "2021-11-22"), (11, ['I232', 'I242'], "2021-11-12")],
...     columns=["ids", "disease_code", "report_date"],
... )
>>> df["report_date"] = pd.to_datetime(df["report_date"])
>>> df
   ids  disease_code report_date
0   10        [I202]  2021-10-22
1   11  [I232, I242]  2021-11-22
2   11  [I232, I242]  2021-11-12
Now you need to separate the values in the "disease_code" column, repeating the values of the other columns for each one. pd.DataFrame.explode does exactly that: it transforms each element of a list-like column into its own row:
>>> df.explode(["disease_code"])  # Explode the "disease_code" column
   ids disease_code report_date
0   10         I202  2021-10-22
1   11         I232  2021-11-22
1   11         I242  2021-11-22
2   11         I232  2021-11-12
2   11         I242  2021-11-12
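The exploded frame pairs every code of patient 11 with every date. If, as the question states, the dates come in the same order as the codes, one possible way to keep only the matching pairs is to number each patient's tuples and each code's position within its list, then keep the rows where the two numbers agree. The following is a minimal sketch under that assumption; the helper columns `row_no`, `tuple_no` and `code_no` are names made up for illustration.

```python
import pandas as pd

records = [
    (10, ["I202"], "2021-10-22"),
    (11, ["I232", "I242"], "2021-11-22"),
    (11, ["I232", "I242"], "2021-11-12"),
]
dates = pd.DataFrame(records, columns=["ids", "disease_code", "report_date"])
dates["report_date"] = pd.to_datetime(dates["report_date"])

# Hypothetical helper columns:
# row_no identifies the source tuple, tuple_no counts tuples per patient (0, 1, ...).
dates["row_no"] = range(len(dates))
dates["tuple_no"] = dates.groupby("ids").cumcount()

exploded = dates.explode("disease_code", ignore_index=True)
# code_no is the position of each code inside its original list.
exploded["code_no"] = exploded.groupby("row_no").cumcount()

# Assumption: the k-th tuple of a patient carries the date of the k-th code.
paired = exploded.loc[
    exploded["tuple_no"] == exploded["code_no"],
    ["ids", "disease_code", "report_date"],
].reset_index(drop=True)
print(paired)
```

If a patient has fewer tuples than codes, the unmatched codes simply drop out of `paired`, so it may be worth checking the row count afterwards.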
To build the new DataFrame, use a list comprehension:
import pandas as pd

L = [(10, ['I202'], '2021-10-22'),
     (11, ['I232', 'I242'], '2021-11-22'),
     (11, ['I232', 'I242'], '2021-11-12')]
# One output row per (tuple, code) pair
df1 = pd.DataFrame([(a, x, c) for a, b, c in L for x in b],
                   columns=["ids", "disease_code", "report_date"])
print (df1)
   ids disease_code report_date
0   10         I202  2021-10-22
1   11         I232  2021-11-22
2   11         I242  2021-11-22
3   11         I232  2021-11-12
4   11         I242  2021-11-12
Then use DataFrame.merge to join it to the original DataFrame df. Because df1 contains duplicate ids, disease_code pairs, remove them first:
print (df)
ids disease_code report_date
0 10 I202 NaN
1 11 I232 NaN
2 11 I242 NaN
print (df1.drop_duplicates(['ids','disease_code']))
ids disease_code report_date
0 10 I202 2021-10-22
1 11 I232 2021-11-22
2 11 I242 2021-11-22
df = (df.drop('report_date', axis=1)
        .merge(df1.drop_duplicates(['ids','disease_code']),
               on=['ids','disease_code']))
print (df)
ids disease_code report_date
0 10 I202 2021-10-22
1 11 I232 2021-11-22
2 11 I242 2021-11-22
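Two details of the merge above may matter on the real data. `merge` defaults to an inner join, so CSV rows without a matching tuple would disappear, and `drop_duplicates` keeps the first date per pair, which gives I242 the date 2021-11-22 whereas the expected table in the question shows 2021-11-12. The sketch below illustrates a left join against a lookup that already has one row per pair. The frames `csv_df` and `lookup` are stand-ins built by hand, the I999 row is invented to show the unmatched case, and the dates are taken from the question's expected table.

```python
import pandas as pd

# Stand-in for the CSV; the last row is a made-up code with no matching tuple.
csv_df = pd.DataFrame({
    "ids": [10, 11, 11, 12],
    "disease_code": ["I202", "I232", "I242", "I999"],
})

# Stand-in lookup with exactly one date per (ids, disease_code) pair.
lookup = pd.DataFrame({
    "ids": [10, 11, 11],
    "disease_code": ["I202", "I232", "I242"],
    "report_date": pd.to_datetime(["2021-10-22", "2021-11-22", "2021-11-12"]),
})

# how="left" keeps every CSV row; validate raises if lookup keys are not unique.
filled = csv_df.merge(
    lookup,
    on=["ids", "disease_code"],
    how="left",
    validate="many_to_one",
)
print(filled)
```

Rows without a match end up with a missing report_date, which makes them easy to find with `filled["report_date"].isna()`.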