Split column in pandas of comma separated values but maintaining the order
I have the following column in a dataframe:
column_1
en-us,en-en
pr,en-us,en-en,br
ar-ar,pr,en-en
I want to split that column (this can be done with .str.split), but using .str.split I will get:
column_1 | column_2 | column_3 | column_4
en-us      en-en
pr         en-us      en-en      br
ar-ar      pr         en-en
And what I need is:
column_1 | column_2 | column_3 | column_4
en-us en-en
en-us en-en br pr
ar-ar en-en pr
Is there any automatic way of doing this?
IIUC, you can do this by passing a list of dictionaries to the default pd.DataFrame constructor. For example,
df = pd.DataFrame(s.str.split(',').transform(lambda x: {k:k for k in x}).tolist())
yields
   ar-ar   br  en-en  en-us   pr
0    NaN  NaN  en-en  en-us  NaN
1    NaN   br  en-en  en-us   pr
2  ar-ar  NaN  en-en    NaN   pr
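For anyone who wants to run this end to end, here is a minimal sketch. The series name `s` comes from the answer; the sample data is copied from the question. A list comprehension stands in for `transform` here (an assumption that the two are equivalent for this purpose), and the columns are sorted explicitly because the default column order of the constructor can differ between pandas versions.

```python
import pandas as pd

# Sample data copied from the question
s = pd.Series(
    ["en-us,en-en", "pr,en-us,en-en,br", "ar-ar,pr,en-en"],
    name="column_1",
)

# One dictionary per row, mapping each code to itself
records = [{code: code for code in codes} for codes in s.str.split(",")]

df = pd.DataFrame(records)

# Sort the columns so the layout does not depend on the pandas version
df = df[sorted(df.columns)]
print(df)
```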
Notice that it is trivial to reorder the data frame according to your needs, e.g.
>>> df[['en-en', 'en-us', 'br', 'pr']]
   en-en  en-us   br   pr
0  en-en  en-us  NaN  NaN
1  en-en  en-us   br   pr
2  en-en    NaN  NaN   pr
And if you want to have empty strings rather than NaNs, just use .fillna()
df[['en-en', 'en-us', 'br', 'pr']].fillna('')
   en-en  en-us  br  pr
0  en-en  en-us
1  en-en  en-us  br  pr
2  en-en             pr
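The reorder and fill steps can be sketched together as follows. The variable names `wanted` and `out` are made up for illustration, and `reindex(columns=...)` is used instead of plain `df[[...]]` indexing, on the assumption that you would rather get an empty column than a `KeyError` when one of the wanted codes never appears in the data.

```python
import pandas as pd

# Same frame as the one built from the question's data
df = pd.DataFrame([
    {"en-us": "en-us", "en-en": "en-en"},
    {"pr": "pr", "en-us": "en-us", "en-en": "en-en", "br": "br"},
    {"ar-ar": "ar-ar", "pr": "pr", "en-en": "en-en"},
])

# Hypothetical target order: every code gets a fixed column position
wanted = ["en-en", "en-us", "br", "pr"]

# Select and order the columns, then replace NaN with empty strings
out = df.reindex(columns=wanted).fillna("")
print(out)
```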
Explanation
Let's break down the following statement:
s.str.split(',').transform(lambda x: {k:k for k in x}).tolist()
First of all, s.str.split(',') does what you already know: it splits each string using , as the separator. This yields the following series
0 [en-us, en-en]
1 [pr, en-us, en-en, br]
2 [ar-ar, pr, en-en]
Name: col1, dtype: object
Now, we want to change each of these elements into a {key:value} structure. For that, we use transform, passing a function to it:
s.str.split(',').transform(function)
where function = lambda x: {k:k for k in x}. So basically we will run this function for the input [en-us, en-en], then for [pr, en-us, en-en, br], etc. The output of this function is
0 {'en-en': 'en-en', 'en-us': 'en-us'}
1 {'br': 'br', 'en-en': 'en-en', 'en-us': 'en-us...
2 {'en-en': 'en-en', 'ar-ar': 'ar-ar', 'pr': 'pr'}
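To see this step in isolation, here is a small sketch. It uses `Series.apply` as a stand-in for `transform` (an assumption on my part that both call the function once per element here), and the name `to_dict` is made up for illustration.

```python
import pandas as pd

s = pd.Series(["en-us,en-en", "pr,en-us,en-en,br"])

def to_dict(codes):
    # Map every code in the list to itself, e.g. ['a', 'b'] -> {'a': 'a', 'b': 'b'}
    return {code: code for code in codes}

# Split each string into a list, then turn each list into a dictionary
dicts = s.str.split(",").apply(to_dict)
print(dicts.tolist())
```

Mapping each code to itself looks redundant, but the key is what picks the column later on and the value is what ends up in the cell.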
Now, we just use tolist() to get a list of these values, and pass that to the pd.DataFrame() constructor. The constructor knows how to deal with lists of dictionaries pretty well, and it assigns values based on the keys of the dictionaries for each row. Whenever no key/value is found for a row, it just uses NaN.
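The constructor behaviour is easy to check on a toy input. The keys "a" and "b" below are made up and have nothing to do with the question's data; the point is only that each key becomes a column and missing keys become NaN.

```python
import pandas as pd

# Three rows, each with a different set of keys
rows = [{"a": 1}, {"b": 2}, {"a": 3, "b": 4}]

# Each dictionary becomes one row; keys are matched to columns by name
toy = pd.DataFrame(rows)
print(toy)
```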