Python: Create pandas dataframe with columns based on unique values in a nested list
I have a nested list containing various regions for each sample. I would like to make a dataframe such that each row (sample) records the presence or absence of the corresponding region (column). For example, the data might look like this:
region_list = [['North America'], ['North America', 'South America'], ['Asia'], ['North America', 'Asia', 'Australia']]
And the resulting dataframe would look something like this:
North America  South America  Asia  Australia
            1              0     0          0
            1              1     0          0
            0              0     1          0
            1              0     1          1
I think I could probably figure out a way using nested loops and appends, but is there a more Pythonic way to do this? Perhaps with numpy.where?
Using pandas str.get_dummies:
pd.Series(region_list).str.join('|').str.get_dummies()
   Asia  Australia  North America  South America
0     0          0              1              0
1     0          0              1              1
2     1          0              0              0
3     1          1              1              0
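Note that str.get_dummies returns the columns in sorted order, not in the order of the question's expected output. The following is a minimal sketch of one way to restore first-appearance order; the names joined, dummies and first_seen are illustrative only, and it assumes that no region name contains the '|' separator.

```python
import pandas as pd

region_list = [['North America'], ['North America', 'South America'],
               ['Asia'], ['North America', 'Asia', 'Australia']]

# Join each sample's regions with '|', the default separator of str.get_dummies.
# Assumption: no region name contains '|' itself.
joined = pd.Series(region_list).str.join('|')
dummies = joined.str.get_dummies()

# Columns come back sorted alphabetically; reorder them by first appearance.
first_seen = list(dict.fromkeys(r for sample in region_list for r in sample))
dummies = dummies[first_seen]
print(dummies)
```

If a region name could contain '|', a different separator can be passed to both str.join and str.get_dummies(sep=...).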
Using numpy np.bincount with pd.factorize:
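import numpy as np
import pandas as pd

n = len(region_list)
# row number of every individual region entry
i = np.arange(n).repeat([len(x) for x in region_list])
# integer code per entry, plus the unique regions in order of appearance
f, u = pd.factorize(np.concatenate(region_list))
m = u.size
# count entries per flat (row, column) cell, then reshape to n x m
pd.DataFrame(
    np.bincount(i * m + f, minlength=n * m).reshape(n, m),
    columns=u
)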
   North America  South America  Asia  Australia
0              1              0     0          0
1              1              1     0          0
2              0              0     1          0
3              1              0     1          1
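To make the index arithmetic easier to follow, here is the same idea wrapped in a function, with the intermediate arrays described in comments. This is only a sketch: presence_matrix is a hypothetical helper name, and it assumes every sample contains at least one region and no missing values.

```python
import numpy as np
import pandas as pd

def presence_matrix(samples):
    """Hypothetical helper: one row per sample, one column per unique label."""
    n = len(samples)
    # Row number of every individual label; for the example data
    # this is [0, 1, 1, 2, 3, 3, 3].
    rows = np.arange(n).repeat([len(s) for s in samples])
    # Integer code per label, plus the unique labels in first-appearance order.
    codes, uniques = pd.factorize(np.concatenate(samples))
    m = len(uniques)
    # Each (row, column) pair maps to the flat cell index row * m + column,
    # so bincount over those indices fills an n x m grid.
    counts = np.bincount(rows * m + codes, minlength=n * m).reshape(n, m)
    return pd.DataFrame(counts, columns=uniques)

region_list = [['North America'], ['North America', 'South America'],
               ['Asia'], ['North America', 'Asia', 'Australia']]
result = presence_matrix(region_list)
print(result)
```

Because np.bincount counts occurrences, a region repeated within one sample would produce a value greater than 1; clipping the result (for example with np.minimum(counts, 1)) would give strict presence/absence.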
Timing
%timeit pd.Series(region_list).str.join('|').str.get_dummies()
1000 loops, best of 3: 1.42 ms per loop
%%timeit
n = len(region_list)
i = np.arange(n).repeat([len(x) for x in region_list])
f, u = pd.factorize(np.concatenate(region_list))
m = u.size
pd.DataFrame(
    np.bincount(i * m + f, minlength=n * m).reshape(n, m),
    columns=u
)
1000 loops, best of 3: 204 µs per loop
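The %timeit and %%timeit magics only work in IPython or Jupyter. Outside of those, a comparison along these lines could be run with the standard timeit module. This is a rough sketch: the function names via_get_dummies and via_bincount are made up for illustration, and the numbers will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

region_list = [['North America'], ['North America', 'South America'],
               ['Asia'], ['North America', 'Asia', 'Australia']]

def via_get_dummies(samples):
    return pd.Series(samples).str.join('|').str.get_dummies()

def via_bincount(samples):
    n = len(samples)
    i = np.arange(n).repeat([len(x) for x in samples])
    f, u = pd.factorize(np.concatenate(samples))
    m = u.size
    return pd.DataFrame(
        np.bincount(i * m + f, minlength=n * m).reshape(n, m),
        columns=u
    )

# Sanity check: both approaches agree once the columns are aligned.
a = via_get_dummies(region_list)
b = via_bincount(region_list)[a.columns]
assert (a.values == b.values).all()

repeats = 200
for fn in (via_get_dummies, via_bincount):
    seconds = timeit.timeit(lambda: fn(region_list), number=repeats)
    print(f"{fn.__name__}: {seconds / repeats * 1e6:.0f} µs per call")
```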
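Let's try:
import pandas as pd

# one row per sample, padded with None where a sample has fewer regions
df = pd.DataFrame(region_list)
# long format: one row per (sample, region) pair
df2 = df.stack().reset_index(name='region')
# one-hot encode the regions, then add the rows up per sample
df_out = pd.get_dummies(df2.set_index('level_0')['region']).groupby(level=0).sum().rename_axis(None)
print(df_out)
Output: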
   Asia  Australia  North America  South America
0     0          0              1              0
1     0          0              1              1
2     1          0              0              0
3     1          1              1              0
This one-hot encodes the flattened list. Note that it returns one row per list entry rather than one row per sample:
import pandas as pd
import itertools
pd.get_dummies(pd.DataFrame(list(itertools.chain(*region_list))))
Output:
   0_Asia  0_Australia  0_North America  0_South America
0       0            0                1                0
1       0            0                1                0
2       0            0                0                1
3       1            0                0                0
4       0            0                1                0
5       1            0                0                0
6       0            1                0                0
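The output above has seven rows, one per flattened entry, whereas the question asks for one row per sample. One possible way to get there from the flattened list is to carry the sample number along as the index and then collapse by it. This is a sketch, not part of the original answer; sample_ids, per_item and per_sample are illustrative names.

```python
import itertools

import pandas as pd

region_list = [['North America'], ['North America', 'South America'],
               ['Asia'], ['North America', 'Asia', 'Australia']]

flat = list(itertools.chain.from_iterable(region_list))
# Sample number for each flattened entry: [0, 1, 1, 2, 3, 3, 3]
sample_ids = [i for i, sample in enumerate(region_list) for _ in sample]

# One-hot encode with the sample number as the index (one row per entry).
per_item = pd.get_dummies(pd.Series(flat, index=sample_ids))
# Collapse back to one row per sample; max() marks presence even if repeated.
per_sample = per_item.groupby(level=0).max().astype(int)
print(per_sample)
```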
You can use chain.from_iterable from the itertools module and a list comprehension:
import pandas as pd
from itertools import chain
region_list = [['North America'], ['North America', 'South America'], ['Asia'], ['North America', 'Asia', 'Australia']]
regions = list(set(chain.from_iterable(region_list)))
vals = [[1 if j in k else 0 for j in regions] for k in region_list]
df = pd.DataFrame(vals, columns=regions)
print(df)
Output:
   Australia  Asia  North America  South America
0          0     0              1              0
1          0     0              1              1
2          0     1              0              0
3          1     1              1              0
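Because set() has no defined ordering, the column order of this approach can differ between runs. A small variation, shown here as a sketch rather than as the original answer's method, uses dict.fromkeys to keep the regions in order of first appearance and converts each sample to a set once for the membership tests.

```python
from itertools import chain

import pandas as pd

region_list = [['North America'], ['North America', 'South America'],
               ['Asia'], ['North America', 'Asia', 'Australia']]

# dict.fromkeys removes duplicates but keeps first-appearance order, unlike set().
regions = list(dict.fromkeys(chain.from_iterable(region_list)))
# Turn each sample into a set once so every "in" check is a hash lookup.
vals = [[int(r in present) for r in regions] for present in map(set, region_list)]
df = pd.DataFrame(vals, columns=regions)
print(df)
```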