
Python: Create pandas dataframe with columns based off of unique values in nested list

I have a nested list containing various regions for each sample. I would like to make a dataframe such that each row (sample) shows the presence or absence of the corresponding region (column). For example, the data might look like this:

region_list = [['North America'], ['North America', 'South America'], ['Asia'], ['North America', 'Asia', 'Australia']]

And the end dataframe would look something like this:

North America    South America     Asia     Australia
1                0                 0        0
1                1                 0        0
0                0                 1        0
1                0                 1        1

I think I could probably figure out a way using nested loops and appends, but is there a more pythonic way to do this? Perhaps with numpy.where?

pandas
str.get_dummies

pd.Series(region_list).str.join('|').str.get_dummies()

   Asia  Australia  North America  South America
0     0          0              1              0
1     0          0              1              1
2     1          0              0              0
3     1          1              1              0
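To make the one-liner above easier to follow, here is a minimal sketch that splits it into its two steps. The `region_list` data is taken from the question; the names `joined`, `dummies`, `wanted` and `result` are mine, chosen only for illustration. It assumes no region name contains the `|` separator, since `str.get_dummies` splits on it.

```python
import pandas as pd

region_list = [
    ['North America'],
    ['North America', 'South America'],
    ['Asia'],
    ['North America', 'Asia', 'Australia'],
]

# Step 1: one string per sample, with regions separated by '|'
# e.g. the second sample becomes 'North America|South America'
joined = pd.Series(region_list).str.join('|')

# Step 2: split each string on '|' and build one 0/1 column per label
# (columns come back in sorted order)
dummies = joined.str.get_dummies(sep='|')

# Optional: reorder the columns to match the layout in the question
wanted = ['North America', 'South America', 'Asia', 'Australia']
result = dummies[wanted]
print(result)
```

If a label could contain `|`, picking another separator for both `str.join` and `sep=` should work the same way.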

numpy
np.bincount with pd.factorize

n = len(region_list)
i = np.arange(n).repeat([len(x) for x in region_list])
f, u = pd.factorize(np.concatenate(region_list))
m = u.size

pd.DataFrame(
    np.bincount(i * m + f, minlength=n * m).reshape(n, m),
    columns=u
)

   North America  South America  Asia  Australia
0              1              0     0          0
1              1              1     0          0
2              0              0     1          0
3              1              0     1          1
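The index arithmetic in the bincount approach can be hard to read at first, so here is a sketch of the same idea with the intermediate arrays spelled out. The variable names (`rows`, `codes`, `uniques`, `flat_pos`, `indicator`) are mine, and the final `> 0` clip is an addition of this sketch, not part of the answer above: `np.bincount` counts occurrences, so a sample that lists the same region twice would otherwise show 2 instead of 1.

```python
import numpy as np
import pandas as pd

region_list = [
    ['North America'],
    ['North America', 'South America'],
    ['Asia'],
    ['North America', 'Asia', 'Australia'],
]

n = len(region_list)

# Sample (row) number of each flattened entry: [0, 1, 1, 2, 3, 3, 3]
rows = np.arange(n).repeat([len(x) for x in region_list])

# Integer code per entry, plus the unique labels in first-seen order
codes, uniques = pd.factorize(np.concatenate(region_list))
m = uniques.size

# Each (row, code) pair is one cell of an n-by-m grid stored row-major,
# so its flat position is row * m + code
flat_pos = rows * m + codes

# Count hits per cell, then reshape back into the n-by-m grid
counts = np.bincount(flat_pos, minlength=n * m).reshape(n, m)

# Clip counts to 0/1 in case a sample repeats a region
indicator = pd.DataFrame((counts > 0).astype(int), columns=uniques)
print(indicator)
```

Because `pd.factorize` keeps first-seen order, the columns come out in the same order as in the question.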

Timing

%timeit pd.Series(region_list).str.join('|').str.get_dummies()
1000 loops, best of 3: 1.42 ms per loop

%%timeit
n = len(region_list)
i = np.arange(n).repeat([len(x) for x in region_list])
f, u = pd.factorize(np.concatenate(region_list))
m = u.size

pd.DataFrame(
    np.bincount(i * m + f, minlength=n * m).reshape(n, m),
    columns=u
)
1000 loops, best of 3: 204 µs per loop

Let's try:

df = pd.DataFrame(region_list)

df2 = df.stack().reset_index(name='region')

df_out = pd.get_dummies(df2.set_index('level_0')['region']).groupby(level=0).sum().rename_axis(None)

print(df_out)

Output:

         Asia  Australia  North America  South America                                               
0           0          0              1              0
1           0          0              1              1
2           1          0              0              0
3           1          1              1              0
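A similar reshape can probably be written with `Series.explode` and `pd.crosstab` instead of `stack` and `get_dummies`. This is a swapped-in technique, not the method used in the answer above, and the names `s` and `out` are mine. It assumes every sample has at least one region (an empty list would explode to NaN and that sample would drop out of the table).

```python
import pandas as pd

region_list = [
    ['North America'],
    ['North America', 'South America'],
    ['Asia'],
    ['North America', 'Asia', 'Australia'],
]

# One row per (sample, region) pair; the index repeats the sample number
s = pd.Series(region_list).explode()

# Cross-tabulate sample number against region name.
# Plain arrays are passed so that no index alignment is involved.
out = pd.crosstab(s.index.to_numpy(), s.to_numpy())

# Drop the automatic 'row_0' / 'col_0' axis names for a cleaner printout
out = out.rename_axis(index=None, columns=None)
print(out)
```

Like the `groupby(...).sum()` version, this counts occurrences, so a region repeated within one sample would show up as 2.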

This one-hot encodes every region entry after flattening the list, so it gives one row per entry rather than one row per sample:

import pandas as pd
import itertools
pd.get_dummies(pd.DataFrame(list(itertools.chain(*region_list))))

Output
       0_Asia  0_Australia  0_North America  0_South America
    0       0            0                1                0
    1       0            0                1                0
    2       0            0                0                1
    3       1            0                0                0
    4       0            0                1                0
    5       1            0                0                0
    6       0            1                0                0
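Note that the output above has seven rows, one per flattened region entry, while the question asks for one row per sample. A minimal sketch of one way to collapse it back, assuming the flattened values are labelled with their sample number first (`sample_ids`, `flat` and `per_sample` are names I made up for this example):

```python
import itertools

import numpy as np
import pandas as pd

region_list = [
    ['North America'],
    ['North America', 'South America'],
    ['Asia'],
    ['North America', 'Asia', 'Australia'],
]

# Sample number for each flattened entry: [0, 1, 1, 2, 3, 3, 3]
sample_ids = np.repeat(np.arange(len(region_list)),
                       [len(r) for r in region_list])

# Flatten the nested list, keeping the sample number as the index
flat = pd.Series(list(itertools.chain.from_iterable(region_list)),
                 index=sample_ids)

# One-hot encode each entry, then merge rows that share a sample number;
# max() keeps the result at 0/1 even if a region is repeated in a sample
per_sample = pd.get_dummies(flat).groupby(level=0).max().astype(int)
print(per_sample)
```

Passing a Series rather than a one-column DataFrame to `pd.get_dummies` also avoids the `0_` prefix on the column names.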

You can use chain.from_iterable from the itertools module and a list comprehension:

import pandas as pd
from itertools import chain

region_list = [['North America'], ['North America', 'South America'], ['Asia'], ['North America', 'Asia', 'Australia']]

regions = list(set(chain.from_iterable(region_list)))
vals = [[1 if j in k else 0 for j in regions] for k in region_list]
df = pd.DataFrame(vals, columns=regions)
print(df)

Output:

   Australia  Asia  North America  South America
0          0     0              1              0
1          0     0              1              1
2          0     1              0              0
3          1     1              1              0
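One caveat with `set`: its iteration order is arbitrary, so the column order can change between runs. A small variation, sketched here under the assumption that first-seen order is what you want, swaps `set` for `dict.fromkeys`, which removes duplicates while preserving insertion order. The `members` helper list is my addition to make the membership test a set lookup.

```python
from itertools import chain

import pandas as pd

region_list = [
    ['North America'],
    ['North America', 'South America'],
    ['Asia'],
    ['North America', 'Asia', 'Australia'],
]

# Unique regions in first-seen order (dict keys keep insertion order)
regions = list(dict.fromkeys(chain.from_iterable(region_list)))

# Convert each sample to a set once so the membership checks are cheap
members = [set(k) for k in region_list]

# 1 if the region appears in the sample, else 0
vals = [[int(r in s) for r in regions] for s in members]

df = pd.DataFrame(vals, columns=regions)
print(df)
```

With the data from the question this gives the columns in the order North America, South America, Asia, Australia.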

