Get list of unique string values per column in a dataframe using Python

Here I go with another question.

I have a large dataframe, about 20 columns by 400,000 rows. This dataset cannot contain strings, since the software that will process the data only accepts numeric values and nulls.

So the way I think it might work is the following:

1. Go through each column
2. Get the list of unique strings
3. Replace each string with a value from 0 to X
4. Repeat the process for the next column
5. Repeat for the next dataframe

This is what the dataframe looks like:

DATE        TIME    FNRHP306H   FNRHP306HC  FNRHP306_2MEC_MAX
7-Feb-15    0:00:00 NORMAL      NORMAL      1050
7-Feb-15    0:01:00 NORMAL      NORMAL      1050
7-Feb-15    0:02:00 NORMAL      HIGH        1050
7-Feb-15    0:03:00 HIGH        NORMAL      1050
7-Feb-15    0:04:00 LOW         NORMAL      1050
7-Feb-15    0:05:00 NORMAL      LOW         1050

This is the expected result:

DATE        TIME    FNRHP306H   FNRHP306HC  FNRHP306_2MEC_MAX
7-Feb-15    0:00:00 0           0           1050
7-Feb-15    0:01:00 0           0           1050
7-Feb-15    0:02:00 0           1           1050
7-Feb-15    0:03:00 1           0           1050
7-Feb-15    0:04:00 2           0           1050
7-Feb-15    0:05:00 0           2           1050

I am using Python 3.5 and the latest version of pandas.

Thanks in advance,

JV

Solution:

# try to convert all columns to numbers...
df = df.apply(lambda x: pd.to_numeric(x, errors='ignore'))

cols = df.filter(like='FNR').select_dtypes(include=['object']).columns
st = df[cols].stack().to_frame('name')
st['cat'] = pd.factorize(st.name)[0]
df[cols] = st['cat'].unstack()

del st
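
A note on the first step: in recent pandas releases `errors='ignore'` in `pd.to_numeric` is deprecated, so that line may warn or fail depending on your version. A minimal sketch of an equivalent conversion that does not rely on it is below. The helper name `to_numeric_where_possible` and the small sample frame are made up for illustration; they are not part of the original answer.

```python
import pandas as pd


def to_numeric_where_possible(frame):
    """Return a copy in which every column that parses entirely as numbers is converted."""
    out = frame.copy()
    for col in out.columns:
        try:
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            # The column holds values that are not numbers (e.g. 'NORMAL'); keep it as is.
            pass
    return out


# Hypothetical sample: one label column and one numeric column stored as text.
sample = pd.DataFrame({
    "FNRHP306H": ["NORMAL", "HIGH", "LOW"],
    "FNRHP306_2MEC_MAX": ["1050", "1050", "1050"],
})
converted = to_numeric_where_possible(sample)
```

Columns that fail to parse are left untouched, which is the behaviour the original `errors='ignore'` call was used for.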

Demo:

In [233]: df
Out[233]:
       DATE     TIME FNRHP306H FNRHP306HC  FNRHP306_2MEC_MAX
0  7-Feb-15  0:00:00    NORMAL     NORMAL               1050
1  7-Feb-15  0:01:00    NORMAL     NORMAL               1050
2  7-Feb-15  0:02:00    NORMAL       HIGH               1050
3  7-Feb-15  0:03:00      HIGH     NORMAL               1050
4  7-Feb-15  0:04:00       LOW     NORMAL               1050
5  7-Feb-15  0:05:00    NORMAL        LOW               1050

First we stack all object (string) columns:

In [235]: cols = df.filter(like='FNR').select_dtypes(include=['object']).columns

In [236]: st = df[cols].stack().to_frame('name')

Now we can factorize the stacked column:

In [238]: st['cat'] = pd.factorize(st.name)[0]

In [239]: st
Out[239]:
                name  cat
0 FNRHP306H   NORMAL    0
  FNRHP306HC  NORMAL    0
1 FNRHP306H   NORMAL    0
  FNRHP306HC  NORMAL    0
2 FNRHP306H   NORMAL    0
  FNRHP306HC    HIGH    1
3 FNRHP306H     HIGH    1
  FNRHP306HC  NORMAL    0
4 FNRHP306H      LOW    2
  FNRHP306HC  NORMAL    0
5 FNRHP306H   NORMAL    0
  FNRHP306HC     LOW    2
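
The demo above keeps only the codes (`[0]`), but `pd.factorize` also returns the unique labels, which is the "list of unique strings" the question asks about. A small sketch on a made-up Series (the values and the `lookup` name are only for illustration):

```python
import pandas as pd

# Hypothetical values, including one missing entry.
values = pd.Series(["NORMAL", "NORMAL", "HIGH", "LOW", None, "NORMAL"])

# codes: one integer per row, numbered in order of first appearance.
# uniques: the distinct labels; the position of a label is its code.
codes, uniques = pd.factorize(values)

# Code -> label table, handy for decoding the numbers later.
lookup = dict(enumerate(uniques))
```

Note that missing values get the code -1 by default, so if the downstream software expects nulls you would probably want to turn -1 back into NaN afterwards.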

Assign the unstacked result back to the original DF (to the object columns):

In [241]: df[cols] = st['cat'].unstack()

In [242]: df
Out[242]:
       DATE     TIME  FNRHP306H  FNRHP306HC  FNRHP306_2MEC_MAX
0  7-Feb-15  0:00:00          0           0               1050
1  7-Feb-15  0:01:00          0           0               1050
2  7-Feb-15  0:02:00          0           1               1050
3  7-Feb-15  0:03:00          1           0               1050
4  7-Feb-15  0:04:00          2           0               1050
5  7-Feb-15  0:05:00          0           2               1050
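
Keep in mind that the stacked approach shares one code table across all selected columns. If you would rather number each column independently, as in the steps listed in the question, a per-column loop is one way to sketch it. The `mappings` dict is an addition for illustration, and the columns are picked with `is_numeric_dtype` here (my choice, not the answer's `select_dtypes` call) so the sketch does not depend on how your pandas version stores strings.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "DATE": ["7-Feb-15"] * 6,
    "TIME": ["0:00:00", "0:01:00", "0:02:00", "0:03:00", "0:04:00", "0:05:00"],
    "FNRHP306H": ["NORMAL", "NORMAL", "NORMAL", "HIGH", "LOW", "NORMAL"],
    "FNRHP306HC": ["NORMAL", "NORMAL", "HIGH", "NORMAL", "NORMAL", "LOW"],
    "FNRHP306_2MEC_MAX": [1050] * 6,
})

# Non-numeric columns whose name contains 'FNR'.
cols = [c for c in df.filter(like="FNR").columns
        if not pd.api.types.is_numeric_dtype(df[c])]

mappings = {}  # column name -> {code: original string}
for col in cols:
    codes, uniques = pd.factorize(df[col])
    df[col] = codes                           # replace the strings with 0..X
    mappings[col] = dict(enumerate(uniques))  # remember what each code meant
```

For this sample the result is the same as the stacked version, because both columns meet NORMAL, HIGH and LOW in the same order. With other data the per-column codes can differ between columns, which is why the mapping is stored per column.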

Explanation:

In [248]: df.filter(like='FNR')
Out[248]:
  FNRHP306H FNRHP306HC  FNRHP306_2MEC_MAX
0    NORMAL     NORMAL               1050
1    NORMAL     NORMAL               1050
2    NORMAL       HIGH               1050
3      HIGH     NORMAL               1050
4       LOW     NORMAL               1050
5    NORMAL        LOW               1050

In [249]: df.filter(like='FNR').select_dtypes(include=['object'])
Out[249]:
  FNRHP306H FNRHP306HC
0    NORMAL     NORMAL
1    NORMAL     NORMAL
2    NORMAL       HIGH
3      HIGH     NORMAL
4       LOW     NORMAL
5    NORMAL        LOW
