Get a list of unique string values per column in a dataframe using Python

Question

Here I go with another question.

I have a large dataframe, about 20 columns by 400,000 rows. This dataset cannot contain strings, since the software that will process the data only accepts numeric values and nulls.

So the way I think it might work is the following:

1. Go through each column
2. Get the list of unique strings
3. Replace each string with a value from 0 to X
4. Repeat the process for the next column
5. Repeat for the next dataframe

This is what the dataframe looks like:

DATE        TIME    FNRHP306H   FNRHP306HC  FNRHP306_2MEC_MAX
7-Feb-15    0:00:00 NORMAL      NORMAL      1050
7-Feb-15    0:01:00 NORMAL      NORMAL      1050
7-Feb-15    0:02:00 NORMAL      HIGH        1050
7-Feb-15    0:03:00 HIGH        NORMAL      1050
7-Feb-15    0:04:00 LOW         NORMAL      1050
7-Feb-15    0:05:00 NORMAL      LOW         1050

This is the expected result:

DATE        TIME    FNRHP306H   FNRHP306HC  FNRHP306_2MEC_MAX
7-Feb-15    0:00:00 0           0           1050
7-Feb-15    0:01:00 0           0           1050
7-Feb-15    0:02:00 0           1           1050
7-Feb-15    0:03:00 1           0           1050
7-Feb-15    0:04:00 2           0           1050
7-Feb-15    0:05:00 0           2           1050

I am using Python 3.5 and the latest version of pandas.

Thanks in advance

JV

Answer 1

Solution:

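import pandas as pd

# try to convert all columns to numbers; columns that cannot be converted are left as-is
df = df.apply(lambda x: pd.to_numeric(x, errors='ignore'))

# string (object) columns whose name contains 'FNR'
cols = df.filter(like='FNR').select_dtypes(include=['object']).columns
# stack them into one long column so that all of them share one set of codes
st = df[cols].stack().to_frame('name')
# replace each distinct string with an integer code (0, 1, 2, ...)
st['cat'] = pd.factorize(st.name)[0]
# reshape the codes back to the original layout
df[cols] = st['cat'].unstack()

del st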

Demo:

In [233]: df
Out[233]:
       DATE     TIME FNRHP306H FNRHP306HC  FNRHP306_2MEC_MAX
0  7-Feb-15  0:00:00    NORMAL     NORMAL               1050
1  7-Feb-15  0:01:00    NORMAL     NORMAL               1050
2  7-Feb-15  0:02:00    NORMAL       HIGH               1050
3  7-Feb-15  0:03:00      HIGH     NORMAL               1050
4  7-Feb-15  0:04:00       LOW     NORMAL               1050
5  7-Feb-15  0:05:00    NORMAL        LOW               1050
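
To follow along, the sample frame shown above can be rebuilt by hand. This is only a sketch with the six rows from the question typed in as literals; the real data would of course be loaded from its source.

```python
import pandas as pd

# Sample data copied from the question (6 rows only).
df = pd.DataFrame({
    "DATE": ["7-Feb-15"] * 6,
    "TIME": ["0:00:00", "0:01:00", "0:02:00", "0:03:00", "0:04:00", "0:05:00"],
    "FNRHP306H": ["NORMAL", "NORMAL", "NORMAL", "HIGH", "LOW", "NORMAL"],
    "FNRHP306HC": ["NORMAL", "NORMAL", "HIGH", "NORMAL", "NORMAL", "LOW"],
    "FNRHP306_2MEC_MAX": [1050] * 6,
})
```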

First we stack all object (string) columns:

In [235]: cols = df.filter(like='FNR').select_dtypes(include=['object']).columns

In [236]: st = df[cols].stack().to_frame('name')

Now we can factorize the stacked column:

In [238]: st['cat'] = pd.factorize(st.name)[0]

In [239]: st
Out[239]:
                name  cat
0 FNRHP306H   NORMAL    0
  FNRHP306HC  NORMAL    0
1 FNRHP306H   NORMAL    0
  FNRHP306HC  NORMAL    0
2 FNRHP306H   NORMAL    0
  FNRHP306HC    HIGH    1
3 FNRHP306H     HIGH    1
  FNRHP306HC  NORMAL    0
4 FNRHP306H      LOW    2
  FNRHP306HC  NORMAL    0
5 FNRHP306H   NORMAL    0
  FNRHP306HC     LOW    2

Assign the unstacked result back to the original DF (to the object columns):

In [241]: df[cols] = st['cat'].unstack()

In [242]: df
Out[242]:
       DATE     TIME  FNRHP306H  FNRHP306HC  FNRHP306_2MEC_MAX
0  7-Feb-15  0:00:00          0           0               1050
1  7-Feb-15  0:01:00          0           0               1050
2  7-Feb-15  0:02:00          0           1               1050
3  7-Feb-15  0:03:00          1           0               1050
4  7-Feb-15  0:04:00          2           0               1050
5  7-Feb-15  0:05:00          0           2               1050
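
If the codes need to be translated back to the original strings later, the second return value of pd.factorize can be kept as a lookup table. A minimal sketch, assuming only the two string columns from the sample; the names code_to_label and encoded are made up for illustration.

```python
import pandas as pd

# Only the two string columns from the sample are used here.
df = pd.DataFrame({
    "FNRHP306H": ["NORMAL", "NORMAL", "NORMAL", "HIGH", "LOW", "NORMAL"],
    "FNRHP306HC": ["NORMAL", "NORMAL", "HIGH", "NORMAL", "NORMAL", "LOW"],
})

# Stack into one long Series so both columns share a single set of codes.
stacked = df.stack()

# factorize returns the integer codes and the distinct values,
# in order of first appearance.
codes, uniques = pd.factorize(stacked)

# Lookup table from integer code back to the original string.
code_to_label = dict(enumerate(uniques))

# Put the codes back into the original row/column layout.
encoded = pd.Series(codes, index=stacked.index).unstack()
```

Here code_to_label maps each integer back to its string, so the encoding stays reversible after the original columns are overwritten.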

Explanation:

In [248]: df.filter(like='FNR')
Out[248]:
  FNRHP306H FNRHP306HC  FNRHP306_2MEC_MAX
0    NORMAL     NORMAL               1050
1    NORMAL     NORMAL               1050
2    NORMAL       HIGH               1050
3      HIGH     NORMAL               1050
4       LOW     NORMAL               1050
5    NORMAL        LOW               1050

In [249]: df.filter(like='FNR').select_dtypes(include=['object'])
Out[249]:
  FNRHP306H FNRHP306HC
0    NORMAL     NORMAL
1    NORMAL     NORMAL
2    NORMAL       HIGH
3      HIGH     NORMAL
4       LOW     NORMAL
5    NORMAL        LOW
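
For comparison, the column-by-column steps listed in the question can be sketched as below. This is not the stacked approach used above: each column is factorized on its own, so the same string may receive different codes in different columns, and pd.factorize encodes missing values as -1. The dict name unique_strings is made up for illustration, and non-numeric columns are selected with exclude="number" instead of by object dtype.

```python
import pandas as pd

df = pd.DataFrame({
    "DATE": ["7-Feb-15"] * 6,
    "TIME": ["0:00:00", "0:01:00", "0:02:00", "0:03:00", "0:04:00", "0:05:00"],
    "FNRHP306H": ["NORMAL", "NORMAL", "NORMAL", "HIGH", "LOW", "NORMAL"],
    "FNRHP306HC": ["NORMAL", "NORMAL", "HIGH", "NORMAL", "NORMAL", "LOW"],
    "FNRHP306_2MEC_MAX": [1050] * 6,
})

# Non-numeric columns whose name contains 'FNR' (DATE and TIME are skipped).
text_cols = df.filter(like="FNR").select_dtypes(exclude="number").columns

unique_strings = {}
for col in text_cols:
    # codes: one integer per row; uniques: distinct strings in order of appearance
    codes, uniques = pd.factorize(df[col])
    unique_strings[col] = list(uniques)
    df[col] = codes
```

After the loop, unique_strings holds the list of unique strings for each column, which is what the title asks for, and position i in each list is the string that was replaced by code i.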

Solution 1
1 2016-09-22 20:49:17
