Get list of unique string values per column in a dataframe using Python
Here I go with another question.
I have a large dataframe, about 20 columns by 400,000 rows. This dataset cannot contain strings, since the software that will process the data only accepts numeric values and nulls.
So the way I think it might work is the following:
1. Go through each column
2. Get the list of unique strings
3. Replace each string with a value from 0 to X
4. Repeat the process for the next column
5. Repeat for the next dataframe
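The steps above can be sketched as a plain per-column loop. This is only a minimal illustration on a small sample frame, not the accepted answer's method: the `mappings` dict is a hypothetical helper added here so the code-to-string mapping is not lost, and the sketch assumes every non-numeric column holds strings.

```python
import pandas as pd

# Small sample frame shaped like the one in the question
df = pd.DataFrame({
    "FNRHP306H": ["NORMAL", "NORMAL", "NORMAL", "HIGH", "LOW", "NORMAL"],
    "FNRHP306HC": ["NORMAL", "NORMAL", "HIGH", "NORMAL", "NORMAL", "LOW"],
    "FNRHP306_2MEC_MAX": [1050] * 6,
})

mappings = {}  # hypothetical helper: column name -> unique strings, in code order

for col in df.columns:
    # Step 1: go through each column, skipping the ones that are already numeric
    if pd.api.types.is_numeric_dtype(df[col]):
        continue
    # Steps 2 and 3: get the unique strings and replace each one with 0..X
    codes, uniques = pd.factorize(df[col])
    df[col] = codes
    mappings[col] = list(uniques)

print(df)
print(mappings)
```

Note that `pd.factorize` labels missing values as -1 by default, so if the data contains nulls that the downstream software should still see as nulls, those codes would need to be converted back.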
This is what the dataframe looks like:
DATE TIME FNRHP306H FNRHP306HC FNRHP306_2MEC_MAX
7-Feb-15 0:00:00 NORMAL NORMAL 1050
7-Feb-15 0:01:00 NORMAL NORMAL 1050
7-Feb-15 0:02:00 NORMAL HIGH 1050
7-Feb-15 0:03:00 HIGH NORMAL 1050
7-Feb-15 0:04:00 LOW NORMAL 1050
7-Feb-15 0:05:00 NORMAL LOW 1050
This is the expected result:
DATE TIME FNRHP306H FNRHP306HC FNRHP306_2MEC_MAX
7-Feb-15 0:00:00 0 0 1050
7-Feb-15 0:01:00 0 0 1050
7-Feb-15 0:02:00 0 1 1050
7-Feb-15 0:03:00 1 0 1050
7-Feb-15 0:04:00 2 0 1050
7-Feb-15 0:05:00 0 2 1050
I am using Python 3.5 and the latest version of pandas.
Thanks in advance
JV
Solution:
# try to convert all columns to numbers...
df = df.apply(lambda x: pd.to_numeric(x, errors='ignore'))
cols = df.filter(like='FNR').select_dtypes(include=['object']).columns
st = df[cols].stack().to_frame('name')
st['cat'] = pd.factorize(st.name)[0]
df[cols] = st['cat'].unstack()
del st
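One caveat on the first line of the solution: `errors='ignore'` in `pd.to_numeric` is deprecated in newer pandas releases (2.2 and later), where the documentation suggests catching the exception explicitly instead. A rough equivalent might look like the sketch below; `to_numeric_if_possible` is a hypothetical helper name, not part of pandas, and the sample frame is made up for illustration.

```python
import pandas as pd

def to_numeric_if_possible(col):
    """Hypothetical helper: convert a column to numbers, or return it unchanged."""
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        # The column contains values that cannot be parsed as numbers
        return col

# Sample frame: one real string column, one numeric column stored as text
df = pd.DataFrame({
    "FNRHP306H": ["NORMAL", "HIGH", "LOW"],
    "FNRHP306_2MEC_MAX": ["1050", "1050", "1050"],
})

# Try to convert every column; string columns come back untouched
df = df.apply(to_numeric_if_possible)
print(df.dtypes)
```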
Demo:
In [233]: df
Out[233]:
DATE TIME FNRHP306H FNRHP306HC FNRHP306_2MEC_MAX
0 7-Feb-15 0:00:00 NORMAL NORMAL 1050
1 7-Feb-15 0:01:00 NORMAL NORMAL 1050
2 7-Feb-15 0:02:00 NORMAL HIGH 1050
3 7-Feb-15 0:03:00 HIGH NORMAL 1050
4 7-Feb-15 0:04:00 LOW NORMAL 1050
5 7-Feb-15 0:05:00 NORMAL LOW 1050
First we stack all object (string) columns:
In [235]: cols = df.filter(like='FNR').select_dtypes(include=['object']).columns
In [236]: st = df[cols].stack().to_frame('name')
Now we can factorize the stacked column:
In [238]: st['cat'] = pd.factorize(st.name)[0]
In [239]: st
Out[239]:
name cat
0 FNRHP306H NORMAL 0
FNRHP306HC NORMAL 0
1 FNRHP306H NORMAL 0
FNRHP306HC NORMAL 0
2 FNRHP306H NORMAL 0
FNRHP306HC HIGH 1
3 FNRHP306H HIGH 1
FNRHP306HC NORMAL 0
4 FNRHP306H LOW 2
FNRHP306HC NORMAL 0
5 FNRHP306H NORMAL 0
FNRHP306HC LOW 2
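Because all string columns are factorized together here, the same string gets the same code in every column. `pd.factorize` also returns the unique values alongside the codes, which is handy if the numbers ever need to be translated back. A small sketch on a made-up Series (the names `code_to_label` and `decoded` are just illustrative):

```python
import pandas as pd

values = pd.Series(["NORMAL", "NORMAL", "HIGH", "LOW", "NORMAL"])

# codes number each value by order of first appearance;
# uniques holds the distinct values in that same order
codes, uniques = pd.factorize(values)

# Lookup table from integer code back to the original string
code_to_label = dict(enumerate(uniques))

# Rebuild the original strings from the codes
decoded = uniques.take(codes)

print(codes)
print(code_to_label)
```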
Assign the unstacked result back to the original DF (to the object columns):
In [241]: df[cols] = st['cat'].unstack()
In [242]: df
Out[242]:
DATE TIME FNRHP306H FNRHP306HC FNRHP306_2MEC_MAX
0 7-Feb-15 0:00:00 0 0 1050
1 7-Feb-15 0:01:00 0 0 1050
2 7-Feb-15 0:02:00 0 1 1050
3 7-Feb-15 0:03:00 1 0 1050
4 7-Feb-15 0:04:00 2 0 1050
5 7-Feb-15 0:05:00 0 2 1050
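For what it's worth, the stack / factorize / unstack round trip can probably also be written directly against the underlying array. This is an alternative sketch, not what the answer above does: it flattens the selected columns row by row (the same order `stack()` produces in the demo), factorizes once, and reshapes the codes back. The column list is hard-coded here as an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "FNRHP306H": ["NORMAL", "NORMAL", "NORMAL", "HIGH", "LOW", "NORMAL"],
    "FNRHP306HC": ["NORMAL", "NORMAL", "HIGH", "NORMAL", "NORMAL", "LOW"],
    "FNRHP306_2MEC_MAX": [1050] * 6,
})

cols = ["FNRHP306H", "FNRHP306HC"]  # assumed: the string columns to encode

# Flatten the selected columns row by row into one 1-D array
flat = df[cols].to_numpy().ravel()

# One shared factorization, so a string maps to the same code in every column
codes, uniques = pd.factorize(flat)

# Reshape the codes back to (rows, columns) and write them into the frame
df[cols] = codes.reshape(len(df), len(cols))
print(df)
```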
Explanation:
In [248]: df.filter(like='FNR')
Out[248]:
FNRHP306H FNRHP306HC FNRHP306_2MEC_MAX
0 NORMAL NORMAL 1050
1 NORMAL NORMAL 1050
2 NORMAL HIGH 1050
3 HIGH NORMAL 1050
4 LOW NORMAL 1050
5 NORMAL LOW 1050
In [249]: df.filter(like='FNR').select_dtypes(include=['object'])
Out[249]:
FNRHP306H FNRHP306HC
0 NORMAL NORMAL
1 NORMAL NORMAL
2 NORMAL HIGH
3 HIGH NORMAL
4 LOW NORMAL
5 NORMAL LOW
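In other words, `filter(like='FNR')` keeps the columns whose name contains "FNR", and `select_dtypes` then narrows that down by dtype. A minimal sketch of the same selection on a made-up frame follows; it uses `exclude=['number']` rather than `include=['object']`, which is a variation on the answer and assumes that every non-numeric FNR column is a string column.

```python
import pandas as pd

df = pd.DataFrame({
    "DATE": ["7-Feb-15", "7-Feb-15"],
    "FNRHP306H": ["NORMAL", "HIGH"],
    "FNRHP306HC": ["NORMAL", "LOW"],
    "FNRHP306_2MEC_MAX": [1050, 1050],
})

# Keep only the columns whose label contains the substring "FNR"
fnr = df.filter(like="FNR")

# Of those, keep the columns that are not numeric (assumed to be strings)
text_cols = fnr.select_dtypes(exclude=["number"]).columns

print(list(fnr.columns))
print(list(text_cols))
```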