Binarizing a float64 Pandas DataFrame in Python

Question

I've got a Pandas DataFrame with various columns, each holding the frequency of a word in a corpus. Each row corresponds to a document, and every column is of type float64.

For example:

word1 word2 word3
0.0   0.3   1.0
0.1   0.0   0.5
etc

I want to binarize this so that, instead of the frequency, I end up with a DataFrame of 0s and 1s indicating whether a word is present.

So the above example would be transformed to:

word1 word2 word3
0      1     1
1      0     1
etc

I looked at get_dummies(), but the output was not what I expected.

Answer 1

Casting to boolean gives True for anything that is not zero and False for any zero entry. If you then cast to integer, you get ones and zeros.

import io
import pandas as pd

data = io.StringIO('''\
word1 word2 word3
0.0   0.3   1.0
0.1   0.0   0.5
''')
df = pd.read_csv(data, sep=r'\s+')  # delim_whitespace is deprecated in recent pandas

res = df.astype(bool).astype(int)
print(res)

Output:

   word1  word2  word3
0      0      1      1
1      1      0      1
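One caveat worth sketching: casting to boolean treats NaN as True, so a missing frequency would be counted as a present word. The frame below is made up for illustration (it is not the asker's data), and replacing NaN with 0 first is only one possible choice; whether a missing value should count as absent is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing frequency in word1.
df = pd.DataFrame({"word1": [0.0, 0.1, np.nan], "word2": [0.3, 0.0, 1.0]})

# Plain double cast: NaN is truthy, so it becomes 1.
naive = df.astype(bool).astype(int)

# Assumption: a missing frequency means the word is absent, so fill with 0 first.
safe = df.fillna(0).astype(bool).astype(int)

print(naive)
print(safe)
```

If the data is known to contain no missing values, the plain double cast from the answer is enough.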

Answer 2

I would have answered as @Alberto Garcia-Raboso did, but here is an alternative that is very quick and leverages the same idea.

Use np.where:

import numpy as np
pd.DataFrame(np.where(df, 1, 0), df.index, df.columns)
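For anyone who wants to check the speed claim on their own machine, here is a rough sketch that compares the two approaches with timeit. The frame size, the random data, and the helper names via_astype and via_where are all made up for illustration; actual timings depend on the pandas and NumPy versions and on the shape of the data, so no numbers are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test frame: 1000 documents x 50 words, roughly half the entries zero.
rng = np.random.default_rng(0)
values = rng.random((1000, 50))
values[values < 0.5] = 0.0
df = pd.DataFrame(values, columns=[f"word{i}" for i in range(50)])


def via_astype(frame):
    # Double cast: float -> bool -> int.
    return frame.astype(bool).astype(int)


def via_where(frame):
    # np.where treats nonzero entries as True and rebuilds the frame from the array.
    return pd.DataFrame(np.where(frame, 1, 0), frame.index, frame.columns)


# Both approaches should produce the same 0/1 values.
same = (via_astype(df).to_numpy() == via_where(df).to_numpy()).all()
print("same result:", same)

for fn in (via_astype, via_where):
    seconds = timeit.timeit(lambda: fn(df), number=100)
    print(f"{fn.__name__}: {seconds / 100 * 1e6:.1f} microseconds per call")
```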

Answer 3

Code:

import numpy as np
import pandas as pd

# create some test data
random_data = np.random.random([3, 3])
random_data[0,0] = 0.0
random_data[1,2] = 0.0

df = pd.DataFrame(random_data,
     columns=['A', 'B', 'C'], index=['first', 'second', 'third'])

print(df)

# binarize: mark strictly positive entries, then cast to 0/1
threshold = lambda x: x > 0
df_ = df.apply(threshold).astype(int)

print(df_)

Output:

               A         B         C
first   0.000000  0.610263  0.301024
second  0.728070  0.229802  0.000000
third   0.243811  0.335131  0.863908
        A  B  C
first   0  1  1
second  1  1  0
third   1  1  1
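The comparison can also be applied to the whole frame at once, which makes it easy to use a cutoff other than zero. A minimal sketch, reusing the numbers from the question; the cutoff of 0.2 is a made-up value for illustration, not something the asker requested.

```python
import pandas as pd

# The example frame from the question.
df = pd.DataFrame({"word1": [0.0, 0.1],
                   "word2": [0.3, 0.0],
                   "word3": [1.0, 0.5]})

# Present if the frequency is strictly positive.
present = (df > 0).astype(int)

# Hypothetical stricter rule: count a word only above a minimum frequency.
min_frequency = 0.2  # assumed cutoff, for illustration only
frequent = (df > min_frequency).astype(int)

print(present)
print(frequent)
```

Note that a strict comparison against zero differs from casting to bool when negative values occur; for word frequencies that should not matter.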

Remarks:

get_dummies() analyzes each unique value per column and introduces a new column for each unique value to mark whether that value is present.
For example, if column A has 20 unique values, 20 new columns are added; in each row exactly one of them is true and the others are false.
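To make the remark concrete, here is a small sketch of what get_dummies() does when asked to encode float columns. The frame is made up for illustration, and the columns are passed explicitly through the columns argument so that the numeric columns are actually encoded.

```python
import pandas as pd

# Hypothetical frame: word1 has 2 distinct values, word2 has 3.
df = pd.DataFrame({"word1": [0.0, 0.1, 0.1], "word2": [0.3, 0.0, 1.0]})

# One indicator column is created per distinct value of each listed column.
dummies = pd.get_dummies(df, columns=["word1", "word2"])

print(dummies.columns.tolist())
print(dummies.shape)  # 2 + 3 = 5 indicator columns
```

This is one-hot encoding of the frequency values themselves, which is why it does not give the presence/absence table the question asks for.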

Answer 4

Found an alternative way using Pandas indexing.

This can be done simply with:

df[df>0] = 1

Simple as that!
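Two details of the masked assignment are worth spelling out: it modifies the frame in place, and the columns stay float64 (values 0.0 and 1.0) unless they are cast afterwards. A minimal sketch on the question's numbers; working on a copy and the names masked and binary are choices made here for illustration.

```python
import pandas as pd

# The example frame from the question.
df = pd.DataFrame({"word1": [0.0, 0.1],
                   "word2": [0.3, 0.0],
                   "word3": [1.0, 0.5]})

# Work on a copy so the original frequencies are kept.
masked = df.copy()
masked[masked > 0] = 1

print(masked.dtypes)  # still float64 after the assignment

# Cast if integer 0/1 values are wanted.
binary = masked.astype(int)
print(binary)
```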

4 answers

Answer 1: 5, 2016-09-27 23:36:02
Answer 2: 1, 2016-09-28 00:09:23
Answer 3: 0, 2016-09-27 23:19:33
Answer 4: 0, 2016-10-04 19:55:36
