Binarize a float64 Pandas Dataframe in Python

I've got a pandas DataFrame with various columns, each indicating the frequency of a word in a corpus. Each row corresponds to a document, and every column is of type float64.

For example:

word1 word2 word3
0.0   0.3   1.0
0.1   0.0   0.5
etc

I want to binarize this so that, instead of the frequencies, I end up with a DataFrame of 0s and 1s that indicates whether each word occurs in a document.

So the above example would be transformed to:

word1 word2 word3
0      1     1
1      0     1
etc

I looked at get_dummies(), but the output was not what I expected.

Casting to boolean yields True for anything that is not zero and False for any zero entry. If you then cast to integer, you get ones and zeroes.

import io
import pandas as pd

data = io.StringIO('''\
word1 word2 word3
0.0   0.3   1.0
0.1   0.0   0.5
''')
# sep=r'\s+' splits on runs of whitespace (delim_whitespace=True is deprecated in recent pandas)
df = pd.read_csv(data, sep=r'\s+')

res = df.astype(bool).astype(int)
print(res)

Output:

   word1  word2  word3
0      0      1      1
1      1      0      1
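
One caveat with the cast, sketched below on a made-up frame (the names df_edge, by_cast and by_compare and the values are illustrative, not from the question): astype(bool) marks every non-zero value as True, and that includes negative numbers and NaN. If missing values can occur in the data, an explicit comparison such as gt(0) may be the safer choice.

```python
import numpy as np
import pandas as pd

# Hypothetical data containing a negative value and a missing value
df_edge = pd.DataFrame({"word1": [0.0, -0.2, np.nan],
                        "word2": [0.3, 0.0, 1.0]})

# Casting: anything that is not exactly zero becomes 1 (NaN included)
by_cast = df_edge.astype(bool).astype(int)

# Comparing: only strictly positive values become 1 (NaN compares as False)
by_compare = df_edge.gt(0).astype(int)

print(by_cast)
print(by_compare)
```

For non-negative frequencies without missing values, as in the question, both variants give the same result.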

I would have answered as @Alberto Garcia-Raboso did, but here is an alternative that is very quick and leverages the same idea.

Use np.where:

pd.DataFrame(np.where(df, 1, 0), df.index, df.columns)
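
For completeness, here is the one-liner as a self-contained sketch with its imports. The frame is rebuilt from the question's example values, and the name res is just a placeholder.

```python
import numpy as np
import pandas as pd

# Example data taken from the question
df = pd.DataFrame({"word1": [0.0, 0.1],
                   "word2": [0.3, 0.0],
                   "word3": [1.0, 0.5]})

# np.where(condition, 1, 0): non-zero entries are truthy and map to 1,
# zero entries map to 0; index and columns are carried over from df
res = pd.DataFrame(np.where(df, 1, 0), index=df.index, columns=df.columns)
print(res)
```

np.where returns a plain NumPy array, which is why the index and column labels have to be passed back in when the DataFrame is rebuilt.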

Timing

[timing screenshot not preserved]

Code:

import numpy as np
import pandas as pd

# create some test data: random floats with two entries forced to zero
random_data = np.random.random([3, 3])
random_data[0,0] = 0.0
random_data[1,2] = 0.0

df = pd.DataFrame(random_data,
     columns=['A', 'B', 'C'], index=['first', 'second', 'third'])

print(df)

# binarize: 1 where the value is strictly positive, 0 otherwise
df_ = df.apply(lambda col: col > 0).astype(int)

print(df_)

Output:

               A         B         C
first   0.000000  0.610263  0.301024
second  0.728070  0.229802  0.000000
third   0.243811  0.335131  0.863908
        A  B  C
first   0  1  1
second  1  1  0
third   1  1  1
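
Writing the test as an explicit comparison also makes it easy to use a cutoff other than zero. The sketch below assumes a hypothetical cutoff of 0.25 and uses fixed, made-up values instead of random data so that the result is reproducible.

```python
import pandas as pd

# Fixed illustrative data (not the random values shown above)
df = pd.DataFrame({"A": [0.0, 0.7, 0.2],
                   "B": [0.6, 0.2, 0.3],
                   "C": [0.3, 0.0, 0.9]},
                  index=["first", "second", "third"])

cutoff = 0.25  # hypothetical threshold; use 0 to reproduce the behaviour above

# gt() compares element-wise and works on the whole frame at once
df_bin = df.gt(cutoff).astype(int)
print(df_bin)
```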

Remarks:

  • get_dummies() analyzes each unique value per column and introduces a new column for each unique value to mark whether that value is active
  • i.e. if column A has 20 unique values, 20 new columns are added, and in each row exactly one of them is true while the others are false
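
To make the remarks concrete, here is a small sketch on the question's example values. One detail that I believe holds for current pandas: numeric columns are only encoded when they are named in the columns argument, so a plain get_dummies(df) call on an all-float frame would leave it as it is.

```python
import pandas as pd

# Example data taken from the question
df = pd.DataFrame({"word1": [0.0, 0.1],
                   "word2": [0.3, 0.0],
                   "word3": [1.0, 0.5]})

# Encode every column explicitly: one indicator column per unique value
dummies = pd.get_dummies(df, columns=list(df.columns))

# Each original column has 2 unique values here, so 3 * 2 = 6 new columns
print(dummies.shape)
print(dummies)
```

Each row has exactly one active indicator per original column, which is a one-hot encoding of the values rather than the presence/absence flag the question asks for.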

Found an alternative way using pandas indexing.

This can be done simply by:

df[df>0] = 1

Simple as that!
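
Two things to be aware of with this approach: the assignment modifies df in place, and the columns keep their float64 dtype, so the result holds 1.0 and 0.0 rather than integers. A minimal sketch that works on a copy and casts at the end (the name binary is just a placeholder):

```python
import pandas as pd

# Example data taken from the question
df = pd.DataFrame({"word1": [0.0, 0.1],
                   "word2": [0.3, 0.0],
                   "word3": [1.0, 0.5]})

binary = df.copy()           # leave the original frequencies untouched
binary[binary > 0] = 1       # boolean-mask assignment; dtype is still float64
binary = binary.astype(int)  # convert 1.0 / 0.0 to 1 / 0

print(binary)
```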
