Binarize a float64 Pandas DataFrame in Python
I've got a Pandas DataFrame with various columns (each indicating the frequency of a word in a corpus). Each row corresponds to a document, and each column is of type float64.
For example:
word1 word2 word3
0.0 0.3 1.0
0.1 0.0 0.5
etc
I want to binarize this so that, instead of the frequency, I end up with a DataFrame of 0s and 1s indicating whether each word is present.
So the above example would be transformed to:
word1 word2 word3
0 1 1
1 0 1
etc
I looked at get_dummies(), but the output was not what I expected.
Casting to boolean will result in True for anything that is not zero and False for any zero entry. If you then cast to integer, you get ones and zeroes.
import io
import pandas as pd
data = io.StringIO('''\
word1 word2 word3
0.0 0.3 1.0
0.1 0.0 0.5
''')
df = pd.read_csv(data, sep=r'\s+')
res = df.astype(bool).astype(int)
print(res)
Output:
   word1  word2  word3
0      0      1      1
1      1      0      1
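One point worth checking before relying on the boolean cast: it treats every nonzero value as True, which includes negative numbers and NaN. Word frequencies should not be negative, but missing values can appear. The sketch below uses a hypothetical frame named edge (not from the question) to compare the cast with a strict "greater than zero" test.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a negative value and a missing value.
edge = pd.DataFrame({"word1": [0.0, -0.2, np.nan], "word2": [0.3, 0.0, 1.0]})

# astype(bool) treats every nonzero value as True, including negatives and NaN.
as_bool = edge.astype(bool).astype(int)

# A strict "greater than zero" test maps negatives and NaN to 0 instead.
positive = edge.gt(0).astype(int)

print(as_bool)
print(positive)
```

If the data may contain NaN, the comparison form is probably the safer choice, or the missing values can be filled first with fillna(0).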
Code:
import numpy as np
import pandas as pd
# Create some test data
random_data = np.random.random([3, 3])
random_data[0, 0] = 0.0
random_data[1, 2] = 0.0
df = pd.DataFrame(random_data,
                  columns=['A', 'B', 'C'], index=['first', 'second', 'third'])
print(df)
# Binarize: True where the value is positive, then cast to 0/1
threshold = lambda x: x > 0
df_ = df.apply(threshold).astype(int)
print(df_)
Output:
               A         B         C
first   0.000000  0.610263  0.301024
second  0.728070  0.229802  0.000000
third   0.243811  0.335131  0.863908
        A  B  C
first   0  1  1
second  1  1  0
third   1  1  1
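The apply call is not strictly needed, since a comparison on a DataFrame is already element-wise. A minimal sketch, assuming the same kind of test data (the seed and the names via_apply and via_compare are chosen here for illustration), shows that both forms give the same result:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
values = rng.random((3, 3))
values[0, 0] = 0.0
values[1, 2] = 0.0
df = pd.DataFrame(values, columns=["A", "B", "C"],
                  index=["first", "second", "third"])

# Column-wise apply, as in the answer above.
via_apply = df.apply(lambda col: col > 0).astype(int)

# Direct comparison on the whole frame.
via_compare = (df > 0).astype(int)

print(via_compare.equals(via_apply))
```

The direct comparison is shorter and avoids a Python-level call per column, which may matter on wide frames.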
Remarks:
Found an alternative way using Pandas indexing.
This can be done simply with:
df[df>0] = 1
Simple as that!
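Two things to keep in mind with the masked assignment: it modifies the DataFrame in place, and the columns stay float64 (the values become 1.0 and 0.0). A small sketch, assuming the data from the question and a copy named binary (a name introduced here), keeps the original frequencies and converts the result to integers:

```python
import pandas as pd

df = pd.DataFrame({"word1": [0.0, 0.1],
                   "word2": [0.3, 0.0],
                   "word3": [1.0, 0.5]})

binary = df.copy()        # work on a copy so the frequencies are preserved
binary[binary > 0] = 1    # boolean-mask assignment; dtype remains float64
print(binary.dtypes)

binary = binary.astype(int)  # convert 1.0/0.0 to integer 1/0
print(binary)
```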