
Capitalize random rows in a pandas DataFrame

I'm making a reverse denoising autoencoder. My dataset is all lowercase, but I want the source entry capitalized in 80% of the rows and only 60% of the target entries capitalized. I wrote this:

import pandas as pd
import torch

df = pd.read_csv('Data/fb_moe.csv')

for i in range(len(df)):
    sample = int(torch.distributions.Bernoulli(torch.FloatTensor([.8])).sample())

    if sample == 1:
        df.iloc[i].y = str(df.iloc[i].y).capitalize()

        sample_1 = int(torch.distributions.Bernoulli(torch.FloatTensor([.6])).sample())

        if sample_1 == 1:
            df.iloc[i].x = str(df.iloc[i].x).capitalize()

df.to_csv('Data/fb_moe2.csv')

But this is pretty slow because my CSV has about 8 million rows. Is there a faster way to do this?

Part of the DataFrame:

x,y
jon,jun
an,jun
ju,jun
jin,jun
nun,jun
un,jun
jon,jun
jin,jun
nen,jun
ju,jun
jn,jun
jul,jun
jen,jun
hun,jun
ju,jun
hun,jun
hun,jun
jon,jun
jin,jun
un,jun
eun,jun
jhn,jun

Try using boolean masks together with vectorized string methods; pandas is slow when rows are processed one at a time in a for loop.

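import numpy as np

n = len(df)
# Rows whose source entry ('x') gets capitalized: each row is picked with probability 0.8
source = np.random.binomial(1, p=.8, size=n) == 1
target = source.copy()

# Among the picked rows, keep each one for the target ('y') with probability 0.6
total_source_true = np.sum(source)
target[source] = np.random.binomial(1, p=.6, size=total_source_true) == 1

# Capitalize whole column slices at once through the boolean masks
df.loc[source, 'x'] = df.loc[source, 'x'].str.capitalize()
df.loc[target, 'y'] = df.loc[target, 'y'].str.capitalize()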
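
To try the masking idea without the full CSV, here is a minimal self-contained sketch on toy data. The function name `capitalize_random_rows` and its parameters `p_source`, `p_target` and `seed` are made up for illustration. It assumes, as in the snippet above, that `x` is the source column, `y` is the target column, and that the target is only capitalized in rows where the source was capitalized.

```python
import numpy as np
import pandas as pd


def capitalize_random_rows(df, p_source=0.8, p_target=0.6, seed=None):
    """Hypothetical helper: return a copy of df with 'x' capitalized in
    roughly p_source of the rows and 'y' capitalized in roughly p_target
    of those selected rows."""
    rng = np.random.default_rng(seed)
    out = df.copy()

    # Boolean mask: True where the source column should be capitalized.
    source = rng.random(len(out)) < p_source

    # The target mask is a subset of the source mask.
    target = source & (rng.random(len(out)) < p_target)

    # Vectorized updates, no Python-level loop over rows.
    out.loc[source, "x"] = out.loc[source, "x"].str.capitalize()
    out.loc[target, "y"] = out.loc[target, "y"].str.capitalize()
    return out


# Toy data in the same shape as the sample shown in the question.
toy = pd.DataFrame({"x": ["jon", "an", "ju", "jin"] * 2500, "y": ["jun"] * 10000})
result = capitalize_random_rows(toy, seed=0)

# Share of rows whose first letter ended up uppercase in each column.
source_share = result["x"].str[0].str.isupper().mean()
target_share = result["y"].str[0].str.isupper().mean()
print(f"x capitalized: {source_share:.2f}, y capitalized: {target_share:.2f}")
```

Because the second draw only applies to rows already selected by the first, the share of capitalized `y` entries should come out near 0.8 × 0.6 = 0.48 of all rows, not 0.6. If 60% of all rows is what you want, draw the target mask independently of the source mask instead.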
