Replace strings in a column with unique random strings

Question

I have a CSV file with multiple columns, one of which consists of strings.

I start by reading the CSV file and keeping only two columns:

df = pd.read_csv("MyDATA_otherstring.csv", usecols=["describe_file", "data_numbers"])

This is the output:

                    describe_file  data_numbers
0  This is the start of the story        7309.0
1  This is the start of the story          35.0
2  This is the start of the story         302.0
3                  Difficult part        7508.5
4                  Difficult part         363.0

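Across roughly 10k rows there are about 150 unique strings, each of which appears many times in the file.

My goal is to filter by a string, for example the first one, "This is the start of the story", and replace it with a random string.

I want to iterate over all the strings in that column and replace each one with a unique string.

I looked at the random library and at some questions asked here, but unfortunately I did not find anything that helped.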

Answer 1

Here is your example:

import pandas as pd
import numpy as np
from string import ascii_lowercase

df = pd.DataFrame([['This is the start of the story']*3 + ['Difficult part']*2, 
    np.random.rand(5)], index=['describe_file', 'data_numbers']).T

                    describe_file data_numbers
0  This is the start of the story     0.825913
1  This is the start of the story     0.704422
2  This is the start of the story      0.91563
3                  Difficult part     0.192693
4                  Difficult part     0.795088

You can do this:

df.describe_file = df.join(df.groupby('describe_file')['describe_file'].apply(lambda x:
    ''.join(np.random.choice(list(ascii_lowercase), 10))), \
    on='describe_file', rsuffix='_NEW')['describe_file_NEW']

Result:

  describe_file data_numbers
0    skgfdrsktw     0.204907
1    skgfdrsktw     0.399947
2    skgfdrsktw     0.990196
3    rziuoslpqn     0.930852
4    rziuoslpqn     0.210122
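
Ten random lowercase letters are very unlikely to collide across about 150 groups, but the approach above does not strictly guarantee distinct strings. As a minimal sketch of an alternative, assuming an explicit uniqueness check is wanted, one could build a dictionary from each distinct value to a random label and apply it with Series.map. The helper names random_label and build_unique_mapping are hypothetical, and the sample data is copied from the question.

```python
import random
import string

import pandas as pd


def random_label(length=10):
    """Return a random lowercase string of the given length (hypothetical helper)."""
    return "".join(random.choices(string.ascii_lowercase, k=length))


def build_unique_mapping(values, length=10):
    """Map each distinct value to a distinct random label (hypothetical helper)."""
    mapping = {}
    used = set()
    for value in values:
        label = random_label(length)
        while label in used:  # retry on the (unlikely) collision
            label = random_label(length)
        used.add(label)
        mapping[value] = label
    return mapping


# Sample data taken from the question.
df = pd.DataFrame({
    "describe_file": ["This is the start of the story"] * 3 + ["Difficult part"] * 2,
    "data_numbers": [7309.0, 35.0, 302.0, 7508.5, 363.0],
})

# One label per distinct string; every occurrence of a string gets the same label.
mapping = build_unique_mapping(df["describe_file"].unique())
df["describe_file"] = df["describe_file"].map(mapping)
print(df)
```

Keeping the mapping dictionary around also makes it possible to translate the random labels back to the original strings later.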

Answer 2

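The previous answer by @Nicolas Gervais is great, but after reading the question a few times, I interpret it as asking to replace "This is the start of the story" with a random string while leaving the remaining "Difficult part" as is. The following command, which uses .replace(), does that.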

df['describe_file'].apply(lambda x: x.replace('This is the start of the story', ''.join(np.random.choice(list(ascii_lowercase), 10))))

0        glhrtqwlnl
1        qxrklnxhoj
2        kszgtysptj
3    Difficult part
4    Difficult part
Name: describe_file, dtype: object
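
Note that the lambda above is called once per row, so each matching row receives a different random string, as the output shows. If every occurrence of the target should instead share one random string, a minimal sketch (assuming that reading of the question) is to generate the string once and pass it to Series.replace, which swaps exact whole-value matches. The variable names target and replacement are illustrative.

```python
import random
import string

import pandas as pd

# Sample data taken from the question.
df = pd.DataFrame({
    "describe_file": ["This is the start of the story"] * 3 + ["Difficult part"] * 2,
    "data_numbers": [7309.0, 35.0, 302.0, 7508.5, 363.0],
})

target = "This is the start of the story"

# Generate the replacement once, outside any per-row function,
# so that all matching rows end up with the same string.
replacement = "".join(random.choices(string.ascii_lowercase, k=10))

# With scalar arguments, Series.replace substitutes values equal to `target`
# and leaves every other value (here "Difficult part") untouched.
df["describe_file"] = df["describe_file"].replace(target, replacement)
print(df)
```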
