How to replace a string value with the means of a column's groups in the entire dataframe

I have a large dataset with 400 columns and 30,000 rows. The dataset is all numerical, but some columns contain weird string values (denoted as "#?") instead of being blank. This changes the dtype of the columns that contain "#?" to object (150 columns have object dtype).

I need to convert all the columns to float or int dtypes, and then fill the normal NaN values in the data with the means of each column's groups (e.g. the mean of X and the mean of Y in each column).

col1  col2  col3
X     21    32
X     NaN   3
Y     NaN   5

My end goal is to apply this to the entire data:

df.groupby("col1").transform(lambda x: x.fillna(x.mean()))

But I can't apply this to the columns that have "#?" in them; they get dropped. I tried replacing the "#?" with a numerical value and then converting all the columns to float dtype, which works, but the replaced values should also be filled by the code above.

I thought about replacing "#?" with a weird value like -123.456 so that it doesn't get mixed up with actual data points, and then maybe replacing all the -123.456 values with the column group means, but the -123.456 would need to be excluded from the mean. I just don't know how that would even work. If I convert it back to NaN again, the dtype changes back to object.
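For what it's worth, a float column can hold NaN without falling back to object, so the sentinel can be turned into NaN before the group means are taken. A minimal sketch of that idea, assuming a tiny made-up frame (the column names and the -123.456 sentinel follow the description above; the data and the names df_sentinel and filled are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data: "#?" has already been swapped for the sentinel -123.456
df_sentinel = pd.DataFrame({
    "col1": ["X", "X", "Y", "Y"],
    "col2": [21.0, -123.456, 4.0, np.nan],
})

# Turn the sentinel into NaN; the column stays float64
df_sentinel["col2"] = df_sentinel["col2"].replace(-123.456, np.nan)

# mean() skips NaN by default, so the sentinel never enters the group means
filled = df_sentinel.groupby("col1")["col2"].transform(lambda s: s.fillna(s.mean()))
```

Here both the former sentinel and the ordinary NaN end up filled with their group's mean.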

I think the best way to go about it would be to replace the "#?" directly with the column group means.

Any ideas?

Edit: I'm so dumb lol

df=df.replace('#?', '').astype(float, errors = 'ignore')

This works.

Use:

print (df)
  col1 col2  col3
0    X   21    32
1    X   #?     3
2    Y  NaN     5

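import numpy as np

df = (df.set_index('col1')                         # keep the group key out of the float cast
        .replace(r'#\?', np.nan, regex=True)       # turn "#?" into NaN
        .astype(float)
        .groupby("col1")
        .transform(lambda x: x.fillna(x.mean())))  # fill NaN with each group's mean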
print (df)
      col2  col3
col1            
X     21.0  32.0
X     21.0   3.0
Y      NaN   5.0
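
If stray strings other than "#?" might also appear, one variant is to coerce every non-key column with pd.to_numeric(errors="coerce"), which turns anything unparseable into NaN. A sketch under that assumption, swapping to_numeric in for the regex replace above (the sample frame and the names df_raw, value_cols, numeric and result are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the frame above
df_raw = pd.DataFrame({
    "col1": ["X", "X", "Y"],
    "col2": ["21", "#?", np.nan],
    "col3": [32, 3, 5],
})

# Every column except the group key is coerced; unparseable strings become NaN
value_cols = df_raw.columns.drop("col1")
numeric = df_raw[value_cols].apply(pd.to_numeric, errors="coerce")

# Fill NaN with the mean of each col1 group, column by column
result = numeric.groupby(df_raw["col1"]).transform(lambda s: s.fillna(s.mean()))

# Put the group key back as the first column
result.insert(0, "col1", df_raw["col1"])
```

A group with no valid value in a column (Y in col2 here) has no mean to fill with, so it stays NaN, just as in the output above.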
