Replace text in a pandas dataframe column using a regular expression

Question

I have a dataframe with two columns: "Name" and "Score and comment".

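Each value in the "Score and comment" column starts in one of the following 3 ways:

A number from 0-999 and nothing else
A number from 0-999 followed by a string of text
A string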

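I want to change the values of the "Score and comment" column so that:

If the value starts with a number, delete all the text after the number but keep the number. The number will be in the range 0-999, never higher.
If there is no number, replace all the text with "0".
If there is only a number, leave it alone.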

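I have tried looking at regular expressions, but either I'm approaching the problem from the wrong angle or I just can't make sense of them.

I have tried myDataFrame.replace('[0-9]{1,3}\s*', '') but the closest I can get is that it matches the first 3 characters (if they are digits) and removes them.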

Answer 1

This is a good use case for str.extract, which takes a regular expression and keeps only the matched group. For example:

>>> x = pd.Series(["100 some text", "1", "123", "text that should be 0"])
>>> x.str.extract(r'(^[0-9]{1,3})').fillna(0)
     0
0  100
1    1
2  123
3    0
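
One thing to be aware of: {1,3} only limits how many digits are captured, so a longer number is truncated rather than rejected. The sketch below illustrates this with made-up values; the second pattern, with a negative lookahead, is an assumed variant and not part of the original answer.

```python
import pandas as pd

# Hypothetical values, chosen to include a 4-digit number
x = pd.Series(["1234 some text", "999 ok", "text"])

# The pattern from the answer keeps only the first three digits of "1234"
first_three = x.str.extract(r'(^[0-9]{1,3})', expand=False)

# Assumed variant: (?![0-9]) refuses to match when a further digit follows
strict = x.str.extract(r'^([0-9]{1,3})(?![0-9])', expand=False)

print(first_three.tolist())
print(strict.tolist())
```

With the first pattern "1234 some text" yields "123"; with the second it yields no match, which fillna would then turn into 0.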

So, assuming you don't need to worry about numbers outside 0-999, you can do:

myDataFrame["Score and comment"].str.extract(r'(^[0-9]{1,3})').fillna(0)
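
The line above returns a new object and leaves the dataframe unchanged. A minimal sketch of writing the result back to the column and converting it to integers might look like this; the sample rows are made up for illustration, and expand=False is used so that str.extract returns a Series for the single capture group.

```python
import pandas as pd

# Hypothetical sample data, not taken from the question
myDataFrame = pd.DataFrame({
    "Name": ["a", "b", "c", "d"],
    "Score and comment": ["100 some text", "1", "123", "text that should be 0"],
})

# expand=False returns a Series instead of a one-column DataFrame
extracted = myDataFrame["Score and comment"].str.extract(
    r'(^[0-9]{1,3})', expand=False
)

# Rows without a leading number are missing; fill them with "0",
# then convert the whole column from strings to integers
myDataFrame["Score and comment"] = extracted.fillna("0").astype(int)

print(myDataFrame["Score and comment"].tolist())  # [100, 1, 123, 0]
```

Filling with the string "0" rather than the integer 0 keeps the column uniformly string-typed until the final conversion.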

Answer 2

I put some sample data into a csv file, like this:

Name,Score and comment
Amun,123 this is cool
mirjam,23 this is nice
munkel,2 that's just amazing
punkel,this is funny
Rolf,123
Rolf,2
Mirjam2,17
Mirjaa,das ist gut

Then I ran the following code:

    import pandas as pd
    df = pd.read_csv("/filepath/sample_data.txt")
    
    def to_score(value):
        #cells that contain only a number convert directly
        try:
            return int(value)
        except (TypeError, ValueError):
            pass
        #mixed cells: keep the part before the first space and convert it;
        #cells that contain only text fail here as well and are set to 0
        try:
            return int(str(value).split(" ")[0])
        except ValueError:
            return 0
    
    #assign the result back to the column so that df itself is updated
    df["Score and comment"] = df["Score and comment"].apply(to_score)
    
    #save the file to a new csv
    df.to_csv("/filepath/sample_data_convertet.txt", index = False)

The output is as follows:

Name,Score and comment
Amun,123
mirjam,23
munkel,2
punkel,0
Rolf,123
Rolf,2
Mirjam2,17
Mirjaa,0

It works fine for me :-) I hope this helps.
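
For comparison, here is a vectorized sketch of the same idea that avoids the explicit loop and reads the sample data from an in-memory string instead of a file. It swaps in pd.to_numeric with errors="coerce" for the try/except logic, which is an alternative technique rather than the approach used in this answer, and it would treat a token such as "1.5" differently (it becomes 1 instead of 0).

```python
import io
import pandas as pd

# Sample data from this answer, read from memory instead of a file
csv_text = """Name,Score and comment
Amun,123 this is cool
mirjam,23 this is nice
munkel,2 that's just amazing
punkel,this is funny
Rolf,123
Rolf,2
Mirjam2,17
Mirjaa,das ist gut
"""
df = pd.read_csv(io.StringIO(csv_text))

# Take the text before the first space of each cell
first_token = df["Score and comment"].str.split(" ").str[0]

# Tokens that are not numbers become NaN, which is then replaced by 0
df["Score and comment"] = (
    pd.to_numeric(first_token, errors="coerce").fillna(0).astype(int)
)

print(df.to_csv(index=False))
```

On the sample data this produces the same scores as the output shown above.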
