How do I remove repeated letters in a DataFrame?

Question

I have the following string:

"hello, I'm going to eat to the fullest today hhhhhhhhhhhhhhhhhhhhh"

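I have collected many tweets like this and put them in a DataFrame. How can I clean those rows by removing the "hhhhhhhhhhhhhhhhhh" part while keeping the rest of the string in each row?

I will also be using CountVectorizer later, so otherwise the vocabulary ends up with many entries like "hhhhhhhhhhhhhhhhhhhhhh".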

Answer 1

Use a regular expression.

For example:

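import pandas as pd

df = pd.DataFrame({"Col": ["hello, I'm going to eat to the fullest today hhhhhhhhhhhhhhhhhhhhh", "Hello World"]})
# Alternative: match the repeated run between word boundaries (leaves the surrounding whitespace).
#df["Col"] = df["Col"].str.replace(r"\b(.)\1+\b", "", regex=True)
# Remove whitespace followed by a run of one repeated character that ends at a word boundary, then trim.
# regex=True is needed on pandas 2.0+, where str.replace treats the pattern as a literal by default.
df["Col"] = df["Col"].str.replace(r"\s+(.)\1+\b", "", regex=True).str.strip()
print(df)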

Output:

                                             Col
0  hello, I'm going to eat to the fullest today 
1                                    Hello World
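
As a rough sketch of the same idea outside pandas, the function below removes whole tokens that consist of a single repeated character. The function name `drop_repeated_char_tokens` and the `min_run` parameter are illustrative assumptions, not part of the answer above; the pattern uses `\w` instead of `.` and a minimum run length so that short legitimate tokens are less likely to be removed.

```python
import re

def drop_repeated_char_tokens(text, min_run=3):
    """Remove whole words made of one character repeated at least min_run times.

    Hypothetical helper for illustration; min_run is an assumed threshold.
    """
    # \b(\w)\1{n,}\b : one word character, then at least n more copies of it,
    # with word boundaries on both sides so only whole tokens such as "hhhh"
    # match, not runs inside longer words.
    pattern = re.compile(r"\b(\w)\1{%d,}\b" % (min_run - 1))
    cleaned = pattern.sub("", text)
    # Collapse the whitespace left behind by the removal.
    return re.sub(r"\s+", " ", cleaned).strip()

tweets = [
    "hello, I'm going to eat to the fullest today hhhhhhhhhhhhhhhhhhhhh",
    "Hello World",
    "aaa I see a zoo",
]
cleaned = [drop_repeated_char_tokens(t) for t in tweets]
```

If this fits your data, it could be applied to the column with something like `df["Col"].map(drop_repeated_char_tokens)`.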

Answer 2

You can try the following:

df["Col"] = df["Col"].str.replace(u"h{4,}", "", regex=True)

You can set the minimum number of repeated characters to match; in my case it is 4.

Before:

                                        Col
0  hello, I'm today hh hhhh hhhhhhhhhhhhhhh
1                               Hello World

After:

                     Col
0  hello, I'm today hh  
1            Hello World

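Since you mentioned that these are tweets, I used a unicode string literal for the pattern. (In Python 3 the u prefix is optional, because every str is already Unicode.)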
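
The `h{4,}` pattern only targets the letter "h". A minimal sketch of one way to generalize it to any repeated word character, using a backreference, is shown below. The name `remove_long_runs` and the `min_run` parameter are assumptions made for illustration. Note that, like `h{4,}`, this matches runs anywhere, including inside longer words.

```python
import re

def remove_long_runs(text, min_run=4):
    """Delete any run of the same word character that is at least min_run long.

    Hypothetical helper for illustration; min_run mirrors the 4 in h{4,}.
    """
    # (\w)\1{n,} : one word character followed by at least n copies of itself.
    pattern = re.compile(r"(\w)\1{%d,}" % (min_run - 1))
    return pattern.sub("", text)

sample = "hello, I'm today hh hhhh hhhhhhhhhhhhhhh"
result = remove_long_runs(sample)
# Runs shorter than min_run, such as "hh" or the "ll" in "hello", are kept.
# The spaces around the removed runs remain, so a .strip() may still be wanted.
```

With pandas, the same pattern string could be passed to `str.replace` with `regex=True`, followed by `.str.strip()` to drop the leftover trailing spaces.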


Answer 1
2 2019-05-14 07:22:19

Answer 2
1 Accepted 2019-05-14 07:33:23
