
How to remove repeating letters in a dataframe?

I have the following string:

"hello, I'm going to eat to the fullest today hhhhhhhhhhhhhhhhhhhhh"

I have collected many tweets like that and put them in a dataframe. How can I clean those rows by removing "hhhhhhhhhhhhhhhhhh" and keeping only the rest of the string in each row?

I'm also using CountVectorizer later, and a lot of the resulting vocabulary entries contain 'hhhhhhhhhhhhhhhhhhhhhhh'.

Use a regular expression.

Example:

import pandas as pd

df = pd.DataFrame({"Col": ["hello, I'm going to eat to the fullest today hhhhhhhhhhhhhhhhhhhhh", "Hello World"]})
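# Alternative that leaves the surrounding whitespace in place:
# df["Col"] = df["Col"].str.replace(r"\b(.)\1+\b", "", regex=True)
# Remove the leading whitespace plus the run of one repeated character, then trim.
# regex=True is required on pandas 2.0+, where str.replace defaults to literal matching.
df["Col"] = df["Col"].str.replace(r"\s+(.)\1+\b", "", regex=True).str.strip()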
print(df)

Output:

                                            Col
0  hello, I'm going to eat to the fullest today
1                                   Hello World
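One caveat with `(.)` is that it matches any character, so short legitimate tokens such as "aa" or runs of spaces can also be caught. The sketch below is a stricter variant, not the answerer's code: the helper name `drop_repeated_letter_words` and the `min_run` threshold are assumptions introduced for illustration. It only removes whole words made of a single word character repeated at least `min_run` times, using the standard `re` module.

```python
import re


def drop_repeated_letter_words(text: str, min_run: int = 3) -> str:
    """Remove whole words that consist of one character repeated min_run+ times.

    Hypothetical helper: the name and the default threshold are assumptions.
    """
    # For min_run=3 this builds \b(\w)\1{2,}\b : one word character
    # followed by at least two more copies of itself, as a whole word.
    pattern = rf"\b(\w)\1{{{min_run - 1},}}\b"
    cleaned = re.sub(pattern, "", text)
    # Collapse the whitespace left behind by the removal and trim the ends.
    return re.sub(r"\s+", " ", cleaned).strip()


tweet = "hello, I'm going to eat to the fullest today hhhhhhhhhhhhhhhhhhhhh"
print(drop_repeated_letter_words(tweet))
print(drop_repeated_letter_words("see you too"))  # ordinary words are left alone
```

It can be applied to a column with `df["Col"].map(drop_repeated_letter_words)`. Note that real tokens such as "www" or "mmm" would also be removed at the default threshold, so `min_run` may need tuning for your data.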

You may try this:

df["Col"] = df["Col"].str.replace(u"h{4,}", "", regex=True)

Here you can set the minimum number of repeated characters to match; in my case it is 4.

Before:

                                        Col
0  hello, I'm today hh hhhh hhhhhhhhhhhhhhh
1                               Hello World

After:

                     Col
0  hello, I'm today hh  
1            Hello World

I used a unicode string literal (the u prefix), since you mentioned you are working with tweets. In Python 3 every str is already unicode, so the prefix is optional.
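The `h{4,}` pattern only handles the letter h. As a rough sketch of how the same idea could cover any repeated character, the example below swaps in a backreference pattern, `(\w)\1{3,}`, which is my assumption rather than part of the answer above. It also collapses the leftover whitespace, which the original one-liner does not do.

```python
import pandas as pd

df = pd.DataFrame({"Col": ["hello, I'm today hh hhhh hhhhhhhhhhhhhhh", "Hello World"]})

df["Col"] = (
    df["Col"]
    # Any word character followed by 3+ copies of itself, i.e. a run of 4 or more.
    .str.replace(r"(\w)\1{3,}", "", regex=True)
    # Collapse the gaps left by the removed runs, then trim the ends.
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)
print(df["Col"].tolist())
```

Because this pattern has no word boundaries, it also strips long runs inside words (for example the repeated o's in "sooooo"), which may or may not be what you want before vectorizing.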
