How do I remove punctuation from a string in a pandas Series

I am trying to remove punctuation from a pandas Series. My problem is that I am unable to iterate over all the lines in the Series. This is the code that I tried, but it is taking forever to run. Note that my dataset is fairly large, around 112 MB (200,000 rows).

import pandas as pd
import string

df = pd.read_csv('let us see.csv')
s = set(string.punctuation)

for st in df.reviewText.str:
    for j in s:
        if j in st:
            df.reviewText = df.reviewText.str.replace(j, '')

df.reviewText = df.reviewText.str.lower()
df['clean_review'] = df.reviewText
print(df.clean_review.tail())

DEN's answer is pretty good. I'll just add another way to improve the performance of your code. Your loop re-runs str.replace on the entire column every time it finds a matching character, which is why it takes so long. Iterating over a list version of your Series should be faster than your approach.

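import pandas as pd
import string

def replace_chars(text, chars):
    # Remove every character in `chars` from `text`, then lowercase the result.
    for c in chars:
        text = text.replace(c, '')
    return text.lower()

df = pd.read_csv('let us see.csv')
s = set(string.punctuation)

# Convert the column to a plain Python list of strings (NaN becomes 'nan').
reviewTextList = df.reviewText.astype(str).tolist()
# Clean each review exactly once.
reviewTextList = [replace_chars(x, s) for x in reviewTextList]

df['clean_review'] = reviewTextList
print(df.clean_review.tail())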
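
As a possible further refinement of the same idea, the per-character loop in replace_chars could be swapped for a single str.translate call, which deletes all punctuation in one pass over each string. The sketch below is not part of the original answer; the names PUNCT_TABLE, clean_text and clean_reviews and the sample strings are made up for illustration, and it only handles the ASCII punctuation in string.punctuation.

```python
import string

# Translation table that maps every ASCII punctuation character to None,
# which means str.translate deletes it. (Hypothetical name.)
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_text(text):
    """Remove ASCII punctuation from one value and lowercase it."""
    # str() mirrors the astype(str) step above, so non-string values do not raise.
    return str(text).translate(PUNCT_TABLE).lower()

def clean_reviews(texts):
    """Clean every item of an iterable of reviews exactly once."""
    return [clean_text(t) for t in texts]

# Made-up sample data standing in for df.reviewText.tolist()
sample = ["Great product!!!", "Didn't work... (returned it)", "OK, I guess?"]
print(clean_reviews(sample))
```

With the DataFrame from above, this would presumably be used as df['clean_review'] = clean_reviews(df.reviewText.tolist()). pandas also offers Series.str.translate, which accepts the same kind of table, if you prefer to stay inside pandas.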
