Unable to remove the "pgnbr" string from a corpus

Question

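stackoverflow-ers. I'm preprocessing a corpus of text data in Python for a research project, and I've reached the point where I need to strip out extraneous characters and whitespace. However, for some reason, the code I've put together won't remove the instances of pgnbr from the corpus. I've tried testing a regular expression and simply passing the string value itself, with no luck. This seems like it should be a very straightforward thing: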

#convert to string
ocr = data_one['Ocr text'].to_string()

# regular expressions and cleaning tasks. 
import re
digit_pattern = r'\d+'
whitespace_pattern = r'\s+'


clean = re.sub(digit_pattern, '', ocr)
clean = re.sub('\n', '', clean)
clean = re.sub('•', '', clean)
clean = re.sub('«', '', clean)
clean = re.sub('■', '', clean)
# struggling with correct syntax to remove pgnbr. 
clean = re.sub('pgnbr', '', clean)

# punctuation 
from string import punctuation
no_punct = ''.join([ch for ch in clean if ch not in punctuation])

# strip whitespace and lower. 
clean_text = re.sub(whitespace_pattern, ' ', no_punct)
clean_text = clean_text.strip().lower()

# tokenize 
from nltk.tokenize import word_tokenize
tokens = word_tokenize(clean_text)


# pgnbr string data will not go away!!! 
from collections import Counter
freq = Counter(tokens)
freq.most_common(10)

#the output of Counter
Out[225]: 
[('pgnbr', 118),
 ('the', 100),
 ('of', 68),
 ('i', 64),
 ('a', 64),
 ('to', 48),
 ('t', 38),
 ('s', 32),
 ('and', 31)]

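Why is "pgnbr" so sticky? I'm sure there is a simple answer; I just haven't found it yet. Any help is appreciated. I can also put together a reproducible example if asked. Thanks, everyone!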

Answer 1

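You are searching for pgnbr before the punctuation and whitespace have been stripped, so cases like p, gn, br are not matched at that point and can still turn up as pgnbr in the resulting string.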
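To illustrate the point, here is a minimal sketch of how an early `re.sub('pgnbr', ...)` can miss variants of the marker. The variant strings are hypothetical; the question does not show the raw OCR text, so whether it actually contains uppercase or punctuated forms is an assumption.

```python
import re

# Hypothetical raw forms of the marker as they might appear in OCR text
# (assumed for illustration; the real corpus is not shown in the question).
variants = ["pgnbr", "PGNBR", "Pgnbr", "p,gn,br"]

# re.sub is case-sensitive and matches the literal characters only,
# so anything other than the exact lowercase string is left untouched.
survivors = [raw for raw in variants if re.sub('pgnbr', '', raw)]

print(survivors)
```

Each surviving variant becomes the token pgnbr once the later steps strip punctuation and lowercase the text.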

You should move

clean = re.sub('pgnbr', '', clean)

below the line clean_text = clean_text.strip().lower(), applying it to clean_text rather than clean, and it should work.
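The reordering could look roughly like the sketch below. It is not the asker's exact pipeline: `clean_corpus` and the `sample` string are hypothetical, `str.split` stands in for `word_tokenize` so the block runs without nltk, and newlines are replaced with a space rather than an empty string so that neighbouring words are not glued together.

```python
import re
from collections import Counter
from string import punctuation


def clean_corpus(ocr: str) -> str:
    """Hypothetical helper: clean OCR text, removing 'pgnbr' last."""
    # Remove digits and the stray OCR symbols first.
    clean = re.sub(r'\d+', '', ocr)
    clean = re.sub(r'[•«■]', '', clean)
    # Assumption: use a space for newlines (the question used '').
    clean = clean.replace('\n', ' ')
    # Strip punctuation, collapse whitespace, then lowercase.
    no_punct = ''.join(ch for ch in clean if ch not in punctuation)
    clean_text = re.sub(r'\s+', ' ', no_punct).strip().lower()
    # Only now remove the marker: every variant has been normalised.
    clean_text = re.sub('pgnbr', '', clean_text)
    # Removing the marker can leave double spaces, so collapse again.
    return re.sub(r'\s+', ' ', clean_text).strip()


# Made-up sample text, not taken from the real corpus.
sample = "PGNBR 12 The cat.\nP,GN,BR 13 • the dog « pgnbr"
tokens = clean_corpus(sample).split()
freq = Counter(tokens)
print(freq.most_common(10))
```

An alternative that keeps the original ordering would be to pass `flags=re.IGNORECASE` to `re.sub`, though that alone would not catch forms with punctuation inside the marker.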

Solution 1
0 2021-01-04 00:38:06