Removing punctuation/numbers from text problem

I had some code that worked fine for removing punctuation and numbers using regular expressions in Python. I had to change the code a bit so that a stop list would work (the details are not particularly important). Now the punctuation isn't being removed, and quite frankly I'm stumped as to why.

import re
import nltk

# Quran subset
filename = raw_input('Enter name of file to convert to ARFF with extension, eg. name.txt: ')

# create list of lower case words
word_list = re.split('\s+', file(filename).read().lower())
print 'Words in text:', len(word_list)
# punctuation and numbers to be removed
punctuation = re.compile(r'[-.?!,":;()|0-9]')
for word in word_list:
    word = punctuation.sub("", word)
print word_list

Any pointers on why it's not working would be great. I'm no expert in Python, so it's probably something ridiculously stupid. Thanks.

Change

for word in word_list:
    word = punctuation.sub("", word)

to

word_list = [punctuation.sub("", word) for word in word_list]    

Assigning to word in the for-loop above simply changes the value referenced by this temporary variable. It does not alter word_list.
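A minimal sketch of the difference, written for Python 3 (the question's code is Python 2). The sample words are made up for illustration and stand in for the contents of the file; the names unchanged and cleaned are not from the original code.

```python
import re

# Same character class as in the question: punctuation marks and digits.
punctuation = re.compile(r'[-.?!,":;()|0-9]')

# Hypothetical sample input standing in for the words read from the file.
word_list = ["hello,", "world!", "(verse", "12)"]

# Rebinding the loop variable leaves the list itself untouched.
for word in word_list:
    word = punctuation.sub("", word)
unchanged = list(word_list)  # snapshot taken after the loop

# The list comprehension builds a new list and rebinds word_list to it.
word_list = [punctuation.sub("", word) for word in word_list]
cleaned = word_list

print(unchanged)  # ['hello,', 'world!', '(verse', '12)']
print(cleaned)    # ['hello', 'world', 'verse', '']
```

Note that a token made up entirely of removed characters, such as "12)", becomes an empty string. If those are unwanted, something like [w for w in cleaned if w] would drop them.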

You're not updating your word list. Try

for i, word in enumerate(word_list):
    word_list[i] = punctuation.sub("", word)

Remember that although word starts off as a reference to the string object in word_list, assignment rebinds the name word to the new string object returned by the sub function. It doesn't change the originally referenced object.
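To make the rebinding point concrete, here is a small Python 3 sketch. The sample strings and the names original, stripped, alias and same_object are assumptions for illustration only. It shows that sub returns a new string and leaves the original alone, and that assigning by index changes the list object in place, so any other name bound to the same list sees the update.

```python
import re

# Same pattern as in the question.
punctuation = re.compile(r'[-.?!,":;()|0-9]')

# Strings are immutable: sub returns a new string object.
original = "peace."
stripped = punctuation.sub("", original)
print(original)  # peace.
print(stripped)  # peace

# Hypothetical sample words; alias is a second name for the same list.
word_list = ["salam,", "peace."]
alias = word_list

# Assigning through the index mutates the list object itself.
for i, word in enumerate(word_list):
    word_list[i] = punctuation.sub("", word)

same_object = alias is word_list
print(same_object)  # True
print(alias)        # ['salam', 'peace']
```

This is also where the two fixes can differ in practice: the enumerate loop updates the existing list in place, whereas rebinding word_list to a list comprehension creates a new list and leaves any other reference to the old one unchanged.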

