Remove special characters from the start and end of a word while counting the words in a file
I need to count the words in a huge text file, but first I have to clean the file of special characters in a specific way.
For example:
;xyz --> xyz
xyz: --> xyz
xyz!) --> xyz!
I am using flatMap() to split all the words on whitespace. Then I try to remove the special characters, but that part is not working. Please help!
The characters to remove are the following: : ; ! ? ( ) .
Here is the code I am using:
>>> input = sc.textFile("file:///home/<...>/Downloads/file.txt")
>>> input2 = input.flatMap(lambda x: x.split())
>>> def remove(x):
...     if x.endsWith(':'):
...         x.replace(':','')
...         return x
...     elif x.endsWith('.'):
...         x.replace('.','')
...         return x
...     # ... (remaining branches elided in the original)
...
>>> input3 = input2.map(lambda x: remove(x))
You can write a function that checks whether a character is valid, then use filter():
def is_valid(char):
    return char.isalpha() or char in "!,."  # Whatever extras you want to include

new_string = ''.join(filter(is_valid, old_string))  # No need to ''.join() in Python 2
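As a quick check of the filter approach, here is a minimal sketch applied to the three example words from the question. The helper name clean_word is hypothetical, and the set of allowed extras is simply the one from the snippet above. Note that this drops disallowed characters anywhere in the word, not only at the start and end, and that isalpha() also rejects digits.

```python
def is_valid(char):
    # Keep letters plus a few extra characters (assumed set; adjust as needed).
    return char.isalpha() or char in "!,."


def clean_word(word):
    # Hypothetical helper: in Python 3, filter() yields characters,
    # so they are joined back into a string.
    return ''.join(filter(is_valid, word))


for word in [";xyz", "xyz:", "xyz!)"]:
    print(word, "-->", clean_word(word))
```

If digits or interior punctuation must survive, this approach would need a different validity test.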
Try using a regex:
import re

with open('input.txt', 'r') as fp:
    # Raw string so that \) is passed to the regex engine unchanged
    rx = r"[;:\)]+"
    for line in fp:
        data = re.sub(rx, "", line.strip())
        print(data)
The code above reads the file line by line and prints the sanitized content. With the example words from the question as the file's content, it prints:
xyz
xyz
xyz!
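The pattern above matches the listed characters anywhere in the line. If only the edges of each word should be touched, one possible variation is to anchor the character class with ^ and $. This is a sketch under assumptions: the helper name trim_edges is hypothetical, and the character set is the one listed in the question.

```python
import re

# Assumed character set, taken from the list in the question.
EDGE = re.compile(r"^[:;!?().]|[:;!?().]$")


def trim_edges(word):
    # Hypothetical helper: removes at most one special character
    # from the start and at most one from the end of the word.
    return EDGE.sub("", word)


for word in [";xyz", "xyz:", "xyz!)", "x:y"]:
    print(word, "-->", trim_edges(word))
```

Because the class is anchored, a character in the middle of a word (as in x:y) is left alone.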
This is the code that worked for me:
def removefromstart(x):
    for i in [':','!','?','.',')','(',';',',']:
        if x.startswith(i):
            token = x.replace(i,'')
            return token
    return x

def removefromend(x):
    for i in [':','!','?','.',')','(',';',',']:
        if x.endswith(i):
            token = x.replace(i,'')
            return token
    return x
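One thing to be aware of with the functions above: str.replace() removes every occurrence of the matched character, so a word such as a.b. would become ab rather than a.b. If interior characters should be kept, a slicing-based variant is one option. The sketch below is an assumption-laden illustration, not the original poster's method: the name trim_once is hypothetical, and the counting is done with collections.Counter on a small in-memory string instead of Spark.

```python
from collections import Counter

# Same characters as in the lists above.
SPECIAL = ":!?.)(;,"


def trim_once(word):
    # Hypothetical helper: drop one special character from each end by slicing,
    # leaving any occurrences inside the word untouched.
    if word and word[0] in SPECIAL:
        word = word[1:]
    if word and word[-1] in SPECIAL:
        word = word[:-1]
    return word


# Small in-memory stand-in for the file contents.
text = ";xyz xyz: xyz!) a.b."
counts = Counter(trim_once(w) for w in text.split())
print(counts)
```

In the PySpark pipeline from the question, a function like this could presumably be passed to map() in place of remove, for example input2.map(trim_once), before the counting step.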