
Remove special characters from the start and end of a word while counting the words in a file

I need to count the words in a huge text file, but before that I have to clean the file of special characters in a specific way.

For example:

;xyz        -->      xyz      
xyz:        -->     xyz          
xyz!)       -->     xyz!

I am using flatMap() to split the text into words on whitespace. I then try to remove the special characters, but that step is not working. Please help!

Here is the code I am using.

The characters to remove are: : ; ! ? ( ) .

   >>> input = sc.textFile("file:///home/<...>/Downloads/file.txt")
   >>> input2 = input.flatMap(lambda x: x.split())
   >>> def remove(x):
   ...     if x.endsWith(':'):
   ...         x.replace(':','')
   ...         return x
   ...     elif x.endsWith('.'):
   ...         x.replace('.','')
   ...         return x
   . .
   >>> input3 = input2.map(lambda x: remove(x))

Use re.sub:

re.sub(r'(?<!\S)[^\s\w]+|[^\s\w]+(?!\S)', '', f.read())
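As a rough illustration of how that pattern might feed a word count, here is a minimal sketch in plain Python. The names EDGE_PUNCT, count_words and the sample string are made up for this example and are not part of the original answer. The pattern removes runs of punctuation only where they touch the start or end of a whitespace-delimited token, so an interior apostrophe survives.

```python
import re
from collections import Counter

# Pattern from the answer above: punctuation runs at the start or the end
# of a whitespace-delimited token.
EDGE_PUNCT = re.compile(r'(?<!\S)[^\s\w]+|[^\s\w]+(?!\S)')

def count_words(text):
    """Strip edge punctuation from every token, then count the tokens."""
    cleaned = EDGE_PUNCT.sub('', text)
    # split() with no arguments also discards the empty gaps left behind
    # by tokens that consisted only of punctuation.
    return Counter(cleaned.split())

sample = ";xyz xyz: xyz!) don't"  # hypothetical sample text
counts = count_words(sample)
print(counts)
```

Note that this pattern strips the whole trailing run, so xyz!) becomes xyz rather than the xyz! shown in the question.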


You can write a function that checks whether a character is valid, then use filter():

def is_valid(char):
    return char.isalpha() or char in "!,." # Whatever extras you want to include

new_string = ''.join(filter(is_valid, old_string)) # No need to ''.join() in Python 2
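To see what this filter does to the tokens from the question, here is a small sketch. The clean helper and the list of sample tokens are assumptions added for illustration. Because the filter looks at every character, it also drops characters in the middle of a word, not just at the edges.

```python
def is_valid(char):
    # Keep letters plus a few punctuation marks, as in the answer above.
    return char.isalpha() or char in "!,."

def clean(token):
    # ''.join() is needed in Python 3, where filter() returns an iterator.
    return ''.join(filter(is_valid, token))

# Hypothetical sample tokens, taken from the question plus one contraction.
for token in [";xyz", "xyz:", "xyz!)", "don't"]:
    print(token, "->", clean(token))
```

With this choice of allowed characters the last token loses its apostrophe, and digits would be removed as well, so the whitelist may need adjusting for real text.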

Try using a regex:

import re

with open('input.txt','r') as fp:
    rx = r"[;:)]+"  # raw string; matches every ';', ':' and ')' in the line
    for line in fp:
        data = re.sub(rx, "", line.strip())
        print(data)

The code above reads the file line by line and prints the sanitized content. Depending on the content of the file, it will print:

xyz
xyz
xyz!

This is the code that worked for me:

def removefromstart(x):
    for i in [':','!','?','.',')','(',';',',']:
        if x.startswith(i):
            token = x.replace(i,'')
            return token
    return x

def removefromend(x):
    for i in [':','!','?','.',')','(',';',',']:
        if x.endswith(i):
            token = x.replace(i,'')
            return token
    return x
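One caveat with the functions above: x.replace(i,'') removes every occurrence of the character in the word, not only the one at the edge, and only the first matching character is handled. A possible alternative, sketched here with the built-in str.strip and the same character set, trims any run of those characters from both ends and leaves the interior alone. The names EDGE_CHARS, clean_token, count_words and the sample lines are invented for this sketch.

```python
from collections import Counter

# Same characters as the lists in the answer above.
EDGE_CHARS = ":!?.)(;,"

def clean_token(token):
    # str.strip(chars) removes any leading and trailing characters found
    # in chars; characters inside the word are left untouched.
    return token.strip(EDGE_CHARS)

def count_words(lines):
    counts = Counter()
    for line in lines:
        for token in line.split():
            word = clean_token(token)
            if word:  # skip tokens that were only punctuation
                counts[word] += 1
    return counts

sample_lines = [";xyz xyz: xyz!)", "(a.b) ..."]  # hypothetical input
counts = count_words(sample_lines)
print(counts)
```

In the PySpark setting from the question, clean_token could presumably be passed to map on the RDD of words, followed by a filter that drops empty strings. That part is untested here.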


