简体   繁体   English

从文件中删除所有非字母数字字符,并在python中以相同的名称保存

[英]remove all nonalphanumeric characters from a file and save it with the same name in python

I have a .txt document containing a long body of text mixed with some special characters that I want to remove. 我有一个.txt文档,其中包含一长串文本以及一些要删除的特殊字符。 I want to do something like re.findAll to extract all words and save the file with the other characters filtered out. 我想做类似re.findAll的操作来提取所有单词并保存文件,并过滤掉其他字符。 How can I open the file, extract all non alphanumeric characters, and then save it with the same file name (with all spaces in place, obviously)? 我如何打开文件,提取所有非字母数字字符,然后使用相同的文件名保存(显然所有空格都留着)? Is there a better way with re.sub? re.sub有更好的方法吗?

import re
hand = open('document.txt')
for line in hand:
    line = line.rstrip()
    re.sub(r'\W+', ' ', line) 

The document looks like this: But was it a ride after! I loved all the characters mainly because everyone in the book has shades of gray and that is how real characters are supposed to be. The emotions were real and took their time to settle in and yet the story was fast paced. Definitely recommend. Hsnan Hn wqt \\'n \\'ltqT \\'nfsy w\\'`yd lhdw l~Wan b`d tlk lrHl@ lmt`b@ lmlyy\\'@ blGmwD wlbHth wlshkwk ..fy lbdy@ , `lyk `zyzy lqry lntbh qbl qr@ tlk lrwy@ l~ `d@ \\'shy , \\'wlh \\'n lrwy@ lyst rwmnsy@ \\'w drm `ks m ywHy smh , fh~ rwy@ tntmy l~ lfy\\'@ lbwlysy@ , fy\\'@ ljry\\'m wlkhywT lmtshbk@ .thny tlk l\\'shy hw \\'n tlk lrwy@ stj`lk mtsmran \\'mmh s`t Twyl@ dwn mll , wmn thm `lyk \\'n tnthy mn mshGlk wtj`l nfsk \\'syran lhdh l`ml dwn swh ..fy \\'wl~ lSfHt , stjd nfsk l tstw`bm yHdth , wstjd nfsk ttsl mn hw\\'l ? wm l`lq@ bynhm ? wlknh lbdy@ fqT , `lyk \\'n tmsk blkhywT lbdy\\'y@ wb`d dhlk \\'trk nfsk tmman dwn ltfkyr fy shy , stjd \\'nl\\'Hdth lmttly@ wlmt`qb@ \\'d@ jdhb l ymknk ltGDy `nh ..l twjd shkhSy@ ry\\'ysy@ fy tlk lrwy@ , jmy` lshkhSyt lwrd@ lhm \\'dwr mHwry@ , rytshyl twm an skwt myGyn kml abdyk , wlkn ymkn lqwl \\'n lqTr hw syd lmwqf hn fy tlk lrwy@ , fmn khllh tbd\\' l\\'Hdth wttTwr wlwlh lm wjdt lHbk@ \\'w wsyl@ tS`dl\\'mwr ..rGm \\'n tslsl l\\'Hdth dkhl lrwy@ wlntql mn lmDy l~ lHDr wl`ks hw \\'mr mrhq l \\'n dhlk \\'Df~ lmzyd mn lthr@ wltshwyq dkhl tlk lrwy@ dwn l`ml `l~ Hrq l\\'Hdth , bl`ks tj`ll\\'mr wk\\'nh \\'shbh bfsyfs ttDH m`lmh klm tkmlt mkwnth ..ltrjm@ ? l Gbr `lyh , wlm \\'sh`r wk\\'nny \\'mm `ml mtrjm mn l\\'Sl ..\\'slwb lm`lj@ mmtz , lHbk@ ldrmy@ mmtz@ , lktb@ njHt fy stGll kf@ mqwmt lktb@ lnjH@ w\\'khrjt ln `mlan mmyzan khS@ w\\'nny l \\'myl l~ fy\\'@ lrwyt lbwlysy@ wlkn l\\'mr \\'khtlf hn ..mlHwZ@ \\'khyr@ : `ndm nZrt l~ Swr@ lktb@ fy nhy@ tlk lrwy@ sh`rt wk\\'nh qtl@ 该文件看起来像这样: But was it a ride after! I loved all the characters mainly because everyone in the book has shades of gray and that is how real characters are supposed to be. The emotions were real and took their time to settle in and yet the story was fast paced. Definitely recommend. Hsnan Hn wqt \\'n \\'ltqT \\'nfsy w\\'`yd lhdw l~Wan b`d tlk lrHl@ lmt`b@ lmlyy\\'@ blGmwD wlbHth wlshkwk ..fy lbdy@ , `lyk `zyzy lqry lntbh qbl qr@ tlk lrwy@ l~ `d@ \\'shy , \\'wlh \\'n lrwy@ lyst rwmnsy@ \\'w drm `ks m ywHy smh , fh~ rwy@ tntmy l~ lfy\\'@ lbwlysy@ , fy\\'@ ljry\\'m wlkhywT lmtshbk@ .thny tlk l\\'shy hw \\'n tlk lrwy@ stj`lk mtsmran \\'mmh s`t Twyl@ dwn mll , wmn thm `lyk \\'n tnthy mn mshGlk wtj`l nfsk \\'syran lhdh l`ml dwn swh ..fy \\'wl~ lSfHt , stjd nfsk l tstw`bm yHdth , wstjd nfsk ttsl mn hw\\'l ? wm l`lq@ bynhm ? wlknh lbdy@ fqT , `lyk \\'n tmsk blkhywT lbdy\\'y@ wb`d dhlk \\'trk nfsk tmman dwn ltfkyr fy shy , stjd \\'nl\\'Hdth lmttly@ wlmt`qb@ \\'d@ jdhb l ymknk ltGDy `nh ..l twjd shkhSy@ ry\\'ysy@ fy tlk lrwy@ , jmy` lshkhSyt lwrd@ lhm \\'dwr mHwry@ , rytshyl twm an skwt myGyn kml abdyk , wlkn ymkn lqwl \\'n lqTr hw syd lmwqf hn fy tlk lrwy@ , fmn khllh tbd\\' l\\'Hdth wttTwr wlwlh lm wjdt lHbk@ \\'w wsyl@ tS`dl\\'mwr ..rGm \\'n tslsl l\\'Hdth dkhl lrwy@ wlntql mn lmDy l~ lHDr wl`ks hw \\'mr mrhq l \\'n dhlk \\'Df~ lmzyd mn lthr@ wltshwyq dkhl tlk lrwy@ dwn l`ml `l~ Hrq l\\'Hdth , bl`ks tj`ll\\'mr wk\\'nh \\'shbh bfsyfs ttDH m`lmh klm tkmlt mkwnth ..ltrjm@ ? l Gbr `lyh , wlm \\'sh`r wk\\'nny \\'mm `ml mtrjm mn l\\'Sl ..\\'slwb lm`lj@ mmtz , lHbk@ ldrmy@ mmtz@ , lktb@ njHt fy stGll kf@ mqwmt lktb@ lnjH@ w\\'khrjt ln `mlan mmyzan khS@ w\\'nny l \\'myl l~ fy\\'@ lrwyt lbwlysy@ wlkn l\\'mr \\'khtlf hn ..mlHwZ@ \\'khyr@ : `ndm nZrt l~ Swr@ lktb@ fy nhy@ tlk lrwy@ sh`rt wk\\'nh qtl@ But was it a ride after! I loved all the characters mainly because everyone in the book has shades of gray and that is how real characters are supposed to be. The emotions were real and took their time to settle in and yet the story was fast paced. Definitely recommend. Hsnan Hn wqt \\'n \\'ltqT \\'nfsy w\\'`yd lhdw l~Wan b`d tlk lrHl@ lmt`b@ lmlyy\\'@ blGmwD wlbHth wlshkwk ..fy lbdy@ , `lyk `zyzy lqry lntbh qbl qr@ tlk lrwy@ l~ `d@ \\'shy , \\'wlh \\'n lrwy@ lyst rwmnsy@ \\'w drm `ks m ywHy smh , fh~ rwy@ tntmy l~ lfy\\'@ lbwlysy@ , fy\\'@ ljry\\'m wlkhywT lmtshbk@ .thny tlk l\\'shy hw \\'n tlk lrwy@ stj`lk mtsmran \\'mmh s`t Twyl@ dwn mll , wmn thm `lyk \\'n tnthy mn mshGlk wtj`l nfsk \\'syran lhdh l`ml dwn swh ..fy \\'wl~ lSfHt , stjd nfsk l tstw`bm yHdth , wstjd nfsk ttsl mn hw\\'l ? wm l`lq@ bynhm ? wlknh lbdy@ fqT , `lyk \\'n tmsk blkhywT lbdy\\'y@ wb`d dhlk \\'trk nfsk tmman dwn ltfkyr fy shy , stjd \\'nl\\'Hdth lmttly@ wlmt`qb@ \\'d@ jdhb l ymknk ltGDy `nh ..l twjd shkhSy@ ry\\'ysy@ fy tlk lrwy@ , jmy` lshkhSyt lwrd@ lhm \\'dwr mHwry@ , rytshyl twm an skwt myGyn kml abdyk , wlkn ymkn lqwl \\'n lqTr hw syd lmwqf hn fy tlk lrwy@ , fmn khllh tbd\\' l\\'Hdth wttTwr wlwlh lm wjdt lHbk@ \\'w wsyl@ tS`dl\\'mwr ..rGm \\'n tslsl l\\'Hdth dkhl lrwy@ wlntql mn lmDy l~ lHDr wl`ks hw \\'mr mrhq l \\'n dhlk \\'Df~ lmzyd mn lthr@ wltshwyq dkhl tlk lrwy@ dwn l`ml `l~ Hrq l\\'Hdth , bl`ks tj`ll\\'mr wk\\'nh \\'shbh bfsyfs ttDH m`lmh klm tkmlt mkwnth ..ltrjm@ ? l Gbr `lyh , wlm \\'sh`r wk\\'nny \\'mm `ml mtrjm mn l\\'Sl ..\\'slwb lm`lj@ mmtz , lHbk@ ldrmy@ mmtz@ , lktb@ njHt fy stGll kf@ mqwmt lktb@ lnjH@ w\\'khrjt ln `mlan mmyzan khS@ w\\'nny l \\'myl l~ fy\\'@ lrwyt lbwlysy@ wlkn l\\'mr \\'khtlf hn ..mlHwZ@ \\'khyr@ : `ndm nZrt l~ Swr@ lktb@ fy nhy@ tlk lrwy@ sh`rt wk\\'nh qtl@

I want to delete all special characters, punctuation, and leave only the [a-zA-Z0-9] characters. 我想删除所有特殊字符,标点符号,仅保留[a-zA-Z0-9]字符。

\\W is the metacharacter signifying any non-alphanumeric character. \\W是表示任何非字母数字字符的元字符。

Create a list to store your lines that have been stripped of non-alpha/num characters, and then write these lines back to the same file. 创建一个列表来存储已去除非字母/数字字符的行,然后将这些行写回到同一文件。

import re

with open('document.txt') as hand:
    lines = []
    for line in hand:
        lines.append(re.sub("[\W]", "", line))

with open('document.txt', 'w') as hand:
    for line in lines:
        hand.write(line)

Output: 输出:

To keep the spaces: 要保留空间:

re.sub("[^\s\w]+", "", line)

You can str.translate if you only have ascii in your file and write to a tempfile , then replace the original using shutil.move after writing.: 如果文件中仅包含ascii并写入tempfile ,则可以进行shutil.move ,然后在写入后使用shutil.move替换原始tempfile

from tempfile import NamedTemporaryFile
from shutil import move


with open("document.txt") as f, NamedTemporaryFile(dir=".", delete=False) as tmp:
   _del =  "".join(filter(lambda x: not x.isalnum(), map(chr, range(256)))).replace(" ", "")
   for line in f:    
        tmp.write(line.translate(None, _del))

move(tmp.name, "document.txt")

An example using translate on a snippet of your data: 在您的数据片段上使用translation的示例:

 In [31]: s = '''The emotions were real and took their time to settle
 in and yet the story was fast paced. Definitely recommend.
 Hsnan Hn wqt \'n \'ltqT \'nfsy w\'`yd lhdw l~Wan b`d tlk lrHl@ lmt`b@ 
lmlyy\'@ blGmwD wlbHth wlshkwk ..fy lbdy@ , `lyk `zyzy lqry lntbh qbl 
qr@ tlk lrwy@ l~ `d@ \'shy , \'wlh \'n lrwy@ lyst rwmnsy@ \'w drm 
`ks m ywHy smh , fh~ rwy@ tntmy l~ lfy\'@ lbwlysy@ , fy\'@ ljry\'m wlkhywT lmtshbk@ .thny tlk l\'shy hw \'n tlk lrwy@ stj`lk mtsmran \'mmh s`t Twyl@ dwn mll '''

In [32]: s.translate(None, _del)
Out[32]: 'The emotions were real and took their time to settle in and
 yet the story was fast paced Definitely recommend Hsnan Hn wqt n ltqT
 nfsy wyd lhdw lWan bd tlk lrHl lmtb lmlyy blGmwD wlbHth wlshkwk 
fy lbdy  lyk zyzy lqry lntbh qbl qr tlk lrwy l d shy  wlh n lrwy lyst
 rwmnsy w drm ks m ywHy smh  fh rwy tntmy l lfy lbwlysy  fy ljrym 
wlkhywT lmtshbk thny tlk lshy hw n tlk lrwy stjlk mtsmran mmh st Twyl 
dwn mll '

Are you looking for something like this: 您是否正在寻找这样的东西:

re.sub(r'[^\w' + string.printable + ']', '', some_text_string)

File: 文件:

This is alpha numeric.
This is nøt alphanumeric.

Read into interpreter: 阅读解释器:

This is alpha numeric.\nThis is n\xc3\xb8t alphanumeric.

After re.sub: 重新订阅后:

This is alpha numeric.\nThis is nt alphanumeric.

EDIT: Saw your edit. 编辑:看到您的编辑。 How about: 怎么样:

re.sub(r'[^\wA-Za-z0-9 ]', '', text)

Your text is now: 您的文字现在是:

But was it a ride after I loved all the characters mainly because everyone in the book has shades of gray and that is how real characters are supposed to be The emotions were real and took their time to settle in and yet the story was fast paced Definitely recommend Hsnan Hn wqt n ltqT nfsy wyd lhdw lWan bd tlk lrHl lmtb lmlyy blGmwD wlbHth wlshkwk fy lbdy lyk zyzy lqry lntbh qbl qr tlk lrwy ld shy wlh n lrwy lyst rwmnsy w drm ks m ywHy smh fh rwy tntmy l lfy lbwlysy fy ljrym wlkhywT lmtshbk thny tlk lshy hw n tlk lrwy stjlk mtsmran mmh st Twyl dwn mll wmn thm lyk n tnthy mn mshGlk wtjl nfsk syran lhdh lml dwn swh fy wl lSfHt stjd nfsk l tstwb m yHdth wstjd nfsk ttsl mn hwl wm llq bynhm wlknh lbdy fqT lyk n tmsk blkhywT lbdyy wbd dhlk trk nfsk tmman dwn ltfkyr fy shy stjd n lHdth lmttly wlmtqb d jdhb l ymknk ltGDy nh l twjd shkhSy ryysy fy tlk lrwy jmy lshkhSyt lwrd lhm dwr mHwry rytshyl twm an skwt myGyn kml abdyk wlkn ymkn lqwl n lqTr hw syd lmwqf hn fy tlk lrwy fmn khllh tbd lHdth wttTwr wlwlh lm wjdt lHbk w wsyl tSd lmwr rGm n tslsl lHdth dkhl lrwy wlntql mn lmDy l lHDr wlks hw mr mrhq ln dhlk Df lmzyd mn lthr wltshwyq dkhl tlk lrwy dwn lml l Hrq lHdth blks tjl lmr wknh shbh bfsyfs ttDH mlmh klm tkmlt mkwnth ltrjm l Gbr lyh wlm shr wknny mm ml mtrjm mn lSl slwb lmlj mmtz lHbk ldrmy mmtz lktb njHt fy stGll kf mqwmt lktb lnjH wkhrjt ln mlan mmyzan khS wnny l myl l fy lrwyt lbwlysy wlkn lmr khtlf hn mlHwZ khyr ndm nZrt l Swr lktb fy nhy tlk lrwy shrt wknh qtl

Using the re.sub function and the the \\W meta-character. 使用re.sub函数和\\W元字符。 You can open your file for reading and writing using the r+ option. 您可以使用r+选项打开文件以进行读写。

import re
import io
with open('document.txt', 'r+') as f:
    buf = io.StringIO(re.sub('\W', ' ', f.read()))
    f.seek(0)
    buf.seek(0)
    f.write(buf.getvalue())

After your file's content looks like this: 文件内容如下所示:

But was it a ride after I loved all the characters mainly because everyone in the book has shades of gray and that is how real characters are supposed to be The emotions were real and took their time to settle in and yet the story was fast paced Definitely recommend Hsnan Hn wqt n ltqT nfsy w yd lhdw l Wan bd tlk lrHl lmt b lmlyy blGmwD wlbHth wlshkwk fy lbdy lyk zyzy lqry lntbh qbl qr tlk lrwy ld shy wlh n lrwy lyst rwmnsy w drm ks m ywHy smh fh rwy tntmy l lfy lbwlysy fy ljry m wlkhywT lmtshbk thny tlk l shy hw n tlk lrwy stj lk mtsmran mmh st Twyl dwn mll wmn thm lyk n tnthy mn mshGlk wtj l nfsk syran lhdh l ml dwn swh fy wl lSfHt stjd nfsk l tstw bm yHdth wstjd nfsk ttsl mn hw l wm l lq bynhm wlknh lbdy fqT lyk n tmsk blkhywT lbdy y wb d dhlk trk nfsk tmman dwn 但这是我爱上了所有角色之后的一次旅程,主要是因为书中的每个人都有灰色阴影,这才是真实角色应该被认为的样子。情感是真实的,并花了一些时间来解决,但故事的节奏是肯定的。推荐Hsnan Hn wqt n ltqT nfsy w yd lhdw l Wan bd tlk lrHl lmt b lmlyy blGmwD wlbHth wlshkwk fy lbdy lyk zyzy lqry lntbh qbl qr tlk lrwy ldy yyhh w lh wy fy ljry m wlkhywT lmtshbk thny tlk l shy hw n tlk lrwy stj lk mtsmran mmh st Twyl dwn mll wmn thm lyk n tnthy mn mshGlk wtj l nfsk syran lhdh l ml dwn swh fy wl nstf lnfsk tst hw l wm l lq bynhm wlknh lbdy fqT lyk n tmsk blkhywT lbdy y wb d dhlk trk nfsk tmman dwn

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

相关问题 如何在python中删除非字母数字字符但保留一些特殊字符 - How to remove nonalphanumeric character in python but keep some special characters 使用 python 从目录中的每个文件名中删除字符 - remove characters from every file name in directory with python 在Python中删除文件名上的某些字符模式 - Remove certain pattern of characters on a file name in Python 如何从文件名或字符串中删除所有数字-Python3 - How to remove all numbers from a file name or a string - Python3 从Python中删除字符串中的所有十六进制字符 - Remove all hex characters from string in Python 如何从python中的文本文件中删除所有带有大写字母和数字和特殊字符的行以及所有长度超过10个字符的行 - How to remove all lines with caps AND digits AND special characters AND all the lines longer than 10 characters from a text file in python 从python中的excel文件中删除波浪号和字符 - Remove tildes and characters from excel file in python 使用python从pandas数据帧列中删除非法文件名字符 - Remove illegal file name characters from pandas dataframe column using python 删除列表中所有相同的字符 - Remove all same characters in list 如何从字符串中删除字符并将其保存在列表中? - python - How to remove characters from a string and save it in a list? - python
 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM