
Multiple regex string replace on large text file using Python

I have some very large text files on which I want to execute multiple regex-based string replacements. Currently I am doing it using Sublime's find-and-replace feature. However, on files larger than a GB, my system hangs.

I am currently running some of the matches below in Sublime:

\\\n - Remove every backslash followed by a newline.

\n - Remove all newlines.

\=\\\" - Replace all instances of =\" with just ="

In one case, I also want to capture a group in the match and use it in the replacement text.
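For reference, all of these replacements can be expressed with Python's `re.sub`; a minimal sketch (the sample text and the `id=` pattern are made up for illustration):

```python
import re

text = 'line one\\\n=\\"value\\"\nline two\n'

# Remove every backslash that is immediately followed by a newline.
text = re.sub(r'\\\n', '', text)
# Replace =\" with just =".
text = re.sub(r'=\\"', '="', text)
# Remove all remaining newlines.
text = re.sub(r'\n', '', text)

# Grouped match reused in the replacement: \1 is the captured group.
renamed = re.sub(r'id=(\d+)', r'record-\1', 'id=42')  # → 'record-42'
```

Note that the order matters: backslash-newline pairs are removed before bare newlines, mirroring the order of the Sublime searches above.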

Some experts around me suggested writing a quick Python script for this, saying performance won't be an issue.

With my limited Python knowledge, I tried something like the below:

import pandas as pd
import numpy as np

df = pd.read_csv('story_all.csv')

output = df.str.replace('\n', '')

output.to_csv('story_done.csv', sep='\n', encoding='utf-8')

It, however, isn't working, and I think I might be overdoing it somewhere.


Note: the fact that the text file is a CSV doesn't really matter. I just need to execute some string replacements, as long as the newlines required by the CSV structure are preserved in the process.


The error I am getting is below:

Traceback (most recent call last):
  File "replace.py", line 4, in <module>
    df = pd.read_csv('story_all.csv')
  File "/Users/safwan/Envs/regex/lib/python2.7/site-packages/pandas/io/parsers.py", line 709, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Users/safwan/Envs/regex/lib/python2.7/site-packages/pandas/io/parsers.py", line 455, in _read
    data = parser.read(nrows)
  File "/Users/safwan/Envs/regex/lib/python2.7/site-packages/pandas/io/parsers.py", line 1069, in read
    ret = self._engine.read(nrows)
  File "/Users/safwan/Envs/regex/lib/python2.7/site-packages/pandas/io/parsers.py", line 1839, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 902, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 924, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 978, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 965, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2208, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 19 fields in line 8058, saw 65

An example of the CSV file's contents:

id,title,name_in_english,type,water_directory_term,org_work_area_term,org_type_term,defined_state,org_location_taluka_term,org_location_state_term,org_location_village_term,org_name_term,ha_free_term,org_location_dist_term,fax,samprak_bekti,email,phoneno,website/blog,postal_address,sangathan_ke_bare_main,rajya_state,taluka_sahar,jilla_district,kisi_prakar_kaa_sangathan,name,ID,created,status
"883","some title","","org","lorem","ipsum","lorem","","","very large body field","","","","","admin","1","1230273749","1"
"884","some title","","org","lorem","ipsum","lorem","","","very large body field","","","","","admin","1","1230273749","1"
"885","some title","","org","lorem","ipsum","lorem","","","very large body field","","","","","admin","1","1230273749","1"
"886","some title","","org","lorem","ipsum","lorem","","","very large body field","","","","","admin","1","1230273749","1"
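Incidentally, the ParserError above is raised when a row contains a different number of fields than the header (the sample rows above also carry fewer fields than the header line, possibly because they are abbreviated). If needed, the offending lines can be located with the standard csv module; a small sketch over made-up in-memory data:

```python
import csv
import io

# Stand-in for story_all.csv: the third line has one field too many.
data = io.StringIO('"id","title"\n"1","a"\n"2","b","extra"\n')

reader = csv.reader(data)
header = next(reader)
bad_rows = [
    (lineno, len(row))
    for lineno, row in enumerate(reader, start=2)
    if len(row) != len(header)
]
print(bad_rows)  # (line number, field count) for each malformed row
```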

If I understand correctly, you could do as below. This seems to work with the data sample you shared:

import pandas as pd

df = pd.read_csv('story_all.csv', sep=',')

# Chars to replace
chars = [
    '\n',
]

output = df.replace(chars, '', regex=True)
output.to_csv('story_done.csv', sep=',', encoding='utf-8', index=False)

I was finally able to do the required task without the help of pandas. While the approach reads the whole file into memory, it works fairly well for files up to 1-1.5 GB on my MacBook Pro, which serves my purpose. I found the base code for this here.

# import the modules that we need. (re is for regex)
import os, re

# set the working directory for a shortcut
os.chdir('/Users/username/Code/python/regex')

# open the source file and read it
# fh = open('org.csv', 'r')
fh = open('story_all.csv', 'r')
thetext = fh.read()
fh.close()

# create the pattern object. Note the "r". In case you're unfamiliar with Python
# this is to set the string as raw so we don't have to escape our escape characters

#match all newline followed by backslash.
p1 = re.compile(r'\n\\')
# p2 = re.compile(r'\n')
#match all newline except the one followed by digits in quotes.
p2 = re.compile(r'\n+(?!\"\d+\")')
p3 = re.compile(r'\\N')
p4 = re.compile(r'\=\\\"')

# do the replace
result = p1.sub("", thetext)
result = p2.sub("", result)
result = p3.sub("", result)
result = p4.sub('="', result)

# write the file
f_out = open('done.csv', 'w')
f_out.write(result)
f_out.close()

It takes around 30-40 seconds when run against files close to 1 GB.
