How to substitute specific patterns in Python

Question

I want to replace every integer that is greater than 2147483647 and is followed by ^^<int> with the first 3 digits of that number. For example, my original data is:

<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".
"Ask a Question"  <at> "25500000000"^^<int> <stack_overflow> .
<basic> "language" "89028899" <html>.

I want to turn the original data into the following:

 <stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".
 "Ask a Question"  <at> "255"^^<int> <stack_overflow> .
 <basic> "language" "89028899" <html>.

My current implementation scans the data line by line. If I find a number greater than 2147483647, I replace it with its first 3 digits. However, I don't know how to check that the next part of the string is ^^<int>.

In short: for numbers greater than 2147483647, e.g. 25500000000, I want to replace them with the first 3 digits of the number. Since my data is 1 terabyte in size, a fast solution would be much appreciated.

Answer 1

Use the re module to construct a regular expression:

import re

regex = r"""
(")              # Opening quote into group #1
(\d{10,})        # A run of at least 10 digits into group #2
                 # (anything above 2147483647 has 10 or more digits)
("\^\^<int>)     # Closing quote, two escaped circumflex characters
                 # and the literal <int> into group #3
"""

COMPILED_REGEX = re.compile(regex, re.VERBOSE)

Next, we need to define a callback function (the compiled pattern's sub method accepts a callable as the replacement) to handle the replacement:

def replace_callback(matches):
    full_line = matches.group(0)
    number_text = matches.group(2)
    number_of_interest = int(number_text, base=10)
    if number_of_interest > 2147483647:
        # str.replace needs strings, so pass the digit text, not the int
        return full_line.replace(number_text, number_text[:3])
    else:
        return full_line

And then find and replace:

fixed_data = COMPILED_REGEX.sub(replace_callback, YOUR_DATA)
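
As a quick sanity check, the regex-plus-callback approach can be run against the three sample lines from the question. This is only a minimal sketch: the names PATTERN, INT32_MAX and shorten are made up for illustration, and the pattern assumes every number of interest appears exactly as "digits"^^<int>.

```python
import re

# Hypothetical names; assumes numbers of interest look like "digits"^^<int>.
PATTERN = re.compile(r'(")(\d{10,})("\^\^<int>)')
INT32_MAX = 2147483647


def shorten(match):
    """Return the match with the number cut to 3 digits if it is too large."""
    number_text = match.group(2)
    if int(number_text) > INT32_MAX:
        # Rebuild the match: opening quote, first 3 digits, "^^<int> suffix
        return match.group(1) + number_text[:3] + match.group(3)
    # 10-digit numbers at or below the limit are left untouched
    return match.group(0)


sample = (
    '<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".\n'
    '"Ask a Question"  <at> "25500000000"^^<int> <stack_overflow> .\n'
    '<basic> "language" "89028899" <html>.\n'
)

fixed = PATTERN.sub(shorten, sample)
print(fixed)
```

The number in the third line stays as it is, because it is neither large enough nor followed by ^^<int>.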

If you have a terabyte of data you will probably not want to do this in memory. Instead, open the file and iterate over it, replacing the data line by line and writing it out to another file (there are undoubtedly ways to speed this up, but they would make the gist of the technique harder to follow):

# Given the above
def process_data():
    # The backslash continues the with statement onto the next line
    with open("path/to/your/file") as data_file, \
         open("path/to/output/file", "w") as output_file:
        for line in data_file:
            fixed_data = COMPILED_REGEX.sub(replace_callback, line)
            output_file.write(fixed_data)
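
One possible speed-up, sketched here under assumptions rather than measured: read and write the files in binary mode so no text decoding happens, and skip the regex entirely for lines that do not contain ^^<int>. The names MARKER, PATTERN, _shorten and process_file are hypothetical, and whether this is actually faster on a given data set would need to be timed.

```python
import re

# Hypothetical names; whether this beats the text-mode version is untested.
MARKER = b"^^<int>"
# Quoted run of 10+ digits, only when ^^<int> follows (lookahead, not consumed)
PATTERN = re.compile(rb'"(\d{10,})"(?=\^\^<int>)')
INT32_MAX = 2147483647


def _shorten(match):
    digits = match.group(1)
    if int(digits) > INT32_MAX:
        # Keep the quotes, cut the number down to its first 3 digits
        return b'"' + digits[:3] + b'"'
    return match.group(0)


def process_file(src_path, dst_path):
    """Copy src_path to dst_path line by line, shortening large ^^<int> numbers."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            # Cheap substring test first; most lines never reach the regex
            if MARKER in line:
                line = PATTERN.sub(_shorten, line)
            dst.write(line)
```

Calling process_file("path/to/your/file", "path/to/output/file") would then stream the input once without holding it in memory.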

Answer 2

If each line in your text file looks like your example, then you can do this:

In [2078]: line = '"QuestionAndAnsweringWebsite" "fact". "Ask a Question" "25500000000"^^ . "language" "89028899"'

In [2079]: re.findall(r'\d+"\^\^', line)
Out[2079]: ['25500000000"^^']

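import re

with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    for line in infile:
        for found in re.findall(r'\d+"\^\^', line):
            # found ends with '"^^'; strip those 3 characters to get the digits
            if int(found[:-3]) > 2147483647:
                # keep the first 3 digits and put the trailing '"^^' back
                line = line.replace(found, found[:3] + found[-3:])
        outfile.write(line)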

Because of the inner for-loop, this has the potential to be an inefficient solution. However, I can't think of a better regex at the moment, so this should at least get you started.
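
One way the inner loop might be avoided is a single re.sub per line with a lookbehind and a lookahead, so that the match is exactly the digits and the surrounding quote and ^^ are left in place. This is a sketch, not a benchmarked solution; DIGITS_BEFORE_CARETS, INT32_MAX and shorten_line are illustrative names, and the {10,} quantifier relies on the fact that any number above 2147483647 has at least 10 digits.

```python
import re

# Match only the digits that sit between an opening quote and "^^
DIGITS_BEFORE_CARETS = re.compile(r'(?<=")\d{10,}(?="\^\^)')
INT32_MAX = 2147483647


def shorten_line(line):
    """Replace each oversized quoted number before ^^ with its first 3 digits."""
    return DIGITS_BEFORE_CARETS.sub(
        # m.group(0) is just the digits, so slicing it is the whole replacement
        lambda m: m.group(0)[:3] if int(m.group(0)) > INT32_MAX else m.group(0),
        line,
    )


line = '"Ask a Question" "25500000000"^^ . "language" "89028899"'
print(shorten_line(line))
```

In the file-processing loop, outfile.write(shorten_line(line)) would then replace the findall loop.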

2 answers

solution1
3 ACCEPTED 2013-07-27 01:26:02

solution2
1 2013-07-27 01:06:25
