
Python DataFrame - Extract multiple lines between regex matches

I'm working on a Python 3.x project that needs to read a large TXT file, filter it (for example, remove multiple spaces, blank lines, and lines that start with certain strings), and finally split it by regex matches.

What I am doing right now is storing each line in a pandas DataFrame (which makes it easy to delete lines using pandas str.startswith() or str.endswith()). On the other hand, with each line of the text file corresponding to a row in the DataFrame, I can't figure out how to extract the data between regex matches. Here is an example:

| 0 | REGEX MATCH   |
| 1 | data          |
| 2 | data          |
| 3 | REGEX MATCH   |
| 4 | data          |
| 5 | REGEX MATCH   |

So the question is: how can I extract the data between matches (in this example, rows 0 to 2, rows 3 to 4, and row 5)? Is this even possible in pandas?
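One common pandas idiom for "split rows into blocks that start at a match" is a cumulative sum over the match mask: every row between two matches gets the same group number, and groupby then yields one block per match. A minimal sketch with toy data (the placeholder pattern 'MATCH' stands in for the real regex):

```python
import pandas as pd

# Toy data mirroring the example table above.
text = "MATCH A\ndata 1\ndata 2\nMATCH B\ndata 3\nMATCH C\n"
df = pd.DataFrame({"line": text.splitlines()})

# Boolean mask of the matching rows, then a running count:
# rows 0-2 get group 1, rows 3-4 get group 2, row 5 gets group 3.
is_match = df["line"].str.contains(r"MATCH")  # swap in the real pattern
group_id = is_match.cumsum()

blocks = [g["line"].tolist() for _, g in df.groupby(group_id)]
print(blocks)
# [['MATCH A', 'data 1', 'data 2'], ['MATCH B', 'data 3'], ['MATCH C']]
```

Each block keeps its leading match line, which matches the grouping asked for above (0 to 2, 3 to 4, 5).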

Another option is to use read() on the text file and do regular string manipulation instead of a DataFrame for the filtering, splitting, etc., which I'm not sure is appropriate for big text files. In that case I have unwanted data between the regex matches. Example:

s = "This is REGEX_MATCH    while between another \n \n REGEX_MATCH there is some    unwanted data"

In the above, I would need to remove the extra blank spaces and \n characters, and finally use a regex to split on the matches. The only issue is that my source text file is really large.

Pandas is fast for deletion/filtering, while plain string handling is easier for splitting.
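For the plain-string route, the whitespace cleanup and the split can each be a one-liner: " ".join(s.split()) collapses any run of spaces and newlines, and re.split with the pattern in a capturing group keeps the matches in the output instead of discarding them. A sketch using the example string above:

```python
import re

s = ("This is REGEX_MATCH    while between another \n \n "
     "REGEX_MATCH there is some    unwanted data")

# str.split() with no argument splits on any whitespace run,
# so joining with single spaces normalizes spaces and newlines at once.
cleaned = " ".join(s.split())

# Wrapping the pattern in a capturing group makes re.split keep
# the delimiter, so no match text is lost.
parts = re.split(r"(REGEX_MATCH)", cleaned)
print(parts)
# ['This is ', 'REGEX_MATCH', ' while between another ',
#  'REGEX_MATCH', ' there is some unwanted data']
```

For a file too large to hold in memory at once, the same cleanup can be applied line by line while streaming over the open file handle instead of calling read().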

Any ideas?

Thanks!

EDIT: Here is what my source text looks like. It's a mess, as you can see (it was extracted from a PDF). Each line is a row in the pandas DataFrame. The question is whether it is possible to extract all the data between the lines containing a series of numbers (including those lines themselves).

13 - 0005761-52.2014.4.02.5101                 Lorem ipsum dolor sit amet.
Quisque eget velit a orci consectetur pharetra. Aliquam.
\n
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
a
Lorem ipsum dolor sit amet.
        Lorem ipsum dolor sit amet - Sed ut tempus neque.
Sed ut tempus neque.
2 - 0117333-76.2015.4.02.5101 Lorem ipsum dolor sit amet
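Since the record boundaries in the sample above are the numbered case lines, one approach is a single pass that starts a new record whenever a line matches that header shape. The pattern below is a guess at the format shown ("n - NNNNNNN-NN.NNNN.N.NN.NNNN ..."), so it would need adjusting to the real data:

```python
import re

# A few lines from the sample above.
lines = [
    "13 - 0005761-52.2014.4.02.5101 Lorem ipsum dolor sit amet.",
    "Quisque eget velit a orci consectetur pharetra.",
    "",
    "2 - 0117333-76.2015.4.02.5101 Lorem ipsum dolor sit amet",
]

# Hypothetical pattern for the numbered header lines; adjust to the
# actual numbering scheme in the source PDF.
header = re.compile(r"^\s*\d+\s*-\s*\d{7}-\d{2}\.\d{4}\.\d\.\d{2}\.\d{4}")

records, current = [], []
for line in lines:
    if header.match(line):
        if current:                # close the previous record
            records.append(current)
        current = [line]           # start a new one at the header line
    elif current:
        current.append(line)       # body line belongs to the open record
if current:
    records.append(current)

print(len(records))  # 2
```

This streams line by line, so it works the same whether the lines come from a DataFrame column or directly from iterating over the open file, which sidesteps the large-file concern.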

You could read it all into a DataFrame using | as the delimiter and select the rows that don't contain the match:

import pandas as pd

df = pd.read_csv('test.txt', header=None, delimiter='|') 
df = df[df[2].str.contains('MATCH') == False]  # check column 2 from the example

Alternatively, you could find the line numbers you want to ignore and pass them via the skiprows argument of pandas.read_csv:

with open('test.txt') as f:
    lines = f.readlines()

skiprows = [i for i, line in enumerate(lines) if 'MATCH' in line]
df = pd.read_csv('test.txt', skiprows=skiprows, header=None, delimiter='|')

To drop unwanted or empty columns by column number:

df = df.drop(df.columns[[0, 1, 3]], axis=1)

To clean extra whitespace from all the values in column 2:

df[2] = [' '.join(x.split()) for x in df[2]]  

Or to clean the whitespace across the entire DataFrame:

cleaner = lambda x: ' '.join(x.split()) if isinstance(x, str) else x
df = df.applymap(cleaner)
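One caveat worth noting: DataFrame.applymap was deprecated in pandas 2.1 in favour of the element-wise DataFrame.map. A version-tolerant variant of the same cleanup:

```python
import pandas as pd

# Small frame with messy string values and one non-string column.
df = pd.DataFrame({0: ["  a   b ", "c\nd"], 1: [1, 2]})
cleaner = lambda x: " ".join(x.split()) if isinstance(x, str) else x

# DataFrame.map is the element-wise replacement for applymap on
# pandas >= 2.1; fall back to applymap on older versions.
if hasattr(df, "map"):
    df = df.map(cleaner)
else:
    df = df.applymap(cleaner)

print(df[0].tolist())  # ['a b', 'c d']
```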
