
Check whether text file of values follows convention using regex

Good day all!

I'm writing a python script to parse text files containing exactly 2 columns of integers, space or tab separated, something like this example:

3141 5926
535 89
79 32

11 2
1 4

I want to be able to reject a file from the get-go if it doesn't follow this convention (e.g. only 1 value, or 3 or more values, on one line; letters; and so on).

So far I came up with ^\d+[ \t]+\d+$ which is arguably not much (I tried different approaches to no avail, I'm not super familiar with regex unfortunately). I was thinking of writing an expression that will either return a match or none in case the file doesn't follow convention.

My questions are:

  1. Is regex even the right tool or are conventional methods of reading the file and manipulating strings better?
  2. Where do I go from here? Is my approach of all-or-nothing even worth it?
  3. Is there a way to not only match the whole text file but also be able to extract the last paragraph?

I'm working in Python 3 using re.

Any pointers are appreciated!

You could use pandas to load the text files into a dataframe with read_csv and then check whether all values are integers and whether the number of columns is 2:

import pandas as pd
from glob import glob

files = glob('/path/to/files/*.txt')  # get a list of all txt files

for path in files:
    try:
        # sep=r'[ \t]+' accepts runs of spaces and/or tabs; blank lines are skipped by default
        df = pd.read_csv(path, sep=r'[ \t]+', engine='python', header=None)
    except (pd.errors.ParserError, pd.errors.EmptyDataError):
        continue  # a row has too many fields or the file is empty: reject the file
    # check that all values are integers and that the number of columns is 2
    if (df.dtypes == 'int64').all() and len(df.columns) == 2:
        pass  # do something here

How to read only the last paragraph depends on how the files are structured. If the second paragraph always starts at the 4th row, you could access it with df[3:] (read_csv skips blank lines by default, so the blank separator line does not count as a row). If there is no such pattern, you could extract the second paragraph like this:

with open('filename.txt') as file:
    data = [[int(x) for x in i.strip().split()] for i in file.readlines()] #create list of lists of items in rows
    data = data[data.index([])+1:] #slice list after the empty row
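If you would rather avoid pandas altogether, the validation and the extraction can be combined in one pass over the lines. The following is only a sketch: the function name last_paragraph is made up for illustration, it treats any run of whitespace as a separator (slightly looser than "space or tab"), it tolerates several consecutive blank lines, and it returns None for an empty file as well as for a malformed one.

```python
def last_paragraph(text):
    """Return the last block of integer pairs, or None if the text is malformed or empty."""
    blocks = [[]]
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            # Blank line: open a new block unless the current one is still empty.
            if blocks[-1]:
                blocks.append([])
            continue
        # Reject anything that is not exactly two non-negative integers.
        if len(fields) != 2 or not all(f.isdecimal() for f in fields):
            return None
        blocks[-1].append((int(fields[0]), int(fields[1])))
    # Drop the empty block left behind by trailing blank lines (or an empty file).
    if not blocks[-1]:
        blocks.pop()
    return blocks[-1] if blocks else None


sample = "3141 5926\n535 89\n79 32\n\n11 2\n1 4\n"
print(last_paragraph(sample))      # the pairs of the last block
print(last_paragraph("1 2 3\n"))   # None, because the line has three values
```

Reading the file first (for example with open(path).read()) and passing the string in keeps the function easy to test.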

I managed to do what I initially set out to do.

I used the following pattern:

\n*(?:(?:[ \t]*\d+[ \t]+\d+[ \t]*\n)*(?:[ \t]*\d+[ \t]+\d+[ \t]*)\n\n)*(?P<last>(?:[ \t]*\d+[ \t]+\d+[ \t]*\n)*(?:[ \t]*\d+[ \t]+\d+[ \t\n]*))

which matches a text containing pairs of integers (one pair per line) separated by either spaces or tabs. The pattern also allows for a single blank line between groups of pairs (like in my example). It also captures the last bunch of pairs in a group named last.

Used with re.match(), this pattern also matches a partially compliant file, because re.match() only requires a match at the start of the string, which I don't want. The trick was to use re.fullmatch() instead of re.match(): it returns None unless the whole string matches.
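To illustrate the difference, here is a small sketch using the pattern above. The sample strings good and bad and the variable names are made up for the example; the pattern itself is unchanged, only split across two string literals for readability.

```python
import re

PATTERN = re.compile(
    r"\n*(?:(?:[ \t]*\d+[ \t]+\d+[ \t]*\n)*(?:[ \t]*\d+[ \t]+\d+[ \t]*)\n\n)*"
    r"(?P<last>(?:[ \t]*\d+[ \t]+\d+[ \t]*\n)*(?:[ \t]*\d+[ \t]+\d+[ \t\n]*))"
)

good = "3141 5926\n535 89\n79 32\n\n11 2\n1 4\n"
bad = "3141 5926\n535 89 7\n"  # second line has three values

# match() accepts the compliant prefix of the bad text; fullmatch() does not.
partial = PATTERN.match(bad)
strict = PATTERN.fullmatch(bad)
print(partial is not None, strict is None)

# For a compliant text, the named group holds the last bunch of pairs.
m = PATTERN.fullmatch(good)
last_pairs = [
    tuple(int(v) for v in line.split())
    for line in m.group("last").splitlines()
    if line.strip()  # skip trailing blank lines captured by the group
]
print(last_pairs)
```

In a real script you would check that the result of fullmatch() is not None before calling group() on it.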

The pattern above does the following:

  • \n* matches leading line-breaks.
  • (?:(?:[ \t]*\d+[ \t]+\d+[ \t]*\n)*(?:[ \t]*\d+[ \t]+\d+[ \t]*)\n\n)* matches every bunch of pairs except the last one (hence the * at the end):
    • (?:[ \t]*\d+[ \t]+\d+[ \t]*\n)* matches zero or more lines that each hold one pair of integers, allowing leading and trailing spaces/tabs.
    • (?:[ \t]*\d+[ \t]+\d+[ \t]*)\n\n matches exactly once, with the last line of a bunch of pairs followed by the blank line that separates it from the next bunch. This effectively serves as a base case for the repeating pattern above.
  • (?P<last>(?:[ \t]*\d+[ \t]+\d+[ \t]*\n)*(?:[ \t]*\d+[ \t]+\d+[ \t\n]*)) follows the same rationale: its final sub-group serves as a base case for the repeating non-capturing group before it, and the whole bunch is captured under the name last.
  • The [ \t\n]* at the very end of that group matches any number of trailing line-breaks (and spaces/tabs), so trailing blank lines are allowed and end up inside last.
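The structure described above is easier to see when the pattern is assembled from named pieces. This is just a sketch: the names PAIR, BLOCK, LAST and FULL are made up for the illustration, and the assembled string is meant to be identical to the pattern given earlier.

```python
import re

# One line holding a pair of integers, without its line-break.
PAIR = r"[ \t]*\d+[ \t]+\d+[ \t]*"
# A bunch of pairs: zero or more complete lines, then the final line (the base case).
BLOCK = rf"(?:{PAIR}\n)*(?:{PAIR})"
# The last bunch, captured by name; its final line also absorbs trailing whitespace.
LAST = rf"(?P<last>(?:{PAIR}\n)*(?:[ \t]*\d+[ \t]+\d+[ \t\n]*))"
# Leading line-breaks, then any number of bunches each followed by a blank line, then the last bunch.
FULL = rf"\n*(?:{BLOCK}\n\n)*{LAST}"

text = "3141 5926\n535 89\n79 32\n\n11 2\n1 4\n"
result = re.fullmatch(FULL, text)
print(repr(result.group("last")))
```

Building the expression this way changes nothing about what it matches; it only makes each part nameable.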


 