
Read and select specific rows from text file regex Python

I have a large number of text files to read from in Python. Each file is structured like the following sample:

------------------------------------------------------------------------------
\\
Paper: some_integer
From: <some_email_address>
Date: Wed, 4 Apr 2001 12:08:13 GMT   (27kb)
Date (revised v2): Tue, 8 May 2001 10:39:33 GMT   (27kb)

Title: some_title 
Authors: name_1, name_2
Comments: 28 pages, JHEP latex
\\
  blablabla (this is a multiline abstract of the paper)
  blablabla
  blablabla
\\

I would like to automatically extract and store (e.g. as a list) the Title, Authors, and abstract (the text between the second and third \\ separators; note that it starts with an indent) from each text file. Also note that the blank line between Date (revised v2) and Title is really there (it is not a typo that I introduced).

My attempts so far have involved (I am showing the steps for a single text file, say the first file in the list):

filename = os.listdir(path)[0]
test = pd.read_csv(filename, header=None, delimiter="\t")

Which gives me:

                                                0
0   ----------------------------------------------...
1                                                  \\
2                                 Paper: some_integer
3                          From: <some_email_address>
4         Date: Wed, 4 Apr 2001 12:08:13 GMT   (27kb)
5    Date (revised v2): Tue, 8 May 2001 10:39:33 G...
6                                Title: some_title...
7                             Authors: name_1, name_2
8                      Comments: 28 pages, JHEP latex
9                          Report-no: DUKE-CGTP-00-01
10                                                 \\
11                                          blabla...
12                                          blabla...
13                                          blabla...
14                                                 \\

I can then select a given row (eg, the one featuring the title) with:

test[test[0].str.contains("Title")].to_string()

But the result is truncated, and it is not a clean string (the row index and other attributes show up). I find this entire pandas-based approach quite tedious. There must be an easier way to select the rows of interest directly from the text file using regex. At least I hope so...
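For reference, I can pull the untruncated cell value out with iloc (a minimal sketch; the DataFrame below just mimics what read_csv gives me, and it assumes exactly one row contains "Title"), but I would still prefer a regex-based approach:

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame produced by read_csv
test = pd.DataFrame({0: ["Paper: some_integer",
                         "Title: some_title",
                         "Authors: name_1, name_2"]})

# iloc returns the full cell value, bypassing pandas' display truncation
title_row = test[test[0].str.contains("Title")]
title = title_row.iloc[0, 0].split(":", 1)[1].strip()
print(title)  # -> some_title
```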

How about iterating over each line in the file, splitting on the first ": " if it is present in the line, and collecting the results of the split in a dictionary:

with open("input.txt") as f:
    data = dict(line.strip().split(": ", 1) for line in f if ": " in line)

As a result, the data would contain:

{
    'Comments': '28 pages, JHEP latex', 
    'Paper': 'some_integer', 
    'From': '<some_email_address>', 
    'Date (revised v2)': 'Tue, 8 May 2001 10:39:33 GMT   (27kb)', 
    'Title': 'some_title', 
    'Date': 'Wed, 4 Apr 2001 12:08:13 GMT   (27kb)', 
    'Authors': 'name_1, name_2'
}
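This one-liner skips the abstract, since it has no "key: value" form. A hedged extension (assuming the standalone \\ separator lines are always present, as in the sample) splits the raw text on those separators first, so the abstract falls out as the third chunk:

```python
import re

# Hypothetical sample mirroring the question's file layout
text = """------------------------------------------------------------------------------
\\\\
Paper: some_integer
Title: some_title
Authors: name_1, name_2
\\\\
  blablabla
  blablabla
\\\\
"""

# Split on lines consisting of exactly the "\\" separator
chunks = re.split(r"^\\\\$", text, flags=re.MULTILINE)

# chunks[1] holds the "key: value" header lines, chunks[2] the abstract
data = dict(line.strip().split(": ", 1)
            for line in chunks[1].splitlines() if ": " in line)
data["abstract"] = " ".join(chunks[2].split())
print(data)
```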

If your files really always have the same structure, you could come up with:

# -*- coding: utf-8 -*-
import re

string = """
------------------------------------------------------------------------------
\\
Paper: some_integer
From: <some_email_address>
Date: Wed, 4 Apr 2001 12:08:13 GMT   (27kb)
Date (revised v2): Tue, 8 May 2001 10:39:33 GMT   (27kb)

Title: some_title 
Authors: name_1, name_2
Comments: 28 pages, JHEP latex
\\
  blablabla (this is the abstract of the paper)
\\
"""

rx = re.compile(r"""
    ^Title:\s(?P<title>.+)[\n\r]        # Title at the beginning of a line
    Authors:\s(?P<authors>.+)[\n\r]     # Authors: ...
    Comments:\s(?P<comments>.+)[\n\r]   # ... and so on ...
    .*[\n\r]
    (?P<abstract>.+)""", 
    re.MULTILINE|re.VERBOSE)            # so that the caret matches any line
                                        # + verbose for this explanation

for match in rx.finditer(string):
    print(match.group('title'), match.group('authors'), match.group('abstract'))
    # some_title  name_1, name_2   blablabla (this is the abstract of the paper)

This approach takes Title as the anchor (at the beginning of a line) and consumes the lines after it. The named groups are not strictly necessary, but they make the code easier to understand. The pattern [\n\r] matches a newline character.
See a demo on regex101.com.
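The same compiled pattern scales to a whole directory. A sketch (the papers/ path is hypothetical) that collects one tuple per matched file:

```python
import glob
import re

rx = re.compile(r"""
    ^Title:\s(?P<title>.+)[\n\r]
    Authors:\s(?P<authors>.+)[\n\r]
    Comments:\s(?P<comments>.+)[\n\r]
    .*[\n\r]
    (?P<abstract>.+)""",
    re.MULTILINE | re.VERBOSE)

results = []
for path in glob.glob("papers/*.txt"):  # hypothetical directory
    with open(path) as f:
        for m in rx.finditer(f.read()):
            results.append((m.group("title").strip(),
                            m.group("authors"),
                            m.group("abstract")))
print(results)
```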

You could also process the file line by line:

import re

data = {}
temp_s = ''
with open('myfile.txt', 'r') as infile:
    for line in infile:
        if ":" in line:
            # split only on the first colon, so values may contain colons
            key, value = line.split(':', 1)
            data[key] = value.strip()
        elif re.search(r'\w', line):
            # first line of the abstract
            temp_s = re.search(r'(\w.*)', line).group(1)
            # keep consuming lines until a line with no word characters
            for line in infile:
                match = re.search(r'(\w.*)', line)
                if not match:
                    break
                temp_s += ' ' + match.group(1)
            data['abstract'] = temp_s

This pattern will get you started:

\\[^\\].*[^\\]+Title:\s+(\S+)\s+Authors:\s+(.*)[^\\]+\\+\s+([^\\]*)\n\\

Assume 'txtfile.txt' has the format shown at the top:

import re
with open('txtfile.txt', 'r') as f:
    input_string = f.read()
p = r'\\[^\\].*[^\\]+Title:\s+(\S+)\s+Authors:\s+(.*)[^\\]+\\+\s+([^\\]*)\n\\'
print(re.findall(p, input_string))

Output:

[('some_title', 'name_1, name_2', 'blablabla (this is a multiline abstract of the paper)\n  blablabla\n  blablabla')]
