I have a large number of text files to read from in Python. Each file is structured as the following sample:
------------------------------------------------------------------------------
\\
Paper: some_integer
From: <some_email_address>
Date: Wed, 4 Apr 2001 12:08:13 GMT (27kb)
Date (revised v2): Tue, 8 May 2001 10:39:33 GMT (27kb)
Title: some_title
Authors: name_1, name_2
Comments: 28 pages, JHEP latex
\\
blablabla (this is a multiline abstract of the paper)
blablabla
blablabla
\\
I would like to automatically extract and store (eg, as a list) the Title
, Authors
, and abstract (the text between the second and third \\\\
- note that it starts with an indent) from each text file. Also note that the white line between Date (revised)
and Title
is really there (it is not a typo that I introduced).
My attempts so far have involved (I am showing the steps for a single text file, say the first file in the list):
filename = os.listdir(path)[0]
test = pd.read_csv(filename, header=None, delimiter="\t")
Which gives me:
0
0 ----------------------------------------------...
1 \\
2 Paper: some_integer
3 From: <some_email_address>
4 Date: Wed, 4 Apr 2001 12:08:13 GMT (27kb)
5 Date (revised v2): Tue, 8 May 2001 10:39:33 G...
6 Title: some_title...
7 Authors: name_1, name_2
8 Comments: 28 pages, JHEP latex
9 Report-no: DUKE-CGTP-00-01
10 \\
11 blabla...
12 blabla...
13 blabla...
14 \\
I can then select a given row (eg, the one featuring the title) with:
test[test[0].str.contains("Title")].to_string()
But it is truncated, it is not a clean string (some attributes show up) and I find this entire pandas-based approach quite tedious actually... There must be an easier way to directly select the rows of interest from the text file using regex. At least I hope so...
How about iterating over each line in the file and split by the first :
if it is present in line, collecting the result of the split in a dictionary:
with open("input.txt") as f:
data = dict(line.strip().split(": ", 1) for line in f if ": " in line)
As a result, the data
would contain:
{
'Comments': '28 pages, JHEP latex',
'Paper': 'some_integer',
'From': '<some_email_address>',
'Date (revised v2)': 'Tue, 8 May 2001 10:39:33 GMT (27kb)',
'Title': 'some_title',
'Date': 'Wed, 4 Apr 2001 12:08:13 GMT (27kb)',
'Authors': 'name_1, name_2'
}
If your files really always have the same structure, you could come up with:
# -*- coding: utf-8> -*-
import re
string = """
------------------------------------------------------------------------------
\\
Paper: some_integer
From: <some_email_address>
Date: Wed, 4 Apr 2001 12:08:13 GMT (27kb)
Date (revised v2): Tue, 8 May 2001 10:39:33 GMT (27kb)
Title: some_title
Authors: name_1, name_2
Comments: 28 pages, JHEP latex
\\
blablabla (this is the abstract of the paper)
\\
"""
rx = re.compile(r"""
^Title:\s(?P<title>.+)[\n\r] # Title at the beginning of a line
Authors:\s(?P<authors>.+)[\n\r] # Authors: ...
Comments:\s(?P<comments>.+)[\n\r] # ... and so on ...
.*[\n\r]
(?P<abstract>.+)""",
re.MULTILINE|re.VERBOSE) # so that the caret matches any line
# + verbose for this explanation
for match in rx.finditer(string):
print match.group('title'), match.group('authors'), match.group('abstract')
# some_title name_1, name_2 blablabla (this is the abstract of the paper)
This approach takes Title
as the anchor (beginning of a line) and skims the text afterwards. The named groups may not really be necessary but make the code easier to understand. The pattern [\\n\\r]
looks for newline characters.
See a demo on regex101.com .
you could process line by line.
import re
data = {}
temp_s = match = ''
with open('myfile.txt', 'r') as infile:
for line in infile:
if ":" in line:
line = line.split(':')
data[line[0]] = line[1]
elif re.search(r'.*\w+', line):
match = re.search(r'(\w.*)', line)
match = match.group(1)
temp_s += match
while 1:
line = infile.next()
if re.search(r'.*\w+', line):
match = re.search(r'(\w.*)', line)
temp_s += match.group(1)
else:
break
data['abstract'] = temp_s
This pattern will get you started:
\\\\[^\\\\].*[^\\\\]+Title:\\s+(\\S+)\\s+Authors:\\s+(.*)[^\\\\]+\\\\+\\s+([^\\\\]*)\\n\\\\
Assume 'txtfile.txt' is of the format as shown on the top. If using python 2.7x:
import re
with open('txtfile.txt', 'r') as f:
input_string = f.read()
p = r'\\[^\\].*[^\\]+Title:\s+(\S+)\s+Authors:\s+(.*)[^\\]+\\+\s+([^\\]*)\n\\'
print re.findall(p, input_string)
Output:
[('some_title', 'name_1, name_2', 'blablabla (this is a multiline abstract of the paper)\n blablabla\n blablabla')]
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.