Python: parsing texts in a .txt file

Question

I have a text file like this.

1       firm A         Manhattan (company name)     25,000 
                       SK Ventures                  25,000
                       AEA investors                10,000 
2       firm B         Tencent collaboration        16,000 
                       id TechVentures              4,000 
3       firm C         xxx                          625 
(and so on)

I want to make a matrix form and put each item into the matrix. For example, the first row of matrix would be like:

[[1,Firm A,Manhattan,25,000],['','',SK Ventures,25,000],['','',AEA investors,10,000]]

or,

[[1,'',''],[Firm A,'',''],[Manhattan,SK Ventures,AEA Investors],[25,000,25,000,10,000]]

To do this, I want to parse the text on each line of the file. For example, from the first line I would like to create [1,firm A, Manhattan, 25,000]. However, I can't figure out exactly how to do it. Every field starts at the same position, but ends at a different position. Is there a good way to do this?

Thank you.

Answer 1

From what you've given as data*, the layout differs depending on whether a line starts with a number or a space, and the data can be separated as

(numbers)(spaces)(letters with 1 space)(spaces)(letters with 1 space)(spaces)(numbers+commas)

or

(spaces)(letters with 1 space)(spaces)(numbers+commas)

That's what the two regexes below look for, and they build a dictionary with indexes from the leading numbers, each having a firm name and a list of company and value pairs.

I can't really tell what your matrix arrangement is.

import pprint
import re

data = {}
with open('data.txt') as f:
    for line in f:
        if re.match(r'^\d', line):
            # Line starts a new group: index, firm, company, value
            matches = re.findall(r'^(\d+)\s+((\S\s|\s\S|\S)+)\s\s+((\S\s|\s\S|\S)+)\s\s+([0-9,]+)', line)
            idx, firm, x, company, y, value = matches[0]
            data[idx] = {}
            data[idx]['firm'] = firm.strip()
            data[idx]['company'] = [(company.strip(), value)]
        elif line.strip():
            # Continuation line: one more (company, value) pair for the current idx
            matches = re.findall(r'\s+((\S\s|\s\S|\S)+)\s\s+([0-9,]+)', line)
            company, x, value = matches[0]
            data[idx]['company'].append((company.strip(), value))

pprint.pprint(data)

->

{'1': {'company': [('Manhattan (company name)', '25,000'),
                   ('SK Ventures', '25,000'),
                   ('AEA investors', '10,000')],
       'firm': 'firm A'},

 '2': {'company': [('Tencent collaboration', '16,000'),
                   ('id TechVentures', '4,000')],
       'firm': 'firm B'},

 '3': {'company': [('xxx', '625')], 
       'firm': 'firm C'}
}

* This works on your example, but it may not work on your real data very well. YMMV.
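
If the row layout from the question is what you need in the end, a dictionary of this shape can be flattened back into it. This is only a minimal sketch: `to_rows` is a hypothetical helper name, the `sample` dictionary is copied from the output shown above, and converting the values to `int` is my assumption about what would be useful.

```python
def to_rows(data):
    """Flatten {idx: {'firm': ..., 'company': [(name, value), ...]}} into rows."""
    rows = []
    # Sort by the numeric value of the index so '10' comes after '9'.
    for idx in sorted(data, key=int):
        entry = data[idx]
        for n, (company, value) in enumerate(entry['company']):
            # Only the first row of each group carries the index and firm name.
            lead = [idx, entry['firm']] if n == 0 else ['', '']
            # '25,000' -> 25000 (assumes commas are thousands separators only).
            rows.append(lead + [company, int(value.replace(',', ''))])
    return rows


# Sample input copied from the output shown above.
sample = {
    '1': {'firm': 'firm A',
          'company': [('Manhattan (company name)', '25,000'),
                      ('SK Ventures', '25,000'),
                      ('AEA investors', '10,000')]},
    '3': {'firm': 'firm C', 'company': [('xxx', '625')]},
}

for row in to_rows(sample):
    print(row)
```

Continuation rows get empty strings in the first two columns, which mirrors the first matrix form in the question.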

Answer 2

Well if you know all of the start positions:

# 0123456789012345678901234567890123456789012345678901234567890
# 1       firm A         Manhattan (company name)     25,000 
#                        SK Ventures                  25,000
#                        AEA investors                10,000 
# 2       firm B         Tencent collaboration        16,000 
#                        id TechVentures              4,000 
# 3       firm C         xxx                          625 
# Field #1 is 8 wide (0 -> 7)
# Field #2 is 15 wide (8 -> 22)
# Field #3 is 29 wide (23 -> 51)
# Field #4 is arbitrarily wide (52 -> end of line)
field_lengths = [8, 15, 29]
data = []
with open('/path/to/file', 'r') as f:
    for row in f:
        # Strip only the right side: the leading spaces carry the column positions
        row = row.rstrip()
        if not row:
            continue
        pieces = []
        for x in field_lengths:
            pieces.append(row[:x].strip())
            row = row[x:]
        pieces.append(row.strip())
        data.append(pieces)
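
The same fixed-position idea can be written with slice objects, which keeps the column boundaries in one place. This is a rough sketch rather than a drop-in solution: `COLUMNS` and `parse_fixed_width` are names I made up, and the offsets (last column starting at position 52) are measured from the sample lines, so they are an assumption about the real file.

```python
# Column boundaries as slice objects; the offsets are measured from the
# sample lines and are an assumption about the real file.
COLUMNS = [slice(0, 8), slice(8, 23), slice(23, 52), slice(52, None)]


def parse_fixed_width(lines, columns=COLUMNS):
    """Split each non-blank line at fixed offsets and strip the padding."""
    rows = []
    for line in lines:
        line = line.rstrip()  # keep leading spaces: they hold the positions
        if line:
            rows.append([line[col].strip() for col in columns])
    return rows


# Build aligned sample lines with format() so the spacing is exact.
fmt = '{:<8}{:<15}{:<29}{}'
sample = [
    fmt.format('1', 'firm A', 'Manhattan (company name)', '25,000'),
    fmt.format('', '', 'SK Ventures', '25,000'),
    fmt.format('2', 'firm B', 'Tencent collaboration', '16,000'),
]

for row in parse_fixed_width(sample):
    print(row)
```

With a real file, passing the open file object as `lines` should work the same way, since iterating over a file yields one line at a time.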

Answer 3

If I understand you correctly (although I'm not totally sure I do), this will produce the output I think you're looking for.

import re

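with open('data.txt', 'r') as f:
    f_txt = f.read().rstrip() # Change file object to text, dropping the trailing newline
    f_lines = re.split(r'\n(?=\d)', f_txt) # One chunk per numbered group
    matrix = []
    for line in f_lines:
        inner1 = line.split('\n')
        # rstrip() keeps trailing spaces out of the last field
        inner2 = [re.split(r'\s{2,}', l.rstrip()) for l in inner1]
        matrix.append(inner2)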

print(matrix)
print('')
for row in matrix:
    print(row)

Output of the program:

[[['1', 'firm A', 'Manhattan (company name)', '25,000'], ['', 'SK Ventures', '25,000'], ['', 'AEA investors', '10,000']], [['2', 'firm B', 'Tencent collaboration', '16,000'], ['', 'id TechVentures', '4,000']], [['3', 'firm C', 'xxx', '625']]]

[['1', 'firm A', 'Manhattan (company name)', '25,000'], ['', 'SK Ventures', '25,000'], ['', 'AEA investors', '10,000']]
[['2', 'firm B', 'Tencent collaboration', '16,000'], ['', 'id TechVentures', '4,000']]
[['3', 'firm C', 'xxx', '625']]

I am basing this on the fact that you wanted the first row of your matrix to be: [[1,Firm A,Manhattan,25,000],['',SK Ventures,25,000],['',AEA investors,10,000]]

However, to achieve this with more rows, we end up with a list that is nested 3 levels deep, as the output of print(matrix) shows. This can be a little unwieldy to use, which is why TessellatingHeckler's answer uses a dictionary to store the data. I think that is a much better way to access what you need. But if a list of lists of "matrices" is what you're after, then the code I wrote above does that.
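
If you would rather have the second, column-wise form from the question, each group can be transposed with zip. A minimal sketch, assuming the groups look like the output above (continuation rows one element shorter than the first row); `to_columns` is a hypothetical helper name.

```python
def to_columns(block):
    """Turn one group of rows into the column-wise form from the question."""
    width = max(len(row) for row in block)
    # Continuation rows are shorter; left-pad them so the columns line up.
    padded = [[''] * (width - len(row)) + row for row in block]
    # zip(*rows) transposes rows into columns.
    return [list(col) for col in zip(*padded)]


# First group from the output shown above.
block = [['1', 'firm A', 'Manhattan (company name)', '25,000'],
         ['', 'SK Ventures', '25,000'],
         ['', 'AEA investors', '10,000']]

for column in to_columns(block):
    print(column)
```

Applying it to every group would be something like [to_columns(b) for b in matrix].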

3 answers

solution1
0 2016-05-03 03:16:45

solution2
0 ACCEPTED 2016-05-03 03:28:00

solution3
0 2016-05-03 10:21:57
