
Python: parsing texts in a .txt file

I have a text file like this.

1       firm A         Manhattan (company name)     25,000 
                       SK Ventures                  25,000
                       AEA investors                10,000 
2       firm B         Tencent collaboration        16,000 
                       id TechVentures              4,000 
3       firm C         xxx                          625 
(and so on) 

I want to make a matrix form and put each item into the matrix. For example, the first row of matrix would be like:

[[1,Firm A,Manhattan,25,000],['','',SK Ventures,25,000],['','',AEA investors,10,000]]

or,

[[1,'',''],[Firm A,'',''],[Manhattan,SK Ventures,AEA Investors],[25,000,25,000,10,000]]

To do so, I want to parse the text on each line of the file. For example, from the first line, I can create [1,firm A, Manhattan, 25,000]. However, I can't figure out how exactly to do it. Every field starts at the same position, but ends at a different position. Is there any good way to do this?

Thank you.

From what you've given as data*, the format of a line depends on whether it starts with a number or a space, and the data can be separated as

(numbers)(spaces)(letters with 1 space)(spaces)(letters with 1 space)(spaces)(numbers+commas)

or

(spaces)(letters with 1 space)(spaces)(numbers+commas)

That's what the two regexes below look for, and they build a dictionary with indexes from the leading numbers, each having a firm name and a list of company and value pairs.

I can't really tell what your matrix arrangement is.

import pprint
import re

data = {}
with open('data.txt') as f:
    for line in f:
        if re.match(r'\d', line):
            # Numbered line: index, firm name, company, value
            matches = re.findall(r'^(\d+)\s+((\S\s|\s\S|\S)+)\s\s+((\S\s|\s\S|\S)+)\s\s+([0-9,]+)', line)
            # x and y capture the inner repetition groups and are not used
            idx, firm, x, company, y, value = matches[0]
            data[idx] = {}
            data[idx]['firm'] = firm.strip()
            data[idx]['company'] = [(company.strip(), value)]
        else:
            # Continuation line: another company and value for the current index
            matches = re.findall(r'\s+((\S\s|\s\S|\S)+)\s\s+([0-9,]+)', line)
            company, x, value = matches[0]
            data[idx]['company'].append((company.strip(), value))

pprint.pprint(data)

->

{'1': {'company': [('Manhattan (company name)', '25,000'),
                   ('SK Ventures', '25,000'),
                   ('AEA investors', '10,000')],
       'firm': 'firm A'},

 '2': {'company': [('Tencent collaboration', '16,000'),
                   ('id TechVentures', '4,000')],
       'firm': 'firm B'},

 '3': {'company': [('xxx', '625')], 
       'firm': 'firm C'}
}

* This works on your example, but it may not work on your real data very well. YMMV.
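
To get from this dictionary to the first layout in the question, one option is to flatten each entry into rows, leaving the index and firm blank on continuation rows. This is only a minimal sketch, assuming `data` has the shape printed above; `to_rows` is a hypothetical helper name, not part of the code in this answer.

```python
# Sample input copied from the pprint output above.
data = {
    '1': {'firm': 'firm A',
          'company': [('Manhattan (company name)', '25,000'),
                      ('SK Ventures', '25,000'),
                      ('AEA investors', '10,000')]},
    '2': {'firm': 'firm B',
          'company': [('Tencent collaboration', '16,000'),
                      ('id TechVentures', '4,000')]},
    '3': {'firm': 'firm C', 'company': [('xxx', '625')]},
}


def to_rows(entry_idx, entry):
    """Flatten one entry into [idx, firm, company, value] rows (hypothetical helper)."""
    rows = []
    for n, (company, value) in enumerate(entry['company']):
        if n == 0:
            # The first row carries the index and the firm name.
            rows.append([entry_idx, entry['firm'], company, value])
        else:
            # Continuation rows leave the index and firm blank.
            rows.append(['', '', company, value])
    return rows


matrix = [to_rows(idx, entry) for idx, entry in data.items()]
for block in matrix:
    print(block)
```

Each element of `matrix` is one firm's block of rows, so `matrix[0]` corresponds to the "first row of matrix" described in the question.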

If you know the start position of every field, you can slice each line at fixed widths:

# 0123456789012345678901234567890123456789012345678901234567890
# 1       firm A         Manhattan (company name)     25,000 
#                        SK Ventures                  25,000
#                        AEA investors                10,000 
# 2       firm B         Tencent collaboration        16,000 
#                        id TechVentures              4,000 
# 3       firm C         xxx                          625 
# Field #1 is 8 wide (0 -> 7)
# Field #2 is 15 wide (8 -> 22)
# Field #3 is 29 wide (23 -> 51)
# Field #4 is arbitrarily wide (52 -> end of line)
field_lengths = [8, 15, 29]
data = []
with open('/path/to/file', 'r') as f:
    for row in f:
        # Strip only the newline; leading spaces mark empty fields
        row = row.rstrip('\n')
        if not row.strip():
            continue  # skip blank lines
        pieces = []
        for x in field_lengths:
            pieces.append(row[:x].strip())
            row = row[x:]
        pieces.append(row.strip())
        data.append(pieces)
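
The same fixed-width idea can be tried without a file by precomputing slice boundaries from the column start positions. This is a sketch under the assumption that the columns start at offsets 0, 8, 23 and 52, as counted from the sample in the question; `starts`, `slices`, `make_line` and `split_fixed` are illustrative names, and the sample lines are rebuilt with `str.ljust` so their alignment is explicit.

```python
# Column start offsets counted from the question's sample (an assumption;
# real data may be aligned differently).
starts = [0, 8, 23, 52]

# Pair each start with the next one; the last field runs to the end of the line.
slices = [slice(a, b) for a, b in zip(starts, starts[1:] + [None])]


def make_line(idx, firm, company, value):
    """Build an aligned sample line using the assumed column widths."""
    return idx.ljust(8) + firm.ljust(15) + company.ljust(29) + value


def split_fixed(line):
    """Cut one line at the fixed column boundaries and trim each field."""
    return [line[s].strip() for s in slices]


sample = [
    make_line('1', 'firm A', 'Manhattan (company name)', '25,000'),
    make_line('', '', 'SK Ventures', '25,000'),
    make_line('2', 'firm B', 'Tencent collaboration', '16,000'),
]

rows = [split_fixed(line) for line in sample]
for row in rows:
    print(row)
```

Slicing by position keeps the empty leading fields on continuation lines, which splitting on whitespace would lose.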

If I understand you correctly (although I'm not totally sure I do), this will produce the output I think you're looking for.

import re

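with open('data.txt', 'r') as f:
    f_txt = f.read().strip()  # Read the whole file; drop trailing blank lines
    # Start a new block at every newline that is followed by a digit
    f_lines = re.split(r'\n(?=\d)', f_txt)
    matrix = []
    for line in f_lines:
        inner1 = line.split('\n')
        # rstrip() removes trailing spaces so the last field comes out clean
        inner2 = [re.split(r'\s{2,}', l.rstrip()) for l in inner1]
        matrix.append(inner2)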

print(matrix)
print('')
for row in matrix:
    print(row)

Output of the program:

[[['1', 'firm A', 'Manhattan (company name)', '25,000'], ['', 'SK Ventures', '25,000'], ['', 'AEA investors', '10,000']], [['2', 'firm B', 'Tencent collaboration', '16,000'], ['', 'id TechVentures', '4,000']], [['3', 'firm C', 'xxx', '625']]]

[['1', 'firm A', 'Manhattan (company name)', '25,000'], ['', 'SK Ventures', '25,000'], ['', 'AEA investors', '10,000']]
[['2', 'firm B', 'Tencent collaboration', '16,000'], ['', 'id TechVentures', '4,000']]
[['3', 'firm C', 'xxx', '625']]

I am basing this on the fact that you wanted the first row of your matrix to be: [[1,Firm A,Manhattan,25,000],['',SK Ventures,25,000],['',AEA investors,10,000]]

However, to achieve this with more rows, we end up with a list nested three levels deep, as the output of print(matrix) shows. This can be a little unwieldy to use, which is why TessellatingHeckler's answer uses a dictionary to store the data; I think that is a much better way to access what you need. But if a list of lists of "matrices" is what you're after, then the code I wrote above does that.
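
For the second layout in the question (one list per column), each block can be padded to four fields and then transposed with `zip`. This is a minimal sketch, assuming blocks shaped like the program output above, where continuation rows have only three fields; `to_columns` is a hypothetical helper, not part of the code in this answer.

```python
# One block copied from the program output above.
block = [
    ['1', 'firm A', 'Manhattan (company name)', '25,000'],
    ['', 'SK Ventures', '25,000'],
    ['', 'AEA investors', '10,000'],
]


def to_columns(rows, width=4):
    """Left-pad short rows with '' to `width` fields, then transpose."""
    padded = [[''] * (width - len(row)) + row for row in rows]
    # zip(*padded) groups the n-th field of every row into one tuple.
    return [list(col) for col in zip(*padded)]


columns = to_columns(block)
for col in columns:
    print(col)
```

Padding on the left assumes that the missing fields are always the leading ones (index and firm), which holds for the sample but may not hold for other data.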
