
How to fetch table-formatted data from a text file using Python

I have tabular data in a text file and I am trying to read it with Python, but I cannot find a delimiter between the columns. Please help me out. Thanks in advance.

Data could look like this:

Column1           Column2         Column3            Column4
----------------------------------------------------------------------------
apple fruits      banana fruits     orange fruits    grapes fruits
mango fruits      pineapple fruits                   blackberry fruits
                  blueberry fruits  currant fruits   papaya fruits
chico fruits                        peach fruits     pear fruits

My expected result is in dictionary format.

I'm working on the assumption that data is aligned at the same columns in each record.

I put the header line and a typical data line in two distinct variables; you are going to read them from a file instead.

>>> a = 'Column1           Column2             Column3             Column4'
>>> b = 'apple fruits      banana fruits       orange fruits       grapes fruits'

i is a list of indices into the header, initially empty, and inside records whether we are currently inside a column name.

>>> i = []
>>> inside = False

We count the characters and check if we are at the beginning of a column name

>>> for n, c in enumerate(a):
...     if c == ' ':
...         inside = False
...         continue
...     if not inside:
...         inside = True
...         i.append(n)
>>> i
[0, 18, 38, 58]

We now have the indices where the columns begin. In slice notation, the beginning of the next column is also the end of the current one. Only the end of the last column is missing, and for that we can use None, which in a slice means "up to the end of the string".

>>> [b[j:k].rstrip() for j, k in zip(i,i[1:]+[None])]
['apple fruits', 'banana fruits', 'orange fruits', 'grapes fruits']

Of course, you have to apply the same index trick to every data line in the input file.
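Since the question asks for dictionaries, the slicing step can be wrapped in a small helper that pairs each field with its column name. This is only a minimal sketch under the same alignment assumption, and it further assumes that column names contain no spaces. The names column_starts, parse_fixed_width and make_row are made up for illustration, and the sample lines are generated with str.ljust so that they match the column starts [0, 18, 38, 58] found above.

```python
def column_starts(header):
    """Return the index at which each column name starts in the header."""
    return [n for n, c in enumerate(header)
            if c != ' ' and (n == 0 or header[n - 1] == ' ')]

def parse_fixed_width(lines):
    """Yield one dict per data line, keyed by the header's column names."""
    lines = iter(lines)
    header = next(lines).rstrip('\n')
    starts = column_starts(header)
    # each column ends where the next one starts; None means "to end of line"
    ranges = list(zip(starts, starts[1:] + [None]))
    names = [header[j:k].strip() for j, k in ranges]
    for line in lines:
        line = line.rstrip('\n')
        # skip blank lines and the dashed separator line
        if not line.strip() or set(line.strip()) == {'-'}:
            continue
        yield {name: line[j:k].strip() for name, (j, k) in zip(names, ranges)}

def make_row(cells, widths=(18, 20, 20, 0)):
    """Build an illustrative fixed-width line (hypothetical sample data)."""
    return ''.join(cell.ljust(w) for cell, w in zip(cells, widths))

sample = [
    make_row(['Column1', 'Column2', 'Column3', 'Column4']),
    '-' * 76,
    make_row(['apple fruits', 'banana fruits', 'orange fruits', 'grapes fruits']),
    make_row(['mango fruits', 'pineapple fruits', '', 'blackberry fruits']),
]

records = list(parse_fixed_width(sample))
for record in records:
    print(record)
```

An open file can be passed in place of the sample list, since the helper strips the trailing newline of each line itself.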

PS: you may want to use the itertools.zip_longest function, which pads the shorter argument with None, as in

[... for j, k in itertools.zip_longest(i, i[1:])]

You may also want to cache the pairs in a list, so that the zip_longest iterator is not rebuilt for every data line

cached_indices = list(itertools.zip_longest(i, i[1:]))
for line in data:
    c1, c2, c3, c4 = [... for j, k in cached_indices]
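To make the caching idea concrete, here is a small self-contained sketch. It reuses the column starts [0, 18, 38, 58] found above; the helper name split_line and the two sample lines (built with str.ljust) are assumptions made purely for illustration.

```python
from itertools import zip_longest

starts = [0, 18, 38, 58]

# zip_longest pads the shorter argument with None, so the last pair is
# (58, None), and line[58:None] means "from 58 to the end of the line"
cached_indices = list(zip_longest(starts, starts[1:]))

def split_line(line):
    """Slice one data line at the cached column boundaries."""
    return [line[j:k].rstrip() for j, k in cached_indices]

# illustrative data lines, padded to the same column starts
widths = [18, 20, 20, 0]
rows = [['apple fruits', 'banana fruits', 'orange fruits', 'grapes fruits'],
        ['chico fruits', '', 'peach fruits', 'pear fruits']]
data = [''.join(cell.ljust(w) for cell, w in zip(row, widths)) for row in rows]

for line in data:
    c1, c2, c3, c4 = split_line(line)
    print(c1, c2, c3, c4, sep=' | ')
```

Because cached_indices is a list and not an iterator, it can be traversed once per line without being exhausted.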

I tried to implement what I had suggested in a comment below; here is my best effort...

$ cat fetch.py
from itertools import count  # this import is necessary
from io import StringIO      # this one is needed to simulate an open file

# Your data; notice that some fields in the last two lines are misaligned
data = '''\
Column1           Column2           Column3          Column4
----------------------------------------------------------------------------
apple fruits      banana fruits     orange fruits    grapes fruits
mango fruits      pineapple fruits                   blackberry fruits
                   blueberry fruits currant fruits   papaya fruits
chico fruits                        peach fruits    pear fruits
'''

f = StringIO(data) # you may have something like
                   # f = open('fruitfile.fixed')

# read the header line and skip a line                   
header = next(f).rstrip()
next(f) # skip a line

# a compact way of finding the starts of the columns
indices = [i for i, c0, c1 in zip(count(), ' '+header, header)
           if c0==' ' and c1!=' ']
# We are going to reuse zip(indices, indices[1:]+[None]), so we cache it
ranges = list(zip(indices, indices[1:]+[None]))

# we are ready for a loop on the lines of the file
for nl, line in enumerate(f, 3):
    if line == '\n': continue # don't process blank lines
    # extract the _raw_ fields from a line
    fields = [line[i:j] for i, j in ranges]
    # check that a non-all-blanks field does not start with a blank,
    # check that a field does not terminate with anything but a space
    # or a newline (the loop variable is named fld so that it does not
    # reuse the name of the file object f)
    if any((fld[0]==' ' and fld.rstrip()) or fld[-1] not in ' \n' for fld in fields):
        # signal the possibility of a misalignment
        print('Possible misalignment in line n.%d:'%nl)
        print('\t|'+header)
        print('\t|'+line.rstrip())
    # the else body is executed if all the fields are OK
    # what I do with the fields is just a possibility
    else:
        print('Data Line n.%d:'%nl)
        fields = [field.rstrip() for field in fields]
        for nf, field in enumerate(fields, 1):
            print('\tField n.%d:\t%r'%(nf, field))
$ python3 fetch.py 
Data Line n.3:
        Field n.1:      'apple fruits'
        Field n.2:      'banana fruits'
        Field n.3:      'orange fruits'
        Field n.4:      'grapes fruits'
Data Line n.4:
        Field n.1:      'mango fruits'
        Field n.2:      'pineapple fruits'
        Field n.3:      ''
        Field n.4:      'blackberry fruits'
Possible misalignment in line n.5:
        |Column1           Column2           Column3          Column4
        |                   blueberry fruits currant fruits   papaya fruits
Possible misalignment in line n.6:
        |Column1           Column2           Column3          Column4
        |chico fruits                        peach fruits    pear fruits
$ 
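One caveat about the check in fetch.py: a line that stops before a column's start position yields an empty field, and indexing position 0 of an empty string raises IndexError. A possible variant, sketched below, uses str.startswith and str.endswith, which are safe on empty strings, and accepts an empty field as simply missing. The helper names is_misaligned and line_is_suspicious are hypothetical, the column ranges are the ones I read off the header used in fetch.py, and lines are assumed to keep their trailing newline, as they do when iterating over a file.

```python
def is_misaligned(field):
    """Return True if a raw fixed-width field looks shifted.

    A field is suspicious when it has content but starts with a blank,
    or when it ends with something other than a space or a newline.
    An empty field (the line ended before this column) is accepted.
    """
    if not field:
        return False
    starts_late = field.startswith(' ') and field.strip() != ''
    runs_over = not field.endswith((' ', '\n'))
    return starts_late or runs_over

def line_is_suspicious(line, ranges):
    """Return True if any raw field of the line looks misaligned."""
    return any(is_misaligned(line[j:k]) for j, k in ranges)

# column ranges assumed from the header used in fetch.py
ranges = [(0, 18), (18, 36), (36, 53), (53, None)]

good = ('apple fruits'.ljust(18) + 'banana fruits'.ljust(18)
        + 'orange fruits'.ljust(17) + 'grapes fruits\n')
shifted = ' ' + good          # every field moved one place to the right
short = 'chico fruits\n'      # line ends before the second column

for name, line in [('good', good), ('shifted', shifted), ('short', short)]:
    print(name, line_is_suspicious(line, ranges))
```

Like the original check, this flags the last field of a final line that lacks a trailing newline, so the file should end with one.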

The magic of [0, 18, 38, 58], the starting positions of the columns, also plays a role in my answer, but it is based on numpy.genfromtxt()

from pathlib import Path
import pandas as pd
import numpy as np

# replicate the file
doc = """Column1           Column2             Column3             Column4
----------------------------------------------------------------------------
apple fruits      banana fruits       orange fruits       grapes fruits
mango fruits      pineapple fruits                        blackberry fruits
                  blueberry fruits    currant fruits      papaya fruits
chico fruits                          peach fruits        pear fruits"""

Path('temp.txt').write_text(doc)

# read the file    
lines = Path('temp.txt').read_text().split('\n')

# use the header to find the column widths
# (this relies on every column name starting with the letter 'C')
header = lines[0]
length = max([len(line) for line in lines])
starts = [i for i, char in enumerate(header) if char=='C'] + [length]
widths = [x-prev for x, prev in zip(starts[1:], starts[:-1])] 
assert sum(widths) == length
data = np.genfromtxt('temp.txt', dtype=None, delimiter=widths, autostrip=True,
                     encoding='utf-8')

# make pandas dataframe 
colnames = [x for x in header.split(' ') if x]
df = pd.DataFrame(data[2:], columns=colnames)

# check it is what we wanted
assert df.to_csv(index=False) == \
"""Column1,Column2,Column3,Column4
apple fruits,banana fruits,orange fruits,grapes fruits
mango fruits,pineapple fruits,,blackberry fruits
,blueberry fruits,currant fruits,papaya fruits
chico fruits,,peach fruits,pear fruits
"""

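Locating the column starts by searching for the letter C only works because every header name here begins with that letter. A hedged alternative, using only the standard library re module, is to take the start of each run of non-blank characters in the header. The helper name header_layout is invented for this sketch, it assumes that column names contain no spaces and that the first column starts at position 0, and the value 76 stands in for the length of the longest line in the file.

```python
import re

def header_layout(header, total_length):
    """Return (names, widths) for a fixed-width header line.

    total_length is the length of the longest line in the file, so the
    last width reaches the end of the widest row.
    """
    matches = list(re.finditer(r'\S+', header))
    names = [m.group() for m in matches]
    starts = [m.start() for m in matches] + [total_length]
    widths = [nxt - cur for cur, nxt in zip(starts, starts[1:])]
    return names, widths

# header rebuilt with str.ljust to match the starts [0, 18, 38, 58]
header = ('Column1'.ljust(18) + 'Column2'.ljust(20)
          + 'Column3'.ljust(20) + 'Column4')

names, widths = header_layout(header, total_length=76)
print(names)
print(widths)
```

The resulting widths list has the same form as the one passed as delimiter to numpy.genfromtxt() above, and names can replace the header.split(' ') step.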