I have tabular data in a text file and I am trying to read it with Python, but I cannot find the delimiter between the columns. Please help me out. Thanks in advance.
Data could look like this:
Column1           Column2             Column3             Column4
----------------------------------------------------------------------------
apple fruits      banana fruits       orange fruits       grapes fruits
mango fruits      pineapple fruits                        blackberry fruits
                  blueberry fruits    currant fruits      papaya fruits
chico fruits                          peach fruits        pear fruits
My expected result is in dictionary format.
I'm working on the assumption that the data is aligned at the same columns in every record.
I put the header line and a typical data line in two distinct variables; you are going to read them from a file instead.
>>> a = 'Column1           Column2             Column3             Column4'
>>> b = 'apple fruits      banana fruits       orange fruits       grapes fruits'
i is a list of indices into the header, initially empty, and inside is a flag that records whether we are currently inside a column name.
>>> i = []
>>> inside = False
We count the characters and check if we are at the beginning of a column name
>>> for n, c in enumerate(a):
...     if c == ' ':
...         inside = False
...         continue
...     if not inside:
...         inside = True
...         i.append(n)
...
>>> i
[0, 18, 38, 58]
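The same start positions can also be found with a regular expression. The following is only a sketch: the helper name `column_starts` is made up for illustration, and it assumes that no column name contains a blank (a name such as "Unit price" would be counted as two columns).

```python
import re

def column_starts(header_line):
    """Return the offset at which each run of non-blank characters begins."""
    return [m.start() for m in re.finditer(r"\S+", header_line)]

# Build the header with explicit padding so the offsets are unambiguous.
header = ("Column1".ljust(18) + "Column2".ljust(20)
          + "Column3".ljust(20) + "Column4")

starts = column_starts(header)
print(starts)  # [0, 18, 38, 58]
```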
We now have the indices at which the columns begin. In slice notation the beginning of the next column is also the end of the current one, so the only thing missing is the end of the last column; for that we can use None, which makes the slice run to the end of the string.
>>> [b[j:k].rstrip() for j, k in zip(i,i[1:]+[None])]
['apple fruits', 'banana fruits', 'orange fruits', 'grapes fruits']
Of course, you have to apply the same index trick to every data line in the input file.
PS: you may want to use the itertools.zip_longest function, as in
[b[j:k].rstrip() for j, k in itertools.zip_longest(i, i[1:])]
You may also want to cache the index pairs in a list, to avoid rebuilding them for every data line:
cached_indices = list(itertools.zip_longest(i, i[1:]))
for line in data:
    c1, c2, c3, c4 = [line[j:k].rstrip() for j, k in cached_indices]
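Since the question asks for dictionaries, here is a minimal sketch that puts the pieces together. It assumes "dictionary format" means one dict per data line keyed by column name, that the first line is the header and the second a dashed separator, and that no column name contains a blank. The function name `parse_fixed_width` is hypothetical.

```python
from itertools import zip_longest

def parse_fixed_width(lines):
    """Parse fixed-width text lines into a list of dicts (one per record)."""
    lines = iter(lines)
    header = next(lines).rstrip("\n")
    next(lines)  # skip the dashed separator line
    # A column starts at every non-blank character preceded by a blank.
    starts = [n for n, c in enumerate(header)
              if c != " " and (n == 0 or header[n - 1] == " ")]
    # Pair each start with the next one; the last end is None.
    spans = list(zip_longest(starts, starts[1:]))
    names = [header[j:k].strip() for j, k in spans]
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        records.append({name: line[j:k].strip()
                        for name, (j, k) in zip(names, spans)})
    return records

sample = [
    "Column1".ljust(18) + "Column2".ljust(20) + "Column3".ljust(20) + "Column4",
    "-" * 76,
    "apple fruits".ljust(18) + "banana fruits".ljust(20)
    + "orange fruits".ljust(20) + "grapes fruits",
    "mango fruits".ljust(18) + "pineapple fruits".ljust(20)
    + "".ljust(20) + "blackberry fruits",
]
records = parse_fixed_width(sample)
print(records[1])
# {'Column1': 'mango fruits', 'Column2': 'pineapple fruits', 'Column3': '', 'Column4': 'blackberry fruits'}
```

With a real file you would pass the open file object instead of the `sample` list. An empty cell comes back as an empty string.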
I tried to implement what I had suggested in a comment below; here is my best effort:
$ cat fetch.py
from itertools import count # this import is necessary
from io import StringIO # this one is needed to simulate an open file
# Your data, notice that some field in the last two lines is misaligned
data = '''\
Column1           Column2             Column3             Column4
----------------------------------------------------------------------------
apple fruits      banana fruits       orange fruits       grapes fruits
mango fruits      pineapple fruits                        blackberry fruits
blueberry fruits currant fruits papaya fruits
chico fruits peach fruits pear fruits
'''
f = StringIO(data) # you may have something like
                   # f = open('fruitfile.fixed')

# read the header line and skip a line
header = next(f).rstrip()
next(f) # skip a line

# a compact way of finding the starts of the columns
indices = [i for i, c0, c1 in zip(count(), ' '+header, header)
           if c0==' ' and c1!=' ']

# We are going to reuse zip(indices, indices[1:]+[None]), so we cache it
ranges = list(zip(indices, indices[1:]+[None]))

# we are ready for a loop on the lines of the file
for nl, line in enumerate(f, 3):
    if line == '\n': continue # don't process blank lines
    # extract the _raw_ fields from a line
    fields = [line[i:j] for i, j in ranges]
    # check that a non-all-blanks field does not start with a blank,
    # check that a field does not terminate with anything but a space
    # or a newline
    if any((fld[0]==' ' and fld.rstrip()) or fld[-1] not in ' \n' for fld in fields):
        # signal the possibility of a misalignment
        print('Possible misalignment in line n.%d:'%nl)
        print('\t|'+header)
        print('\t|'+line.rstrip())
    # the else body is executed if all the fields are OK
    # what I do with the fields is just a possibility
    else:
        print('Data Line n.%d:'%nl)
        fields = [field.rstrip() for field in fields]
        for nf, field in enumerate(fields, 1):
            print('\tField n.%d:\t%r'%(nf, field))
$ python3 fetch.py
Data Line n.3:
        Field n.1:      'apple fruits'
        Field n.2:      'banana fruits'
        Field n.3:      'orange fruits'
        Field n.4:      'grapes fruits'
Data Line n.4:
        Field n.1:      'mango fruits'
        Field n.2:      'pineapple fruits'
        Field n.3:      ''
        Field n.4:      'blackberry fruits'
Possible misalignment in line n.5:
        |Column1           Column2             Column3             Column4
        | blueberry fruits currant fruits papaya fruits
Possible misalignment in line n.6:
        |Column1           Column2             Column3             Column4
        |chico fruits peach fruits pear fruits
$
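One caveat about the check used in the script: it looks at the first and last character of every raw field, which would raise an IndexError on an empty field, for example when a data line is shorter than the header. The sketch below wraps the same two tests in a helper that skips empty fields. The name `looks_misaligned` is mine, and treating an empty field as acceptable is an assumption you may want to change.

```python
def looks_misaligned(fields):
    """Return True if any raw field suggests text crossing a column boundary.

    A field is suspicious when it has content but starts with a blank, or
    when its last character is neither a blank nor a newline. Empty fields
    (from lines shorter than the header) are accepted.
    """
    for field in fields:
        if not field:
            continue  # nothing to inspect, and indexing would fail
        if field[0] == ' ' and field.strip():
            return True
        if field[-1] not in ' \n':
            return True
    return False

ranges = [(0, 18), (18, 38), (38, 58), (58, None)]

short_line = 'apple fruits\n'
print(looks_misaligned([short_line[i:j] for i, j in ranges]))    # False

shifted_line = 'chico fruits peach fruits pear fruits\n'
print(looks_misaligned([shifted_line[i:j] for i, j in ranges]))  # True
```

Note that a last line without a trailing newline would still be flagged, because its final field ends with a letter; strip or normalise line endings first if that matters for your file.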
The magic of [0, 18, 38, 58], the starting positions of the columns, also plays a role in my answer, but it is based on numpy.genfromtxt():
from pathlib import Path
import pandas as pd
import numpy as np
# replicate the file
doc = """Column1           Column2             Column3             Column4
----------------------------------------------------------------------------
apple fruits      banana fruits       orange fruits       grapes fruits
mango fruits      pineapple fruits                        blackberry fruits
                  blueberry fruits    currant fruits      papaya fruits
chico fruits                          peach fruits        pear fruits"""
Path('temp.txt').write_text(doc)
# read the file
lines = Path('temp.txt').read_text().split('\n')
# play with header to find the column widths
header = lines[0]
length = max([len(line) for line in lines])
starts = [i for i, char in enumerate(header) if char=='C'] + [length]
widths = [x-prev for x, prev in zip(starts[1:], starts[:-1])]
assert sum(widths) == length
data = np.genfromtxt('temp.txt', dtype=None, delimiter=widths, autostrip=True,
                     encoding='utf-8')
# make pandas dataframe
colnames = [x for x in header.split(' ') if x]
df = pd.DataFrame(data[2:], columns=colnames)
# check it is what we wanted
assert df.to_csv(index=False) == \
"""Column1,Column2,Column3,Column4
apple fruits,banana fruits,orange fruits,grapes fruits
mango fruits,pineapple fruits,,blackberry fruits
,blueberry fruits,currant fruits,papaya fruits
chico fruits,,peach fruits,pear fruits
"""
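To make the start-to-width arithmetic explicit without numpy or pandas, here is a small standard-library sketch. The helper names `widths_from_starts` and `split_by_widths` are illustrative only, and the total length of 76 is taken from the dashed separator line.

```python
from itertools import accumulate

def widths_from_starts(starts, total_length):
    """Turn column start offsets into column widths."""
    edges = list(starts) + [total_length]
    return [b - a for a, b in zip(edges, edges[1:])]

def split_by_widths(line, widths):
    """Cut a line into stripped fields of the given widths."""
    ends = list(accumulate(widths))   # running totals are the end offsets
    begins = [0] + ends[:-1]
    return [line[b:e].strip() for b, e in zip(begins, ends)]

widths = widths_from_starts([0, 18, 38, 58], 76)
print(widths)  # [18, 20, 20, 18]

row = "chico fruits".ljust(18) + "".ljust(20) + "peach fruits".ljust(20) + "pear fruits"
print(split_by_widths(row, widths))  # ['chico fruits', '', 'peach fruits', 'pear fruits']
```

A line shorter than the total width is not a problem here, because slicing past the end of a string simply returns what is available.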