Python Text to Data Frame with Specific Pattern

Question

I am trying to convert a bunch of text files into a data frame using Pandas.

Thanks to Stack Overflow's amazing community, I almost got the desired output (OP: Python Text File to Data Frame with Specific Pattern ).

Basically I need to turn a text with specific patterns (but sometimes missing data) into a data frame using Pandas.

Here is an example:

Number 01600 London                           Register  4314

Some random text...

************************************* B ***************************************
 1 SHARE: 73/1284
   John Smith
   BORN: 1960-01-01 ADDR: Streetname 3/2   1000
   f 4222/2001
   h 1334/2000
   i 5774/2000
 4 SHARE: 58/1284
   Boris Morgan
   BORN:            ADDR: Streetname 4   2000
 5 SHARE: 23/1284
   James Klein
   BORN:            ADDR:      
   c 4222/1988 Supporting Text
   f 4222/2000 Extra Text
************************************* C ***************************************
More random text...

From the example above, I need to transform the text between ***B*** and ***C*** into a data frame with the following output:

Number	Register	City	Id	Share	Name	Born	Address	c	f	h	i
01600	4314	London	1	73/1284	John Smith	1960-01-01	Streetname 3/2 1000	NaN	4222/2001	1334/2000	5774/2000
01600	4314	London	4	58/1284	Boris Morgan	NaN	Streetname 4 2000	NaN	NaN	NaN	NaN
01600	4314	London	5	23/1284	James Klein	NaN	NaN	4222/1988 Supporting Text	4222/2000 Extra Text	NaN	NaN

Some of the patterns:

The first row for the group contains the word SHARE; before this word is the Id and after it is the Share .
The second row contains the name of the person (and should be extracted entirely to the Name variable).
The third row contains the birthdate (BORN) and the address (ADDR). Sometimes this information is missing - in these cases, the variables Born and Address should be NaN.
When it exists, the fourth row and beyond (which lasts until the next group is reached) starts with a lowercase letter. Each of these rows should be extracted until the end of the paragraph to a variable with the name being the leading lowercase letter.

The code below works when the birthdate and address are available, and when the fourth row and beyond contains only one chunk of information (in the previous example, SHARE: 73/1284 from John Smith has rows f, h and i - all with only one chunk of information and SHARE: 23/1284 from James Klein contains multiple chunks).

import pandas as pd

text = '''Number 01600 London                           Register  4314

Some random text...

************************************* B ***************************************
 1 SHARE: 73/1284
   John Smith
   BORN: 1960-01-01 ADDR: Streetname 3/2   1000
   f 4222/2001
   h 1334/2000
   i 5774/2000
 4 SHARE: 58/1284
   Boris Morgan
   BORN:            ADDR: Streetname 4   2000
 5 SHARE: 23/1284
   James Klein
   BORN:            ADDR:      
   c 4222/1988 Supporting Text
   f 4222/2000 Extra Text
************************************* C ***************************************
More random text...'''

text = [i.strip() for i in text.splitlines()] # create a list of lines

data = []

# extract metadata from first line
number = text[0].split()[1]
city = text[0].split()[2]
register = text[0].split()[4]

# create a list of the index numbers of the lines where new items start
indices = [text.index(i) for i in text if 'SHARE' in i]
# split the list by the retrieved indexes to get a list of lists of items
items = [text[i:j] for i, j in zip([0]+indices, indices+[None])][1:]

for i in items:
    d = {'Number': number, 'Register': register, 'City': city, 'Id': int(i[0].split()[0]), 'Share': i[0].split(': ')[1], 'Name': i[1], 'Born': i[2].split()[1], }
    items = list(s.split() for s in i[3:])
    merged_items = []

    for i in items:
        if len(i[0]) == 1 and i[0].isalpha():
            merged_items.append(i)
        else:
            merged_items[-1][-1] = merged_items[-1][-1] + i[0]
    d.update({name: value for name,value in merged_items})
    data.append(d)

#load the list of dicts as a dataframe
df = pd.DataFrame(data)

Does anyone know how to fix these issues? Thanks in advance.

Answer 1

I wouldn't use the same variable i for both inner and outer loops. Changing your for loop to the following should work cleaner:

for i in items:
    d = {'Number': number, 
         'Register': register, 
         'City': city, 
         'Id': int(i[0].split()[0]), 
         'Share': i[0].split(': ')[1], 
         'Name': i[1], 
         }
    
    if "ADDR" in i[2]:
        born, address = i[2].split("ADDR:")
        d['Born'] = born.replace("BORN:", "").strip()
        d['Address'] = address.strip()
    else:
        d['Born']: i[2].split()[1]
    
    if len(i)>3:
        for j in i[3:]:
            key, value = j.split(" ", 1)
            d[key] = value
    data.append(d)

#load the list of dicts as a dataframe
df = pd.DataFrame(data)

Answer 2

You can slice the list to include only the relevant values by getting the index numbers of the delimiter lines:

import pandas as pd

text = '''Number 01600 London                           Register  4314

Some random text...

************************************* B ***************************************
 1 SHARE: 73/1284
   John Smith
   BORN: 1960-01-01 ADDR: Streetname 3/2   1000
   f 4222/2001
   h 1334/2000
   i 5774/2000
 4 SHARE: 58/1284
   Boris Morgan
   BORN:            ADDR: Streetname 4   2000
 5 SHARE: 23/1284
   James Klein
   BORN:            ADDR:      
   c 4222/1988 Supporting Text
   f 4222/2000 Extra Text
************************************* C ***************************************
More random text...'''

text = [i.strip() for i in text.splitlines()] # create a list of lines

# extract metadata from first line
number = text[0].split()[1]
city = text[0].split()[2]
register = text[0].split()[4]

# get index numbers of delimiter values and filter list
start, end = [text.index(i) for i in text if '*****' in i]
text = text[start+1:end]

data = []

# create a list of the index numbers of the lines where new items start
indices = [text.index(i) for i in text if 'SHARE' in i]
# split the list by the retrieved indexes to get a list of lists of items
items = [text[i:j] for i, j in zip([0]+indices, indices+[None])][1:]

for i in items:
    d = {'Number': number, 'Register': register, 'City': city, 'Id': int(i[0].split()[0]), 'Share': i[0].split(': ')[1], 'Name': i[1], 'Born': i[2].split()[1] if i[2].split()[1][:4].isnumeric() else None}
    items = list(s.split() for s in i[3:])
    merged_items = []

    for i in items:
        if len(i[0]) == 1 and i[0].isalpha():
            merged_items.append([i[0], ' '.join(i[1:])])
        else:
            merged_items[-1][-1] = merged_items[-1][-1] + i[0]
    d.update({name: value for name,value in merged_items})
    data.append(d)

#load the list of dicts as a dataframe
df = pd.DataFrame(data)

	Number	Register	City	Id	Share	Name	Born	f	h	i	c
0	01600	4314	London	1	73/1284	John Smith	1960-01-01	4222/2001	1334/2000	5774/2000	nan
1	01600	4314	London	4	58/1284	Boris Morgan		nan	nan	nan	nan
2	01600	4314	London	5	23/1284	James Klein		4222/2000 Extra Text	nan	nan	4222/1988 Supporting Text

Python Text to Data Frame with Specific Pattern

Question

2 answers

solution1
0 2021-07-12 20:46:54

solution2
0 2021-07-12 20:53:51

Python Text to Data Frame with Specific Pattern

Question

2 answers

solution1 0 2021-07-12 20:46:54

solution2 0 2021-07-12 20:53:51

solution1
0 2021-07-12 20:46:54

solution2
0 2021-07-12 20:53:51