Pandas: parsing values in structured non tabular text

Question

I have a text file with a format like this:

k1[a-token]
v1
v2
k2[a-token]
v1'
k3[a-token]
v1"
v2"
v3"

What is the most pandorable way of reading these data into a dataframe of this form:

        A       B
0       k1      v1
1       k1      v2
2       k2      v1'
3       k3      v1"
4       k3      v2"
5       k3      v3"

that does not involve manual looping? Alternatively, is there any other library that lets me supply only a few regular expressions describing the structure of my text file and returns the data in the tabular form shown above?

Answer 1

setup
borrowing from @jezrael

import pandas as pd
from io import StringIO  # pandas.compat.StringIO was removed from later pandas releases

temp=u"""
k1[a-token]
v1
v2
k2[a-token]
v1'
k3[a-token]
v1"
v2"
v3"
"""
# after testing, replace StringIO(temp) with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep="|", names=['B'])
print (df)

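Use str.extract with named groups in the regex and a lookahead: the optional group A captures the text in front of [a-token], and group B captures the rest of the line.
After forward-filling A, use duplicated to identify the rows we want to keep; the first row of each key is the key line itself, so it is dropped.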

df = df.B.str.extract(r'(?P<A>.*(?=\[a-token\]))?(?P<B>.*)', expand=True).ffill()
df[df.duplicated(subset=['A'])].reset_index(drop=True)

    A    B
0  k1   v1
1  k1   v2
2  k2  v1'
3  k3  v1"
4  k3  v2"
5  k3  v3"
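To see why duplicated is enough here, it can help to look at the intermediate frame before filtering. The following is a minimal sketch on a shortened copy of the sample data; the names lines, pattern, parsed and result are introduced only for illustration.

```python
import pandas as pd

lines = pd.Series(['k1[a-token]', 'v1', 'v2', 'k2[a-token]', "v1'"], name='B')

# Named groups become column names. Group A is optional and only
# matches text that is directly followed by the literal '[a-token]'.
pattern = r'(?P<A>.*(?=\[a-token\]))?(?P<B>.*)'
parsed = lines.str.extract(pattern, expand=True).ffill()
print(parsed)  # key lines show up with B == '[a-token]'

# Within each key, the key line comes first, so it is the only row
# of its group that duplicated() does not flag.
result = parsed[parsed.duplicated(subset=['A'])].reset_index(drop=True)
print(result)
```

On key lines group B holds the leftover '[a-token]' text, which is why those rows must be filtered out afterwards. The filter relies on every key appearing only once in the file.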

Answer 2

You can use read_csv with a separator that does not occur in the data, such as | or ¥, so that every line is read into a single column:

import pandas as pd
from io import StringIO  # pandas.compat.StringIO was removed from later pandas releases

temp=u"""
k1[a-token]
v1
v2
k2[a-token]
v1'
k3[a-token]
v1"
v2"
v3"
"""
# after testing, replace StringIO(temp) with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep="|", names=['B'])
print (df)
             B
0  k1[a-token]
1           v1
2           v2
3  k2[a-token]
4          v1'
5  k3[a-token]
6          v1"
7          v2"
8          v3"

Then insert a new column A holding the key extracted from the lines that contain [a-token], forward-filled down to the value rows. Finally, use boolean indexing with a mask built from duplicated to remove the rows whose B column still holds a key line:

df.insert(0, 'A', df['B'].str.extract(r'(.*)\[a-token\]', expand=False).ffill())
df = df[df['A'].duplicated()].reset_index(drop=True)
print (df)
    A    B
0  k1   v1
1  k1   v2
2  k2  v1'
3  k3  v1"
4  k3  v2"
5  k3  v3"

But if the file has duplicated keys, a repeated key line is itself flagged as a duplicate and stays in the result:

print (df)
              B
0   k1[a-token]
1            v1
2            v2
3   k2[a-token]
4           v1'
5   k3[a-token]
6           v1"
7           v2"
8           v3"
9   k2[a-token]
10          v1'

df.insert(0, 'A', df['B'].str.extract(r'(.*)\[a-token\]', expand=False).ffill())
df = df[df['A'].duplicated()].reset_index(drop=True)
print (df)
    A            B
0  k1           v1
1  k1           v2
2  k2          v1'
3  k3          v1"
4  k3          v2"
5  k3          v3"
6  k2  k2[a-token]
7  k2          v1'

Then it is necessary to change the mask so that it drops every line containing [a-token]:

df.insert(0, 'A', df['B'].str.extract(r'(.*)\[a-token\]', expand=False).ffill())
df = df[~df['B'].str.contains(r'\[a-token\]')].reset_index(drop=True)
print (df)
    A    B
0  k1   v1
1  k1   v2
2  k2  v1'
3  k3  v1"
4  k3  v2"
5  k3  v3"
6  k2  v1'
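The final variant can be wrapped in a small helper. This is only a sketch under a few assumptions: the function name parse_keyed_lines and its token parameter are hypothetical, and it reads the lines with plain Python instead of read_csv, which is a swapped-in choice that avoids having to pick an unused separator. Key lines are detected with endswith rather than a regex.

```python
from io import StringIO

import pandas as pd


def parse_keyed_lines(handle, token='[a-token]'):
    """Turn 'key[token]' lines followed by value lines into an A/B frame."""
    # Read the non-empty lines directly; no separator or quoting rules apply.
    lines = [line.strip() for line in handle if line.strip()]
    df = pd.DataFrame({'B': lines}, dtype='object')

    # Key lines are the ones that end with the token.
    is_key = df['B'].str.endswith(token)

    # Keep the key text on key lines, NaN elsewhere, then forward-fill.
    keys = df['B'].str[:-len(token)].where(is_key).ffill()
    df.insert(0, 'A', keys)

    # Drop key lines by mask, so repeated keys are handled correctly.
    return df[~is_key].reset_index(drop=True)


sample = StringIO("""k1[a-token]
v1
v2
k2[a-token]
v1'
k3[a-token]
v1"
k2[a-token]
v1'
""")
result = parse_keyed_lines(sample)
print(result)
```

For a real file, an open file handle can be passed in place of the StringIO object.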

Answer 3

With your file as 'temp.txt'...

import pandas as pd

df = pd.read_csv('temp.txt',
                 header=None,
                 sep=r'\s+',  # replaces delim_whitespace=True, deprecated in recent pandas
                 names=['data'])

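# True for key lines such as 'k1[a-token]'
bins = df.data.str.endswith('[a-token]')

idx_bins = df[bins].copy()
# Slice the suffix off. str.rstrip would strip a set of characters,
# not a suffix, and could eat trailing letters of the key itself.
idx_bins['data'] = idx_bins['data'].str[:-len('[a-token]')]
idx_vals = df[~bins]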

a = pd.DataFrame(idx_bins.index.values, columns=['a'])
b = pd.DataFrame(idx_vals.index.values, columns=['b'])

merge_df = pd.merge_asof(b, a, left_on='b', right_on='a')
new_df = pd.DataFrame({'A': idx_bins.data.loc[list(merge_df.a)].values, 
                       'B': idx_vals.data.values})
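The merge_asof step is the least obvious part of this answer: it pairs each value row with the nearest key row above it. A minimal sketch of that pairing, using the row positions of the sample file (keys on rows 0, 3 and 5, values on the remaining rows); the name pairs is introduced only for this illustration.

```python
import pandas as pd

# Row positions from the sample file: key lines and value lines.
a = pd.DataFrame({'a': [0, 3, 5]})
b = pd.DataFrame({'b': [1, 2, 4, 6, 7, 8]})

# The default direction is 'backward': each b is paired with the
# last a that is less than or equal to it. Both columns must be sorted.
pairs = pd.merge_asof(b, a, left_on='b', right_on='a')
print(pairs)
```

Column a of the result then gives, for every value row, the index label of the key row it belongs to, which is what the .loc lookup in the answer uses.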

3 answers

solution1
3 ACCEPTED 2017-02-12 06:59:13

solution2
1 2017-02-12 06:32:54

solution3
0 2017-02-12 12:54:57
