简体   繁体   中英

Pandas: parsing values in structured non tabular text

I have a text file with a format like this:

k1[a-token]
v1
v2
k2[a-token]
v1'
k3[a-token]
v1"
v2"
v3"

What is the most pandorable way of reading these data into a dataframe of this form:

        A       B
0       k1      v1
1       k1      v2
2       k2      v1'
3       k3      v1"
4       k3      v2"
5       k3      v3"

that does not involve manual looping? alternatively is there any other library that allows me to input only some regular expressions that would specify the structure of my text file and output the data in the tabular form described above?

setup
borrowing from @jezrael

import pandas as pd
from pandas.compat import StringIO

temp=u"""
k1[a-token]
v1
v2
k2[a-token]
v1'
k3[a-token]
v1"
v2"
v3"
"""
#after testing replace 'StringIO(temp)' to 'filename.csv'
df = pd.read_csv(StringIO(temp), sep="|", names=['B'])
print (df)

  • str.extract with parameters specified in the regex and look ahead
  • use duplicated to identify the rows we want to keep.

df = df.B.str.extract('(?P<A>.*(?=\[a-token\]))?(?P<B>.*)', expand=True).ffill()
df[df.duplicated(subset=['A'])].reset_index(drop=True)

    A    B
0  k1   v1
1  k1   v2
2  k2  v1'
3  k3  v1"
4  k3  v2"
5  k3  v3"

You can use read_csv with some separator which is not in data like | or ¥ :

import pandas as pd
from pandas.compat import StringIO

temp=u"""
k1[a-token]
v1
v2
k2[a-token]
v1'
k3[a-token]
v1"
v2"
v3"
"""
#after testing replace 'StringIO(temp)' to 'filename.csv'
df = pd.read_csv(StringIO(temp), sep="|", names=['B'])
print (df)
             B
0  k1[a-token]
1           v1
2           v2
3  k2[a-token]
4          v1'
5  k3[a-token]
6          v1"
7          v2"
8          v3"

Then insert new column A with extract values with [a-token] and last use boolean indexing with mask by duplicated for remove rows with keys in values column:

df.insert(0, 'A', df['B'].str.extract('(.*)\[a-token\]', expand=False).ffill())
df = df[df['A'].duplicated()].reset_index(drop=True)
print (df)
    A    B
0  k1   v1
1  k1   v2
2  k2  v1'
3  k3  v1"
4  k3  v2"
5  k3  v3"

But if file have duplicated keys :

print (df)
              B
0   k1[a-token]
1            v1
2            v2
3   k2[a-token]
4           v1'
5   k3[a-token]
6           v1"
7           v2"
8           v3"
9   k2[a-token]
10          v1'

df.insert(0, 'A', df['B'].str.extract('(.*)\[a-token\]', expand=False).ffill())
df = df[df['A'].duplicated()].reset_index(drop=True)
print (df)
    A            B
0  k1           v1
1  k1           v2
2  k2          v1'
3  k3          v1"
4  k3          v2"
5  k3          v3"
6  k2  k2[a-token]
7  k2          v1'

Then is necessary change mask to:

df.insert(0, 'A', df['B'].str.extract('(.*)\[a-token\]', expand=False).ffill())
df = df[~df['B'].str.contains('\[a-token]')].reset_index(drop=True)
print (df)
    A    B
0  k1   v1
1  k1   v2
2  k2  v1'
3  k3  v1"
4  k3  v2"
5  k3  v3"
6  k2  v1'

With your file as 'temp.txt'...

df = pd.read_csv('temp.txt',
                 header=None,
                 delim_whitespace=True,
                 names=['data'])

bins = df.data.str.endswith('[a-token]')

idx_bins = df[bins][:]
idx_bins.data = idx_bins.data.str.rstrip(to_strip='[a-token]')
idx_vals = df[~bins][:]

a = pd.DataFrame(idx_bins.index.values, columns=['a'])
b = pd.DataFrame(idx_vals.index.values, columns=['b'])

merge_df = pd.merge_asof(b, a, left_on='b', right_on='a')
new_df = pd.DataFrame({'A': idx_bins.data.loc[list(merge_df.a)].values, 
                       'B': idx_vals.data.values})

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM