I have a text file with a format like this:
k1[a-token]
v1
v2
k2[a-token]
v1'
k3[a-token]
v1"
v2"
v3"
What is the most pandorable way of reading these data into a dataframe of this form:
A B
0 k1 v1
1 k1 v2
2 k2 v1'
3 k3 v1"
4 k3 v2"
5 k3 v3"
that does not involve manual looping? Alternatively, is there any other library that lets me supply only some regular expressions describing the structure of my text file and get the data back in the tabular form shown above?
Setup, borrowing from @jezrael:
import pandas as pd
# pandas.compat.StringIO is gone from newer pandas releases; the standard library one works the same
from io import StringIO
temp=u"""
k1[a-token]
v1
v2
k2[a-token]
v1'
k3[a-token]
v1"
v2"
v3"
"""
# after testing, replace StringIO(temp) with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep="|", names=['B'])
print(df)
Use str.extract with named groups and a lookahead specified in the regex, then duplicated to identify the rows we want to keep.
# raw string so that \[ and \] reach the regex engine unchanged
df = df.B.str.extract(r'(?P<A>.*(?=\[a-token\]))?(?P<B>.*)', expand=True).ffill()
df[df.duplicated(subset=['A'])].reset_index(drop=True)
A B
0 k1 v1
1 k1 v2
2 k2 v1'
3 k3 v1"
4 k3 v2"
5 k3 v3"
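To see why this works, it can help to look at the intermediate frames. The following is a minimal sketch on a shortened copy of the sample data; the variable names (raw, extracted, filled, result) are mine and not part of the answer, and the Series is built from a list rather than read from a file.

```python
import pandas as pd

# Shortened sample data, built directly instead of via read_csv (an assumption for brevity).
raw = pd.Series(["k1[a-token]", "v1", "v2", "k2[a-token]", "v1'"], name="B")

# Group A only matches when the line contains "[a-token]" (lookahead);
# on every other line the optional group is skipped and A is NaN.
pattern = r"(?P<A>.*(?=\[a-token\]))?(?P<B>.*)"
extracted = raw.str.extract(pattern, expand=True)

# Forward-fill carries each key down to the value lines below it.
filled = extracted.ffill()

# The first row of each key is the key line itself, so keeping only the
# "duplicated" rows drops it (this assumes each key appears in one block only).
result = filled[filled.duplicated(subset=["A"])].reset_index(drop=True)
print(result)
```

On key lines column B holds the leftover "[a-token]" text, which is harmless here because those rows are the ones that get dropped.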
You can use read_csv with a separator that does not occur in the data, such as | or ¥:
import pandas as pd
# pandas.compat.StringIO is gone from newer pandas releases; the standard library one works the same
from io import StringIO
temp=u"""
k1[a-token]
v1
v2
k2[a-token]
v1'
k3[a-token]
v1"
v2"
v3"
"""
# after testing, replace StringIO(temp) with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep="|", names=['B'])
print(df)
B
0 k1[a-token]
1 v1
2 v2
3 k2[a-token]
4 v1'
5 k3[a-token]
6 v1"
7 v2"
8 v3"
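If picking a separator that is guaranteed not to occur in the data feels fragile, one possible alternative (not part of this answer) is to skip CSV parsing and build the single column from the raw lines. This is only a sketch: the StringIO object stands in for an opened file, and the name df_lines is made up for illustration.

```python
from io import StringIO

import pandas as pd

# Sample text that deliberately contains "|" and a double quote,
# characters that could confuse a CSV parser.
text = 'k1[a-token]\nv1\nv2|x\nk2[a-token]\nv1"\n'
handle = StringIO(text)  # stand-in for open("filename.txt")

# Iterating over the handle yields one line at a time; blank lines are dropped.
lines = [line.rstrip("\n") for line in handle if line.strip()]
df_lines = pd.DataFrame({"B": lines})
print(df_lines)
```

The resulting frame has the same one-column shape as the read_csv output above, so the remaining steps apply unchanged.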
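Then insert a new column A built with extract from the values that contain [a-token] (forward-filled so every row carries its key), and finally use boolean indexing with a mask from duplicated to remove the rows where the keys themselves appear in the values column: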
# raw string so that \[ and \] reach the regex engine unchanged
df.insert(0, 'A', df['B'].str.extract(r'(.*)\[a-token\]', expand=False).ffill())
df = df[df['A'].duplicated()].reset_index(drop=True)
print(df)
A B
0 k1 v1
1 k1 v2
2 k2 v1'
3 k3 v1"
4 k3 v2"
5 k3 v3"
But if the file has duplicated keys:
print (df)
B
0 k1[a-token]
1 v1
2 v2
3 k2[a-token]
4 v1'
5 k3[a-token]
6 v1"
7 v2"
8 v3"
9 k2[a-token]
10 v1'
df.insert(0, 'A', df['B'].str.extract(r'(.*)\[a-token\]', expand=False).ffill())
df = df[df['A'].duplicated()].reset_index(drop=True)
print(df)
A B
0 k1 v1
1 k1 v2
2 k2 v1'
3 k3 v1"
4 k3 v2"
5 k3 v3"
6 k2 k2[a-token]
7 k2 v1'
Then it is necessary to change the mask so that it drops the key rows by their content instead:
df.insert(0, 'A', df['B'].str.extract(r'(.*)\[a-token\]', expand=False).ffill())
df = df[~df['B'].str.contains(r'\[a-token\]')].reset_index(drop=True)
print(df)
A B
0 k1 v1
1 k1 v2
2 k2 v1'
3 k3 v1"
4 k3 v2"
5 k3 v3"
6 k2 v1'
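The content-based mask can also be packaged as a small helper. The sketch below is only one way to do it: the function name keyed_lines_to_frame and its token parameter are hypothetical, and it assumes that a key line is exactly a line ending with the marker text.

```python
import pandas as pd


def keyed_lines_to_frame(lines, token="[a-token]"):
    """Turn key/value lines into a two-column frame (hypothetical helper)."""
    s = pd.Series(list(lines), dtype="object")
    # Plain suffix test, so no regex escaping of the marker is needed.
    is_key = s.str.endswith(token)
    # Keep only key lines, cut the marker off, then carry each key downwards.
    keys = s.where(is_key).str.slice(stop=-len(token)).ffill()
    # Drop the key rows by content, which also works when a key repeats.
    return pd.DataFrame({"A": keys[~is_key], "B": s[~is_key]}).reset_index(drop=True)


frame = keyed_lines_to_frame(
    ["k1[a-token]", "v1", "v2", "k2[a-token]", "v1'", "k1[a-token]", "v9"]
)
print(frame)
```

Because the mask looks at the line content rather than at duplicated keys, the second k1 block in this sample is handled the same way as the first.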
With your file as 'temp.txt'...
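import pandas as pd

df = pd.read_csv('temp.txt',
                 header=None,
                 sep=r'\s+',  # whitespace-delimited; replaces the deprecated delim_whitespace=True
                 names=['data'])
# True for the key lines, False for the value lines
bins = df.data.str.endswith('[a-token]')
idx_bins = df[bins].copy()  # copy so the assignment below does not touch a view of df
# slice the suffix off; rstrip(to_strip=...) strips a set of characters, not a suffix
idx_bins['data'] = idx_bins['data'].str[:-len('[a-token]')]
idx_vals = df[~bins]
# row positions of the keys (a) and of the values (b)
a = pd.DataFrame(idx_bins.index.values, columns=['a'])
b = pd.DataFrame(idx_vals.index.values, columns=['b'])
# for each value row, find the closest key row at or before it
merge_df = pd.merge_asof(b, a, left_on='b', right_on='a')
new_df = pd.DataFrame({'A': idx_bins.data.loc[list(merge_df.a)].values,
                       'B': idx_vals.data.values})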
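The merge_asof step is the least obvious part, so here is a minimal sketch of it in isolation. The row positions are hard-coded to match the sample file (keys on rows 0, 3 and 5), and the names key_pos, val_pos and matched are mine, not from the answer.

```python
import pandas as pd

# Row positions of key lines and value lines in the sample file.
key_pos = pd.DataFrame({"a": [0, 3, 5]})
val_pos = pd.DataFrame({"b": [1, 2, 4, 6, 7, 8]})

# merge_asof searches backward by default: each value row is paired
# with the last key row whose position is <= its own.
matched = pd.merge_asof(val_pos, key_pos, left_on="b", right_on="a")
print(matched)
```

Both inputs have to be sorted on their join columns, which holds here because the positions come from the file order.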