Pandas: parsing values in structured non-tabular text
I have a text file with a format like this:
k1[a-token]
v1
v2
k2[a-token]
v1'
k3[a-token]
v1"
v2"
v3"
What is the most pandorable way of reading these data into a dataframe of this form:
A B
0 k1 v1
1 k1 v2
2 k2 v1'
3 k3 v1"
4 k3 v2"
5 k3 v3"
that does not involve manual looping? Alternatively, is there any other library that lets me supply only some regular expressions specifying the structure of my text file and outputs the data in the tabular form described above?
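For reference, a regex-only sketch using the standard-library re module (the pattern and variable names below are my own, not from the question) that builds the frame without an explicit row loop:

```python
import re

import pandas as pd

text = """k1[a-token]
v1
v2
k2[a-token]
v1'
k3[a-token]
v1"
v2"
v3"
"""

# One pattern describes the whole structure: a key line ends in
# '[a-token]', and every following line until the next key line
# is a value belonging to that key.
key_re = re.compile(r'(?m)^(?P<A>.*)\[a-token\]\n(?P<vals>(?:(?!.*\[a-token\]).+\n?)*)')

rows = [(m['A'], v)
        for m in key_re.finditer(text)
        for v in m['vals'].splitlines()]
df = pd.DataFrame(rows, columns=['A', 'B'])
print(df)
```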
setup
borrowing from @jezrael
import pandas as pd
from io import StringIO  # pandas.compat.StringIO was removed in pandas 1.0
temp=u"""
k1[a-token]
v1
v2
k2[a-token]
v1'
k3[a-token]
v1"
v2"
v3"
"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep="|", names=['B'])
print (df)
str.extract
Use str.extract with named groups and a look-ahead in the regex, then duplicated to identify the rows we want to keep.
df = df.B.str.extract('(?P<A>.*(?=\[a-token\]))?(?P<B>.*)', expand=True).ffill()
df[df.duplicated(subset=['A'])].reset_index(drop=True)
A B
0 k1 v1
1 k1 v2
2 k2 v1'
3 k3 v1"
4 k3 v2"
5 k3 v3"
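To see what the optional look-ahead group in that pattern does, here is a minimal sketch with the standard-library re module (the sample strings are taken from the data above):

```python
import re

pattern = r'(?P<A>.*(?=\[a-token\]))?(?P<B>.*)'

# On a key line, group A captures the text before '[a-token]'
# without consuming it, so group B is left with the literal suffix.
m = re.match(pattern, 'k1[a-token]')
print(m.group('A'), m.group('B'))  # k1 [a-token]

# On a value line, the optional group A does not participate (None)
# and B captures the whole line; ffill() later fills A down from
# the nearest key row above.
m = re.match(pattern, 'v1')
print(m.group('A'), m.group('B'))  # None v1
```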
You can use read_csv with some separator which is not in the data, like | or ¥:
import pandas as pd
from io import StringIO  # pandas.compat.StringIO was removed in pandas 1.0
temp=u"""
k1[a-token]
v1
v2
k2[a-token]
v1'
k3[a-token]
v1"
v2"
v3"
"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep="|", names=['B'])
print (df)
B
0 k1[a-token]
1 v1
2 v2
3 k2[a-token]
4 v1'
5 k3[a-token]
6 v1"
7 v2"
8 v3"
Then insert a new column A with the values extracted by the [a-token] pattern, and finally use boolean indexing with a duplicated mask to remove the key rows from the values column:
df.insert(0, 'A', df['B'].str.extract('(.*)\[a-token\]', expand=False).ffill())
df = df[df['A'].duplicated()].reset_index(drop=True)
print (df)
A B
0 k1 v1
1 k1 v2
2 k2 v1'
3 k3 v1"
4 k3 v2"
5 k3 v3"
But if the file has duplicated keys:
print (df)
B
0 k1[a-token]
1 v1
2 v2
3 k2[a-token]
4 v1'
5 k3[a-token]
6 v1"
7 v2"
8 v3"
9 k2[a-token]
10 v1'
df.insert(0, 'A', df['B'].str.extract('(.*)\[a-token\]', expand=False).ffill())
df = df[df['A'].duplicated()].reset_index(drop=True)
print (df)
A B
0 k1 v1
1 k1 v2
2 k2 v1'
3 k3 v1"
4 k3 v2"
5 k3 v3"
6 k2 k2[a-token]
7 k2 v1'
Then it is necessary to change the mask to:
df.insert(0, 'A', df['B'].str.extract('(.*)\[a-token\]', expand=False).ffill())
df = df[~df['B'].str.contains('\[a-token]')].reset_index(drop=True)
print (df)
A B
0 k1 v1
1 k1 v2
2 k2 v1'
3 k3 v1"
4 k3 v2"
5 k3 v3"
6 k2 v1'
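The difference between the two masks can be checked directly on a small sample with a repeated key (the Series below is an assumption mirroring the data above):

```python
import pandas as pd

# Sample where key 'k2' appears twice.
df = pd.DataFrame({'B': ['k1[a-token]', 'v1', 'k2[a-token]', "v1'",
                         'k2[a-token]', 'v2']})
df.insert(0, 'A', df['B'].str.extract(r'(.*)\[a-token\]', expand=False).ffill())

# duplicated() marks the second 'k2[a-token]' row as a duplicate of the
# first key row, so that key row wrongly survives the filter:
print(df[df['A'].duplicated()])

# str.contains drops every key row, repeated or not:
print(df[~df['B'].str.contains(r'\[a-token\]')].reset_index(drop=True))
```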
With your file as 'temp.txt'...
df = pd.read_csv('temp.txt',
header=None,
delim_whitespace=True,
names=['data'])
bins = df.data.str.endswith('[a-token]')
idx_bins = df[bins][:]
idx_bins.data = idx_bins.data.str.replace(r'\[a-token\]$', '', regex=True)  # rstrip would strip a character set, not the suffix
idx_vals = df[~bins][:]
a = pd.DataFrame(idx_bins.index.values, columns=['a'])
b = pd.DataFrame(idx_vals.index.values, columns=['b'])
merge_df = pd.merge_asof(b, a, left_on='b', right_on='a')
new_df = pd.DataFrame({'A': idx_bins.data.loc[list(merge_df.a)].values,
'B': idx_vals.data.values})
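The merge_asof step is the heart of this answer: it pairs each value row's positional index with the nearest key-row index at or before it. A minimal sketch with the indices hard-coded from the sample data (an assumption, for illustration only):

```python
import pandas as pd

# Key rows sit at indices 0, 3, 5; value rows at 1, 2, 4, 6, 7, 8
# (taken from the sample file above).
a = pd.DataFrame({'a': [0, 3, 5]})
b = pd.DataFrame({'b': [1, 2, 4, 6, 7, 8]})

# For each b, merge_asof picks the largest a that is <= b,
# i.e. the key row immediately above each value row.
merged = pd.merge_asof(b, a, left_on='b', right_on='a')
print(merged)
```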