
Pandas: parsing values in structured non-tabular text

I have a text file with a format like this:

k1[a-token]
v1
v2
k2[a-token]
v1'
k3[a-token]
v1"
v2"
v3"

What is the most pandorable way of reading these data into a dataframe of this form:

        A       B
0       k1      v1
1       k1      v2
2       k2      v1'
3       k3      v1"
4       k3      v2"
5       k3      v3"

that does not involve manual looping? Alternatively, is there any other library that lets me supply only some regular expressions describing the structure of my text file and outputs the data in the tabular form described above?
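For reference, this is the manual-loop version I am trying to avoid (a minimal sketch using only the standard-library re module; 'filename.txt' is a placeholder path):

import re
import pandas as pd

key_re = re.compile(r'^(?P<key>.*)\[a-token\]$')  # matches a key line such as 'k1[a-token]'

rows, current_key = [], None
with open('filename.txt') as fh:
    for line in fh:
        line = line.strip()
        if not line:
            continue
        m = key_re.match(line)
        if m:
            current_key = m.group('key')          # start of a new key block
        else:
            rows.append((current_key, line))      # value belonging to the current key

df = pd.DataFrame(rows, columns=['A', 'B'])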

setup
borrowing from @jezrael

import pandas as pd
from io import StringIO

temp=u"""
k1[a-token]
v1
v2
k2[a-token]
v1'
k3[a-token]
v1"
v2"
v3"
"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep="|", names=['B'])
print (df)

  • str.extract with named groups specified in the regex and a look-ahead
  • use duplicated to identify the rows we want to keep.

df = df.B.str.extract(r'(?P<A>.*(?=\[a-token\]))?(?P<B>.*)', expand=True).ffill()
df[df.duplicated(subset=['A'])].reset_index(drop=True)

    A    B
0  k1   v1
1  k1   v2
2  k2  v1'
3  k3  v1"
4  k3  v2"
5  k3  v3"

You can use read_csv with some separator that does not appear in the data, like | or ¥:

import pandas as pd
from io import StringIO

temp=u"""
k1[a-token]
v1
v2
k2[a-token]
v1'
k3[a-token]
v1"
v2"
v3"
"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep="|", names=['B'])
print (df)
             B
0  k1[a-token]
1           v1
2           v2
3  k2[a-token]
4          v1'
5  k3[a-token]
6          v1"
7          v2"
8          v3"

Then insert a new column A with the values extracted before [a-token] (forward-filled), and finally use boolean indexing with a duplicated mask to remove the rows that carry the keys from the values column:

df.insert(0, 'A', df['B'].str.extract(r'(.*)\[a-token\]', expand=False).ffill())
df = df[df['A'].duplicated()].reset_index(drop=True)
print (df)
    A    B
0  k1   v1
1  k1   v2
2  k2  v1'
3  k3  v1"
4  k3  v2"
5  k3  v3"

But if the file has duplicated keys:

print (df)
              B
0   k1[a-token]
1            v1
2            v2
3   k2[a-token]
4           v1'
5   k3[a-token]
6           v1"
7           v2"
8           v3"
9   k2[a-token]
10          v1'

df.insert(0, 'A', df['B'].str.extract(r'(.*)\[a-token\]', expand=False).ffill())
df = df[df['A'].duplicated()].reset_index(drop=True)
print (df)
    A            B
0  k1           v1
1  k1           v2
2  k2          v1'
3  k3          v1"
4  k3          v2"
5  k3          v3"
6  k2  k2[a-token]
7  k2          v1'

Then it is necessary to change the mask to:

df.insert(0, 'A', df['B'].str.extract(r'(.*)\[a-token\]', expand=False).ffill())
df = df[~df['B'].str.contains(r'\[a-token\]')].reset_index(drop=True)
print (df)
    A    B
0  k1   v1
1  k1   v2
2  k2  v1'
3  k3  v1"
4  k3  v2"
5  k3  v3"
6  k2  v1'
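The ~contains mask should also give the correct result for the first file without duplicated keys, so it can be used as the default in both situations.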

With your file as 'temp.txt'...

import pandas as pd

# read every line of the file as a single column named 'data'
df = pd.read_csv('temp.txt',
                 header=None,
                 delim_whitespace=True,
                 names=['data'])

# True on the key lines that end with '[a-token]'
bins = df.data.str.endswith('[a-token]')

# key rows, with the '[a-token]' suffix stripped
# (rstrip removes a trailing *set* of characters, which is safe for keys
# like k1/k2/k3 that do not themselves end in any of those characters)
idx_bins = df[bins].copy()
idx_bins['data'] = idx_bins.data.str.rstrip(to_strip='[a-token]')
# value rows
idx_vals = df[~bins].copy()

# original row positions of the keys and of the values
a = pd.DataFrame(idx_bins.index.values, columns=['a'])
b = pd.DataFrame(idx_vals.index.values, columns=['b'])

# match every value position to the nearest preceding key position
merge_df = pd.merge_asof(b, a, left_on='b', right_on='a')
new_df = pd.DataFrame({'A': idx_bins.data.loc[list(merge_df.a)].values,
                       'B': idx_vals.data.values})
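merge_asof lines each value row up with the nearest preceding key row, so printing the result should reproduce the frame asked for in the question:

print(new_df)

    A    B
0  k1   v1
1  k1   v2
2  k2  v1'
3  k3  v1"
4  k3  v2"
5  k3  v3"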
