
Convert a list of lists to a structured pandas dataframe

I am trying to convert the following data structure:

[image: original structure]

to the format below in Python 3:

[image: output structure]

If your data looks like this:

array = [['PIN: 123 COD: 222 \n', 'LOA: 124 LOC: Sea \n'],
        ['PIN:456 COD:555 \n', 'LOA:678 LOC:Chi \n']]

You can do this:

Step 1: Use regular expressions to parse your data, because each entry is a string.

See more about regular expressions.

import re

# Join each entry's lines, then collect (key, value) pairs
raws = [re.findall(r'(PIN|COD|LOA|LOC): ?(\w+)', ''.join(row)) for row in array]

Output:

[[('PIN', '123'), ('COD', '222'), ('LOA', '124'), ('LOC', 'Sea')], [('PIN', '456'), ('COD', '555'), ('LOA', '678'), ('LOC', 'Chi')]]
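
A variation on the pattern above, as a minimal sketch: if the field names are not known in advance, the alternation `(PIN|COD|LOA|LOC)` could be replaced by `\w+`. This assumes every key is a single word followed by a colon. `parse_entry` and `FIELD` are hypothetical names, not part of the original answer.

```python
import re

# Assumption: every field looks like "KEY: value" or "KEY:value".
FIELD = re.compile(r'(\w+): ?(\w+)')

def parse_entry(lines):
    """Return the (key, value) pairs found in one entry's list of lines."""
    return FIELD.findall(''.join(lines))

array = [['PIN: 123 COD: 222 \n', 'LOA: 124 LOC: Sea \n'],
         ['PIN:456 COD:555 \n', 'LOA:678 LOC:Chi \n']]

parsed = [parse_entry(entry) for entry in array]
```

The optional space (` ?`) is what lets the same pattern match both `PIN: 123` and `PIN:456`.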

Step 2: Extract the row values and column names.

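import numpy as np

arr = np.array(raws)      # shape: (rows, fields, 2)
columns = arr[0, :, 0]    # keys, taken from the first row
raws = arr[:, :, 1]       # values from every row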

Output:

raws:

[['123' '222' '124' 'Sea']
 ['456' '555' '678' 'Chi']]

columns:

['PIN' 'COD' 'LOA' 'LOC']

Step 3: Now we can create the dataframe.

df = pd.DataFrame(raws, columns=columns)

Output:

   PIN  COD  LOA  LOC
0  123  222  124  Sea
1  456  555  678  Chi
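
As an alternative sketch that skips the NumPy slicing in step 2, the `(key, value)` pairs from `re.findall` can be turned into one dict per row and passed straight to `pd.DataFrame`. This is a swapped-in technique, not the method used above, and `records` is a hypothetical name.

```python
import re
import pandas as pd

array = [['PIN: 123 COD: 222 \n', 'LOA: 124 LOC: Sea \n'],
         ['PIN:456 COD:555 \n', 'LOA:678 LOC:Chi \n']]

# Each entry becomes a dict such as {'PIN': '123', 'COD': '222', ...}.
records = [dict(re.findall(r'(PIN|COD|LOA|LOC): ?(\w+)', ''.join(entry)))
           for entry in array]

# Column names come from the dict keys, in insertion order.
df = pd.DataFrame(records)
```

All values stay strings here, just as in the NumPy version.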

Is this what you want?

I hope it helps; I'm not sure about your input format.

And don't forget to import the libraries! (I used pandas as pd, numpy as np, and re.)

UPD: Another way. I have created a log file like yours:

with open('example.log') as f:
    array = f.readlines()

Output:

['PIN: 123 COD: 222 \n',
 'LOA: 124 LOC: Sea \n',
 'PIN: 12 COD: 322 \n',
 'LOA: 14 LOC: Se \n']

Then split each line on ' ', drop the trailing '\n', and reshape:

raws = np.array([i.split(' ')[:-1] for i in array]).reshape(2, 4, 2)

In reshape, the first number is the number of rows in your future dataframe, the second is the number of columns, and the last one does not need to change. This won't work if there is no whitespace between the data and the '\n' on each line. If that is your case, I will change the example. Output:

array([[['PIN:', '123'],
        ['COD:', '222'],
        ['LOA:', '124'],
        ['LOC:', 'Sea']],

       [['PIN:', '12'],
        ['COD:', '322'],
        ['LOA:', '14'],
        ['LOC:', 'Se']]], 
      dtype='|S4')
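
The whitespace caveat above can be sidestepped, as a sketch: calling `str.split()` with no argument splits on any run of whitespace and discards the trailing newline, so the `[:-1]` slice is not needed. It assumes, as in the sample log, that each line still has a space between every key and its value. The `lines` and `tokens` names are illustrative only.

```python
import numpy as np

lines = ['PIN: 123 COD: 222 \n',
         'LOA: 124 LOC: Sea\n',   # no space before the newline
         'PIN: 12 COD: 322 \n',
         'LOA: 14 LOC: Se\n']

# split() with no argument ignores leading/trailing whitespace, including '\n'
tokens = np.array([line.split() for line in lines])

# -1 lets NumPy infer the row count; each row has 4 (key, value) pairs
raws = tokens.reshape(-1, 4, 2)
```

Passing `-1` as the first dimension also removes the need to hardcode the row count.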

Then take the row values and the columns:

columns = np.array(raws)[:,:,0][0]
raws = np.array(raws)[:,:,1]

Finally, create the dataframe (and cut the last character, the colon, from each column name):

pd.DataFrame(raws, columns=[i[:-1] for i in columns])

Output:

   PIN  COD  LOA  LOC
0  123  222  124  Sea
1   12  322   14   Se

If you have many log files, you can do this for each one in a for loop, save each dataframe in a list (for example, a list called DF_array), and then use pd.concat to build a single dataframe from the list of dataframes.

pd.concat(DF_array)
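
A small sketch of what `pd.concat` does with such a list. The two dataframes below are hypothetical stand-ins for ones parsed from log files. By default each dataframe keeps its own index, so row labels repeat; passing `ignore_index=True` renumbers the rows instead.

```python
import pandas as pd

# Hypothetical stand-ins for dataframes parsed from two log files.
DF_array = [
    pd.DataFrame({'PIN': ['123', '12'], 'LOC': ['Sea', 'Se']}),
    pd.DataFrame({'PIN': ['456'], 'LOC': ['Chi']}),
]

stacked = pd.concat(DF_array)                         # index: 0, 1, 0
renumbered = pd.concat(DF_array, ignore_index=True)   # index: 0, 1, 2
```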

If you need, I can add an example.

UPD: I have created a directory with log files and then built a list of all the files in PATH:

import os

PATH = "logs_data/"
files = [os.path.join(PATH, name) for name in os.listdir(PATH)]

Then run a for loop, as in the last update:

dfs = []
for f in files:
    with open(f) as fh:
        array = fh.readlines()
    # Each record spans two lines; // keeps the row count an integer in Python 3
    raws = np.array([i.split(' ')[:-1] for i in array]).reshape(len(array) // 2, 4, 2)
    columns = raws[:, :, 0][0]
    raws = raws[:, :, 1]
    # Strip the trailing ':' from each column name
    df = pd.DataFrame(raws, columns=[i[:-1] for i in columns])
    dfs.append(df)
result = pd.concat(dfs)

Output:

     PIN   COD    LOA  LOC
0    123   222    124  Sea
1     12   322     14   Se
2      1    32      4  Ses
0  15673  2324  13464  Sss
1  12452  3122  11234   Se
2     11   132      4  Ses
0    123   222    124  Sea
1     12   322     14   Se
2      1    32      4  Ses
