Reading multiple YAML files into a pandas DataFrame

Question

I realize related questions have already been addressed here (e.g., Reading csv zipped files in python, How can I parse a YAML file in Python, Retrieving data from a yaml file based on a Python list). Nevertheless, I hope this question is different.

I know how to load a single YAML file into a pandas DataFrame:

import yaml
import pandas as pd

with open(r'1000851.yaml') as file:
    df = pd.json_normalize(yaml.safe_load(file))

df.head()

I would like to read several YAML files from a directory into pandas DataFrames and concatenate them into one big DataFrame. I have not been able to figure it out, though:

import pandas as pd
import glob
import yaml

path = r'../input/cricsheet-a-retrosheet-for-cricket/all' # use your path
all_files = glob.glob(path + "/*.yaml")

li = []

for filename in all_files:
    df = pd.json_normalize(yaml.load(filename, Loader=yaml.FullLoader))
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)

Error

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<timed exec> in <module>

/opt/conda/lib/python3.7/site-packages/pandas/io/json/_normalize.py in _json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep, max_level)
    268 
    269     if record_path is None:
--> 270         if any([isinstance(x, dict) for x in y.values()] for y in data):
    271             # naive normalization, this is idempotent for flat records
    272             # and potentially will inflate the data considerably for

/opt/conda/lib/python3.7/site-packages/pandas/io/json/_normalize.py in <genexpr>(.0)
    268 
    269     if record_path is None:
--> 270         if any([isinstance(x, dict) for x in y.values()] for y in data):
    271             # naive normalization, this is idempotent for flat records
    272             # and potentially will inflate the data considerably for

AttributeError: 'str' object has no attribute 'values'

Sample Dataset Zipped

Sample Dataset

Is there a way to do this and read the files efficiently?

Answer 1

The first snippet in your question and the second one you added do different things.

The first snippet reads the YAML file correctly, but the second one is broken:

for filename in all_files:
    # `filename` here is just a string containing the name of the file. 
    df = pd.json_normalize(yaml.load(filename, Loader=yaml.FullLoader))
    li.append(df)
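
To see why this fails, here is a minimal sketch (the path is only an illustrative string; no such file needs to exist). When the YAML loader is handed a string, it parses that string as YAML text rather than opening a file, so a bare path comes back as a plain `str`. `json_normalize` then iterates over that string and calls `.values()` on each element, which matches the `AttributeError` in the traceback.

```python
import yaml

# A path-like string is itself valid YAML: a single plain scalar.
# The loader never touches the file system here.
parsed = yaml.safe_load("some_dir/1000851.yaml")

# The result is the same string, not the dict stored in the file.
print(type(parsed))
```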

The problem is that you need to read the files. Currently you are passing only the file name to `yaml.load`, not the file content. Do this instead:

li=[]
# Only loading 3 files:
for filename in all_files[:3]:
    with open(filename,'r') as fh:
        df = pd.json_normalize(yaml.safe_load(fh.read()))
    li.append(df)

len(li)
3

pd.concat(li)

output:
  
                                             innings  meta.data_version meta.created  meta.revision info.city info.competition  ... info.player_of_match                         info.teams info.toss.decision info.toss.winner              info.umpires                           info.venue
0  [{'1st innings': {'team': 'Glamorgan', 'delive...                0.9   2020-09-01              1   Bristol   Vitality Blast  ...          [AG Salter]       [Glamorgan, Gloucestershire]              field  Gloucestershire  [JH Evans, ID Blackwell]                        County Ground
0  [{'1st innings': {'team': 'Pune Warriors', 'de...                0.9   2013-05-19              1      Pune              IPL  ...          [LJ Wright]  [Pune Warriors, Delhi Daredevils]                bat    Pune Warriors    [NJ Llong, SJA Taufel]           Subrata Roy Sahara Stadium
0  [{'1st innings': {'team': 'Botswana', 'deliver...                0.9   2020-08-29              1  Gaborone              NaN  ...       [A Rangaswamy]              [Botswana, St Helena]                bat         Botswana   [R D'Mello, C Thorburn]  Botswana Cricket Association Oval 1

[3 rows x 18 columns]
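
Putting the pieces together, the whole directory could be loaded along these lines. This is only a sketch: `load_yaml_dir` and its `limit` parameter are hypothetical names introduced for illustration, and it assumes every file holds a single mapping that `json_normalize` can flatten. Since the question asks about efficiency, it tries PyYAML's C-accelerated `CSafeLoader`, which is only available when PyYAML was built against libyaml, and otherwise falls back to the pure-Python `SafeLoader`.

```python
import glob
import os

import pandas as pd
import yaml

# Use the C-accelerated loader if this PyYAML build provides it;
# otherwise fall back to the pure-Python safe loader.
try:
    from yaml import CSafeLoader as Loader
except ImportError:
    from yaml import SafeLoader as Loader


def load_yaml_dir(path, limit=None):
    """Read the *.yaml files in `path` into one DataFrame (hypothetical helper)."""
    # Sort for a reproducible row order; limit=None keeps every file.
    filenames = sorted(glob.glob(os.path.join(path, "*.yaml")))[:limit]
    frames = []
    for filename in filenames:
        with open(filename, "r") as fh:
            # Pass the open file handle, not the file name.
            record = yaml.load(fh, Loader=Loader)
        frames.append(pd.json_normalize(record))
    # ignore_index=True renumbers rows 0..n-1 instead of repeating 0.
    # Note: pd.concat raises ValueError if no files were found.
    return pd.concat(frames, ignore_index=True)
```

Calling `load_yaml_dir(path, limit=3)` with the `path` from the question would reproduce the three-file example above, with the index running 0, 1, 2 rather than 0, 0, 0. Nested columns such as `innings` are left as lists, as in the output shown.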

Answer 1 (accepted), 2020-12-28 16:39:02