Read multiple YAML files into a pandas DataFrame

I realize related questions have already been answered here (e.g., Reading csv zipped files in python, How can I parse a YAML file in Python, Retrieving data from a yaml file based on a Python list). Nevertheless, I hope this question is different.

I know how to load a single YAML file into a pandas DataFrame:

import yaml
import pandas as pd

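with open(r'1000851.yaml') as file:
    # pd.json_normalize replaces the deprecated pd.io.json.json_normalize;
    # yaml.safe_load avoids the Loader argument that yaml.load now requires.
    df = pd.json_normalize(yaml.safe_load(file))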

df.head()

I would like to read several YAML files from a directory into pandas DataFrames and concatenate them into one big DataFrame. I have not been able to figure it out, though.

import pandas as pd
import glob

path = r'../input/cricsheet-a-retrosheet-for-cricket/all' # use your path
all_files = glob.glob(path + "/*.yaml")

li = []

for filename in all_files:
    df = pd.json_normalize(yaml.load(filename, Loader=yaml.FullLoader))
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)

Error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<timed exec> in <module>

/opt/conda/lib/python3.7/site-packages/pandas/io/json/_normalize.py in _json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep, max_level)
    268 
    269     if record_path is None:
--> 270         if any([isinstance(x, dict) for x in y.values()] for y in data):
    271             # naive normalization, this is idempotent for flat records
    272             # and potentially will inflate the data considerably for

/opt/conda/lib/python3.7/site-packages/pandas/io/json/_normalize.py in <genexpr>(.0)
    268 
    269     if record_path is None:
--> 270         if any([isinstance(x, dict) for x in y.values()] for y in data):
    271             # naive normalization, this is idempotent for flat records
    272             # and potentially will inflate the data considerably for

AttributeError: 'str' object has no attribute 'values'

Sample Dataset Zipped

Sample Dataset

Is there a way to do this and read the files efficiently?

It seems the first part of your code and the second part you added are different.

The first part correctly reads the YAML file, but the second part is broken:

for filename in all_files:
    # `filename` here is just a string containing the name of the file. 
    df = pd.json_normalize(yaml.load(filename, Loader=yaml.FullLoader))
    li.append(df)
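
To see why this fails, here is a small sketch. It reuses the file name from your first snippet, and the file does not need to exist. When `yaml.safe_load` (or `yaml.load`) is given a string, it parses that string as YAML text, so the name simply comes back as a string:

```python
import yaml

# File name taken from the question; nothing is opened or read here.
filename = "1000851.yaml"

# The string itself is treated as the YAML document: a single plain scalar.
parsed = yaml.safe_load(filename)

print(type(parsed).__name__)  # str
print(parsed)                 # 1000851.yaml
```

`json_normalize` then receives a plain string instead of a dict, which is what the traceback complains about.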

The problem is that you need to read the files. Currently you are passing only the filename, not the file content. Do this instead:

li=[]
# Only loading 3 files:
for filename in all_files[:3]:
    with open(filename,'r') as fh:
        df = pd.json_normalize(yaml.safe_load(fh.read()))
    li.append(df)

len(li)
3

pd.concat(li)

output:
  
                                             innings  meta.data_version meta.created  meta.revision info.city info.competition  ... info.player_of_match                         info.teams info.toss.decision info.toss.winner              info.umpires                           info.venue
0  [{'1st innings': {'team': 'Glamorgan', 'delive...                0.9   2020-09-01              1   Bristol   Vitality Blast  ...          [AG Salter]       [Glamorgan, Gloucestershire]              field  Gloucestershire  [JH Evans, ID Blackwell]                        County Ground
0  [{'1st innings': {'team': 'Pune Warriors', 'de...                0.9   2013-05-19              1      Pune              IPL  ...          [LJ Wright]  [Pune Warriors, Delhi Daredevils]                bat    Pune Warriors    [NJ Llong, SJA Taufel]           Subrata Roy Sahara Stadium
0  [{'1st innings': {'team': 'Botswana', 'deliver...                0.9   2020-08-29              1  Gaborone              NaN  ...       [A Rangaswamy]              [Botswana, St Helena]                bat         Botswana   [R D'Mello, C Thorburn]  Botswana Cricket Association Oval 1

[3 rows x 18 columns]
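
If you want all files in one go, the loop above could be wrapped up roughly like this. This is only a sketch: `load_yaml_dir`, its `pattern` parameter and the `source_file` column are names I made up for illustration, and it assumes every file holds a single top-level mapping, as the cricsheet files shown above appear to. Passing `ignore_index=True` to `pd.concat` gives a running index instead of the repeated `0` in the output above.

```python
import glob
import os

import pandas as pd
import yaml


def load_yaml_dir(path, pattern="*.yaml"):
    """Read every matching YAML file in `path` into one DataFrame.

    Hypothetical helper, not part of pandas or PyYAML. Assumes each file
    contains one top-level mapping (one row per file).
    """
    frames = []
    for filename in sorted(glob.glob(os.path.join(path, pattern))):
        # Open the file and parse its content, not its name.
        with open(filename, "r") as fh:
            record = yaml.safe_load(fh)
        df = pd.json_normalize(record)
        # Keep track of which file each row came from (illustrative column).
        df["source_file"] = os.path.basename(filename)
        frames.append(df)
    if not frames:
        # pd.concat raises on an empty list, so return an empty frame instead.
        return pd.DataFrame()
    return pd.concat(frames, axis=0, ignore_index=True)
```

You would call it as `frame = load_yaml_dir(path)` with the `path` from your question. As for efficiency, most of the time is likely spent parsing YAML. If your PyYAML install was built with libyaml, `yaml.load(fh, Loader=yaml.CSafeLoader)` is usually noticeably faster than `yaml.safe_load(fh)`, but that loader is not available in every install.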
