Importing multiple files from Google Cloud Bucket to Datalab instance
I have a bucket set up on Google Cloud containing a few hundred json files and am trying to work with them in a datalab instance running python 3.
So, I can easily see them as objects using
%gcs list --objects gs://<BUCKET_NAME>
Further, I can read in an individual file/object using
import google.datalab.storage as storage
import pandas as pd
from io import BytesIO
myBucket = storage.Bucket('<BUCKET_NAME>')
data_csv = myBucket.object('<FILE_NAME>.json')
uri = data_csv.uri
%gcs read --object $uri --variable data
df = pd.read_csv(BytesIO(data))
df.head()
(FYI, I understand that my example is reading a json as a csv, but let's ignore that - I'll cross that bridge on my own)
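(As an aside, pandas can also parse JSON bytes directly via read_json, so the CSV detour isn't strictly necessary. A minimal local sketch; the byte string below is hypothetical stand-in data, not read from the bucket:)

```python
import pandas as pd
from io import BytesIO

# Hypothetical stand-in for the bytes %gcs read would place in `data`
data = b'[{"id": 1, "country": "sweden"}, {"id": 2, "country": "spain"}]'

# read_json parses the JSON payload directly instead of treating it as CSV
df = pd.read_json(BytesIO(data))
print(df.head())
```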
What I can't figure out is how to loop through the bucket and pull all of the json files into pandas... how do I do that? Is that the way I should be thinking of this - is there a way to call the files in the bucket from pandas directly (since they're already treated as objects)?
As an extra bit - what if a file is saved as a json, but isn't actually that structure? How can I handle that?
Essentially, I guess, I'm looking for the functionality of the blob package, but using cloud buckets + datalab.
Any help is greatly appreciated.
This can be done using Bucket.objects, which returns an iterator with all matching files. Specify a prefix or leave it empty to match all files in the bucket. I did an example with two files, countries1.csv and countries2.csv:
$ cat countries1.csv
id,country
1,sweden
2,spain
$ cat countries2.csv
id,country
3,italy
4,france
And used the following Datalab snippet:
import google.datalab.storage as storage
import pandas as pd
from io import BytesIO

myBucket = storage.Bucket('BUCKET_NAME')
object_list = myBucket.objects(prefix='countries')

df_list = []
for obj in object_list:  # "obj" instead of "object" to avoid shadowing the built-in
  uri = obj.uri
  %gcs read --object $uri --variable data
  df_list.append(pd.read_csv(BytesIO(data)))

concatenated_df = pd.concat(df_list, ignore_index=True)
concatenated_df.head()
which will output the combined csv:
id country
0 1 sweden
1 2 spain
2 3 italy
3 4 france
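The loop can also be factored into a small helper that is easy to test locally. Here read_bytes is a hypothetical placeholder for whatever fetch mechanism you use (the %gcs read magic in Datalab, or a download call from another storage client); the in-memory byte strings below stand in for real bucket objects:

```python
import pandas as pd
from io import BytesIO

def concat_csv_objects(objects, read_bytes):
    """Concatenate the CSV contents of an iterable of storage objects
    into one DataFrame. read_bytes(obj) must return raw bytes."""
    frames = [pd.read_csv(BytesIO(read_bytes(obj))) for obj in objects]
    return pd.concat(frames, ignore_index=True)

# Local usage example with in-memory stand-ins for bucket objects:
fake_objects = [b'id,country\n1,sweden\n2,spain\n',
                b'id,country\n3,italy\n4,france\n']
combined = concat_csv_objects(fake_objects, lambda obj: obj)
print(combined)
```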
Take into account that I combined all csv files into a single Pandas dataframe using this approach but you might want to load them into different ones depending on the use case. If you want to retrieve all files in the bucket just use this instead:
object_list = myBucket.objects()
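On the extra question about files that carry a .json extension but don't actually contain valid JSON: one approach is to attempt the parse and skip (or log) anything that fails. A minimal sketch, assuming the raw bytes are already in hand; load_json_bytes is a hypothetical helper name, and it expects a JSON list of records:

```python
import json
import pandas as pd

def load_json_bytes(data):
    """Parse raw bytes as JSON and return a DataFrame,
    or None if the content isn't actually valid JSON."""
    try:
        parsed = json.loads(data.decode('utf-8'))
    except (ValueError, UnicodeDecodeError):
        return None
    # A list of record dicts maps straight onto a DataFrame
    return pd.DataFrame(parsed)

good = b'[{"id": 1, "country": "sweden"}]'
bad = b'id,country\n1,sweden\n'   # CSV hiding behind a .json name

print(load_json_bytes(good))
print(load_json_bytes(bad))   # None
```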