Importing multiple files from Google Cloud Bucket to Datalab instance

I have a bucket set up on Google Cloud containing a few hundred JSON files, and I am trying to work with them in a Datalab instance running Python 3.

So, I can easily see them as objects using:

%gcs list --objects gs://<BUCKET_NAME>

Further, I can read in an individual file/object using:

import google.datalab.storage as storage
import pandas as pd
from io import BytesIO

myBucket = storage.Bucket('<BUCKET_NAME>')
data_csv = myBucket.object('<FILE_NAME>.json')

uri = data_csv.uri
%gcs read --object $uri --variable data

df = pd.read_csv(BytesIO(data))
df.head()

(FYI, I understand that my example reads a JSON file as a CSV, but let's ignore that. I'll cross that bridge on my own.)

What I can't figure out is how to loop through the bucket and pull all of the JSON files into pandas. How do I do that? Is that even the way I should be thinking about this, or is there a way to call the files in the bucket from pandas directly (since they're already treated as objects)?

As an extra bit: what if a file is saved as .json but doesn't actually have that structure? How can I handle that?

Essentially, I guess, I'm looking for the functionality of the blob package, but using cloud buckets + Datalab.

Any help is greatly appreciated.

This can be done using Bucket.objects, which returns an iterator over all matching objects. Specify a prefix to narrow the match, or leave it empty to match every file in the bucket. Here is an example with two files, countries1.csv and countries2.csv:

$ cat countries1.csv
id,country
1,sweden
2,spain

$ cat countries2.csv
id,country
3,italy
4,france

And used the following Datalab snippet:

import google.datalab.storage as storage
import pandas as pd
from io import BytesIO

myBucket = storage.Bucket('BUCKET_NAME')
object_list = myBucket.objects(prefix='countries')

df_list = []

for obj in object_list:  # "obj" avoids shadowing the built-in "object"
  %gcs read --object $obj.uri --variable data
  df_list.append(pd.read_csv(BytesIO(data)))

concatenated_df = pd.concat(df_list, ignore_index=True)
concatenated_df.head()

which will output the combined CSV:

    id  country
0   1   sweden
1   2   spain
2   3   italy
3   4   france

Note that this approach combines all of the CSV files into a single pandas DataFrame; depending on your use case, you might want to load them into separate ones. If you want to retrieve all files in the bucket, use this instead:

object_list = myBucket.objects()
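
For the part of the question about files that are named .json but don't actually contain JSON, here is a minimal sketch of one way to handle it. It is not Datalab-specific: it assumes you have already collected the raw bytes of each object (for example, the `data` variable filled by `%gcs read` in the loop above) into a plain dict keyed by object name. The helper names `select_names` and `parse_json_payloads`, and the sample payloads, are made up for illustration.

```python
import json
from fnmatch import fnmatch


def select_names(names, pattern="*.json"):
    """Return the object names that match a glob-style pattern."""
    return [name for name in names if fnmatch(name, pattern)]


def parse_json_payloads(payloads):
    """Split {name: raw bytes} into parsed documents and per-file errors."""
    parsed, failed = {}, {}
    for name, raw in payloads.items():
        try:
            parsed[name] = json.loads(raw.decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError) as exc:
            # Keep the reason so the bad file can be inspected later.
            failed[name] = str(exc)
    return parsed, failed


# Hypothetical payloads standing in for the bytes read from the bucket.
payloads = {
    "a.json": b'{"id": 1, "country": "sweden"}',
    "b.json": b"id,country\n2,spain\n",  # saved as .json but really CSV
    "notes.txt": b"not relevant",
}

wanted = select_names(payloads)
parsed, failed = parse_json_payloads({name: payloads[name] for name in wanted})
print(sorted(parsed), sorted(failed))
```

Files that fail to parse end up in `failed` instead of stopping the loop, so you can decide afterwards whether to skip them or read them some other way. If every parsed document is a flat record like the one above, something like `pd.DataFrame(list(parsed.values()))` should turn them into a single DataFrame.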

