Importing multiple files from Google Cloud Bucket to Datalab instance
I have a bucket set up on Google Cloud containing a few hundred json files and am trying to work with them in a datalab instance running python 3.
So, I can easily see them as objects using
%gcs list --objects gs://<BUCKET_NAME>
Further, I can read in an individual file/object using
import google.datalab.storage as storage
import pandas as pd
from io import BytesIO
myBucket = storage.Bucket('<BUCKET_NAME>')
data_csv = myBucket.object('<FILE_NAME>.json')
uri = data_csv.uri
%gcs read --object $uri --variable data
df = pd.read_csv(BytesIO(data))
df.head()
(FYI, I understand that my example is reading a json as a csv, but let's ignore that - I'll cross that bridge on my own)
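(As an aside, pandas can also parse JSON bytes directly via read_json, so the CSV detour isn't strictly necessary. A minimal local sketch; the byte string below is hypothetical stand-in data, not read from the bucket:)

```python
import pandas as pd
from io import BytesIO

# Hypothetical stand-in for the bytes %gcs read would place in `data`
data = b'[{"id": 1, "country": "sweden"}, {"id": 2, "country": "spain"}]'

# read_json parses the JSON payload directly instead of treating it as CSV
df = pd.read_json(BytesIO(data))
print(df.head())
```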
What I can't figure out is how to loop through the bucket and pull all of the json files into pandas... how do I do that? Is that the way I should be thinking of this - is there a way to call the files in the bucket from pandas directly (since they're already treated as objects)?
As an extra bit - what if a file is saved as a json, but isn't actually that structure? How can I handle that?
Essentially, I guess, I'm looking for the functionality of the blob package, but using cloud buckets + datalab.
Any help is greatly appreciated.
This can be done using Bucket.objects, which returns an iterator with all matching files. Specify a prefix or leave it empty to match all files in the bucket. I did an example with two files, countries1.csv and countries2.csv:
$ cat countries1.csv
id,country
1,sweden
2,spain
$ cat countries2.csv
id,country
3,italy
4,france
And used the following Datalab snippet:
import google.datalab.storage as storage
import pandas as pd
from io import BytesIO

myBucket = storage.Bucket('BUCKET_NAME')
object_list = myBucket.objects(prefix='countries')

df_list = []
for obj in object_list:  # "obj" instead of "object" to avoid shadowing the built-in
  uri = obj.uri
  %gcs read --object $uri --variable data
  df_list.append(pd.read_csv(BytesIO(data)))

concatenated_df = pd.concat(df_list, ignore_index=True)
concatenated_df.head()
which will output the combined csv:
id country
0 1 sweden
1 2 spain
2 3 italy
3 4 france
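The loop can also be factored into a small helper that is easy to test locally. Here read_bytes is a hypothetical placeholder for whatever fetch mechanism you use (the %gcs read magic in Datalab, or a download call from another storage client); the in-memory byte strings below stand in for real bucket objects:

```python
import pandas as pd
from io import BytesIO

def concat_csv_objects(objects, read_bytes):
    """Concatenate the CSV contents of an iterable of storage objects
    into one DataFrame. read_bytes(obj) must return raw bytes."""
    frames = [pd.read_csv(BytesIO(read_bytes(obj))) for obj in objects]
    return pd.concat(frames, ignore_index=True)

# Local usage example with in-memory stand-ins for bucket objects:
fake_objects = [b'id,country\n1,sweden\n2,spain\n',
                b'id,country\n3,italy\n4,france\n']
combined = concat_csv_objects(fake_objects, lambda obj: obj)
print(combined)
```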
Take into account that I combined all csv files into a single Pandas dataframe using this approach but you might want to load them into different ones depending on the use case. If you want to retrieve all files in the bucket just use this instead:
object_list = myBucket.objects()
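On the extra question about files that carry a .json extension but don't actually contain valid JSON: one approach is to attempt the parse and skip (or log) anything that fails. A minimal sketch, assuming the raw bytes are already in hand; load_json_bytes is a hypothetical helper name, and it expects a JSON list of records:

```python
import json
import pandas as pd

def load_json_bytes(data):
    """Parse raw bytes as JSON and return a DataFrame,
    or None if the content isn't actually valid JSON."""
    try:
        parsed = json.loads(data.decode('utf-8'))
    except (ValueError, UnicodeDecodeError):
        return None
    # A list of record dicts maps straight onto a DataFrame
    return pd.DataFrame(parsed)

good = b'[{"id": 1, "country": "sweden"}]'
bad = b'id,country\n1,sweden\n'   # CSV hiding behind a .json name

print(load_json_bytes(good))
print(load_json_bytes(bad))   # None
```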