How do I exclude columns when exporting app engine data
I'm planning to do some data mining on my Django app, which uses App Engine for storing data. However, one of my tables stores images in two of its columns, and because of that it is gigabytes in size, so it's far too slow to download every time I want to analyse new data. For data mining I only care about the plain-text columns in that table, so how do I exclude the image columns while exporting the data to a CSV file?
I'm aware that there is a "column_list" option for the CSV connector in bulkloader.yaml that you can use to include only certain columns when exporting data, but it looks like the bulkloader still downloads the entire table row, and only filters out the columns when converting App Engine's intermediate sqlite3 data file to CSV.
For reference, I'm downloading my data using the method described here: http://code.google.com/appengine/docs/python/tools/uploadingdata.html , but I'm open to other solutions, preferably ones that would let me automate this data export every few days.
You can't. The App Engine datastore API, and the underlying GQL, only support two kinds of SELECT queries: __key__ only, and all fields. There's no way of getting a subset of fields.
Kind of late here, but all I did in a similar situation was delete the unwanted property from the automatically generated bulkloader.yaml file.
Here is an example, based on the Google documentation, that excludes the "account" property from the CSV file. I use the same approach for things like blobs and it works fine there too:
property_map:
- property: __key__
  external_name: key
  export_transform: transform.key_id_or_name_as_string
# START DELETE
- property: account
  external_name: account
  # Type: Key Stats: 119 properties of this type in this kind.
  import_transform: transform.create_foreign_key('TODO: fill in Kind name')
  export_transform: transform.key_id_or_name_as_string
# END DELETE
- property: invite_nonce
  external_name: invite_nonce
  # Type: String Stats: 19 properties of this type in this kind.
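Since the question mentions wanting to automate the export every few days, the manual deletion above could also be scripted. Here is a minimal, pure-standard-library sketch that strips a named property stanza from the generated bulkloader.yaml text; it assumes the stanza layout shown above ("- property: name" followed by indented lines), which is how the generated file usually looks:

```python
# Sketch: delete one property stanza from a generated bulkloader.yaml,
# so a trimmed config can be rebuilt on a schedule.
# Assumes each stanza starts with "- property: <name>" on its own line.
def remove_property(yaml_text, name):
    """Return yaml_text with the named property stanza removed."""
    out, skipping = [], False
    for line in yaml_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("- property:"):
            # A new stanza begins; skip it only if it is the unwanted one.
            skipping = stripped == "- property: %s" % name
        if not skipping:
            out.append(line)
    return "\n".join(out) + "\n"

sample = (
    "property_map:\n"
    "- property: __key__\n"
    "  external_name: key\n"
    "- property: account\n"
    "  external_name: account\n"
    "- property: invite_nonce\n"
    "  external_name: invite_nonce\n"
)
print(remove_property(sample, "account"))
```

The same function could be run over the real bulkloader.yaml on disk before each scheduled export.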
As you've observed, the bulkloader downloads the entire record using remote_api, then outputs only the fields you care about to the CSV. If you want to download only selected fields, you'll have to write your own code to do this on the server side - possibly by using the new Files API in a mapreduce to write a file you can then download.
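If a server-side mapreduce is more than you need, a simpler workaround consistent with this answer is to let the bulkloader download everything once, then strip the heavy columns locally before analysis. A minimal sketch using only the standard library (the column names here are hypothetical, not from the question):

```python
# Sketch: copy a downloaded CSV while dropping heavy (e.g. image) columns.
# Column names are hypothetical examples.
import csv
import io

EXCLUDE = {"image_small", "image_large"}  # hypothetical blob columns

def drop_columns(src, dst, exclude=EXCLUDE):
    """Copy CSV rows from src to dst, omitting the excluded columns."""
    reader = csv.DictReader(src)
    kept = [c for c in reader.fieldnames if c not in exclude]
    writer = csv.DictWriter(dst, fieldnames=kept, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)

# Example with in-memory data (with real files, open() works the same way):
src = io.StringIO("key,title,image_small\n1,hello,BLOB\n")
dst = io.StringIO()
drop_columns(src, dst)
print(dst.getvalue())
```

This doesn't avoid the slow full download, but it keeps the analysis files small and is easy to put in a cron job alongside the export.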