

Configure GSA to only crawl metadata of files, not content

In GSA (Google Search Appliance), I am looking into how I can have it crawl only the metadata (name, type, size, last modified, etc.) and not the content of a file. While I realize this can affect the usefulness of the results, I have my requirements.

It comes down to this: the metadata of the file is public, but the content of the file is restricted. While this looks like a security-trimming question, it goes slightly further, because I don't want the GSA to store ANY information about the file's content in the index. Assume the GSA server is not trusted to hold the content. This applies only to a small subset of the whole dataset.

Any ideas on how I could configure GSA and connectors to only crawl the metadata and not the content?

I'm not sure you can do this when crawling files (on a file share or on a website). You can, however, do it by crawling a DB whose columns contain the metadata, or by developing a connector that only creates a feed providing the metadata.

This will work if you have the metadata stored somewhere separately, rather than together with the content in the file.
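As a rough illustration of the connector idea, the sketch below builds a feed document that carries only file-system metadata and never opens the files. It is a minimal sketch under assumptions: the element names (`gsafeed`, `header`, `datasource`, `feedtype`, `group`, `record`, `metadata`, `meta`, `content`) follow the GSA feed XML format as I remember it and should be checked against the feeds documentation for your version; the function name `build_metadata_feed`, the datasource name, the `file://fileserver` base URL, and the choice to send the file name as placeholder content are all hypothetical.

```python
import mimetypes
import os
import time
import xml.etree.ElementTree as ET


def build_metadata_feed(paths, datasource="file_metadata",
                        base_url="file://fileserver"):
    """Return feed XML whose records hold file metadata only.

    The files are never opened; only os.stat() results are used, so no
    file content can end up in the feed. datasource and base_url are
    hypothetical defaults for illustration.
    """
    feed = ET.Element("gsafeed")
    header = ET.SubElement(feed, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = "incremental"
    group = ET.SubElement(feed, "group")

    for path in paths:
        info = os.stat(path)
        name = os.path.basename(path)
        file_type = mimetypes.guess_type(name)[0] or "application/octet-stream"
        modified = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(info.st_mtime))

        record = ET.SubElement(group, "record",
                               url=f"{base_url}/{name}", mimetype="text/plain")
        metadata = ET.SubElement(record, "metadata")
        fields = {
            "name": name,
            "type": file_type,
            "size": str(info.st_size),
            "last_modified": modified,
        }
        for key, value in fields.items():
            ET.SubElement(metadata, "meta", name=key, content=value)

        # Assumption: a placeholder body (the file name) stands in for
        # the real content, so the appliance has nothing else to index.
        ET.SubElement(record, "content").text = name

    return ET.tostring(feed, encoding="unicode")
```

The resulting XML would still have to be pushed to the appliance's feed interface, and the normal crawl patterns would need to exclude these files so that the crawler does not fetch them directly; both steps are outside this sketch.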

Another option is to customise your front end so that it does not provide a link to the document, and just configure the metadata to be displayed in the result. (Use 1 in the FrontEnd to automatically display the metadata fields.) You will also need to add the 'getfields' parameter to the search query to include the relevant metadata fields.

This works for a DB scenario. I have not tested it with file metadata, but it should work.
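To make the 'getfields' suggestion concrete, here is a minimal sketch of how such a search request could be assembled. Only `getfields` comes from the answer above; the host name, the collection and front-end names, the `xml_no_dtd` output choice, and the helper `build_search_url` are assumptions for illustration. Joining several field names with a period is how I recall the parameter working, so verify it against the search protocol reference.

```python
from urllib.parse import urlencode


def build_search_url(host, query, meta_fields,
                     collection="default_collection",
                     frontend="default_frontend"):
    """Build a search URL that asks for specific meta fields per result.

    host, collection and frontend are hypothetical placeholders; replace
    them with the values configured on your appliance.
    """
    params = {
        "q": query,
        "site": collection,
        "client": frontend,
        "output": "xml_no_dtd",
        # Assumption: multiple meta field names are separated by periods.
        "getfields": ".".join(meta_fields),
    }
    return f"http://{host}/search?{urlencode(params)}"


url = build_search_url("gsa.example.com", "annual report", ["name", "size"])
```

The front end would then render the returned meta fields instead of a link to the document.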

Duncan de Klerk Conor
