
Most used words on website using Solr etc

I want to generate a list of the most-used words on a website. The application should crawl the content of the site. Does anyone know whether this can be done with Solr or some other technique?

The list can be PHP objects, a PHP array, or an XML file.

You might want to check the TermsComponent: http://wiki.apache.org/solr/TermsComponent

Example -

http://host:port/solr/core/terms?terms.fl=title&terms.sort=count

This returns the terms for the field title, ordered by count (the default). Only the first 10 terms are returned unless you raise terms.limit.

terms.fl - The field you want to get the terms from.
terms.sort={count|index} - If count, sorts the terms by term frequency (highest count first). If index, returns the terms in index order. The default is to sort by count.

This gives the indexed terms, which have already gone through the tokenizer and filters. If you need the terms as is, you can change the field analysis (probably by using the field type string).
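
The request above can also be built and parsed from a script. The following is a minimal Python sketch, not a definitive client: the helper names build_terms_url and parse_terms are made up for illustration, the base URL is a placeholder, and it assumes the response was requested with wt=json and uses Solr's default flat layout, in which terms and counts alternate in one list.

```python
import json
from urllib.parse import urlencode


def build_terms_url(base_url, field, limit=10):
    """Build a TermsComponent request URL for one field (hypothetical helper)."""
    params = {
        "terms.fl": field,       # field to read the terms from
        "terms.sort": "count",   # highest term frequency first
        "terms.limit": limit,    # how many terms to return
        "wt": "json",            # ask for a JSON response
    }
    return f"{base_url}/terms?{urlencode(params)}"


def parse_terms(response_text, field):
    """Turn the flat [term, count, term, count, ...] list into (term, count) pairs.

    Assumes the default flat JSON layout; other json.nl settings differ.
    """
    flat = json.loads(response_text)["terms"][field]
    return list(zip(flat[0::2], flat[1::2]))


# Usage (needs a running Solr instance, so it is left commented out):
# from urllib.request import urlopen
# url = build_terms_url("http://host:port/solr/core", "title", limit=50)
# with urlopen(url) as resp:
#     print(parse_terms(resp.read().decode("utf-8"), "title"))
```

The resulting list of pairs is easy to hand on as JSON or XML to a PHP front end.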

Solr is a search engine. It doesn't crawl websites. You need to build a simple website crawler using Scrapy (http://scrapy.org/) or a similar tool. Design a Solr schema to hold the data, crawl the website, and send record updates to Solr. Your specific question would probably be answered by the SCHEMA BROWSER option on the Solr admin menu in the web admin interface. Click on DYNAMIC FIELDS, select the field you are interested in, and see the top 10 terms. Change the number to 50, press ENTER, and you get the top 50.
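
If the site is small, the counting step can also be done without a search engine once the pages have been fetched. The sketch below is only an illustration of that idea using the Python standard library; it is not Scrapy or Solr code. The names TextExtractor and count_words are hypothetical, the pages are assumed to be already downloaded as HTML strings, and the tokenization (lowercasing plus a simple word regex) is a deliberate simplification of what a Solr analyzer would do.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def count_words(html_pages):
    """Count word occurrences across an iterable of HTML strings."""
    counts = Counter()
    for html in html_pages:
        parser = TextExtractor()
        parser.feed(html)
        parser.close()
        text = " ".join(parser.chunks).lower()
        counts.update(re.findall(r"\w+", text))
    return counts


# Usage: pages would come from the crawler; these two are inline samples.
pages = [
    "<html><body><p>Solr is a search engine</p></body></html>",
    "<p>Solr search</p>",
]
top_words = count_words(pages).most_common(50)
```

most_common(50) returns (word, count) pairs with the highest counts first, which matches the "top 50" view described above.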
