
StormCrawler not crawling only the main content of a page

By default, the crawler crawls the whole page, including the header and footer, which are common across all pages. Our requirement is that the crawler should only crawl the main content of the page (which is under div#body-wrapper).

We achieved this using parsefilters.json.

{
  "class": "com.digitalpebble.stormcrawler.parse.filter.ContentFilter",
  "name": "ContentFilter",
  "params": {
    "pattern": "//DIV[@id=\"body-wrapper\"]",
    "pattern2": "//DIV[@itemprop=\"articleBody\"]",
    "pattern3": "//ARTICLE"
  }
}

After updating parsefilters.json, it only crawls that div, but the extracted content includes all the whitespace, newlines, JS, CSS code, etc., as shown in the excerpt below (long runs of whitespace and most of the inline CSS are elided with [...]).

"content": "\n\t\t\t\n\n\t\t\t\t\n\t\t\t\t\t Growing Your Business............. \n\n\n\n\n\n\t\n\t\t\n\t\t\t\n\n\n [...] \n\t\n\n\t\n\t\t\n\t\n.landing-page-indicators { \n\ttop:inherit !important;\n}\n\n\t [...]

But when the crawler was crawling the full page (default configuration), it was not adding whitespace, newlines, JS, CSS code, etc.

How can we crawl only part of the page, without the whitespace, newlines, JS, CSS, etc.?

Please kindly advise.

Thank you.

The ContentFilter has been deprecated since StormCrawler 1.13 and replaced with the TextExtractor.

From the release notes,

[...] the main new feature is the addition of the TextExtractor (#678) for the JsoupParserBolt. Unlike the ContentParseFilter, which it replaces, it is configured from the main configuration and is not a ParseFilter as it operates directly on the objects generated by Jsoup. The TextExtractor allows restricting the text to specific elements to avoid boilerplate code and navigation elements but provides a far cleaner text content compared to the ContentParseFilter which merges some tokens. The TextExtractor can also be used to define exclusion zones which will be applied either to the restricted zones or the whole document if no such zone were defined or found. This is useful for instance to remove SCRIPT or STYLE elements.

The configuration generated by the archetypes uses the TextExtractor with a configuration similar to what the ContentFilter used to do.
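
As a rough sketch of what that could look like for the page in the question, the fragment below restricts text extraction to the same three zones as the ContentFilter above and excludes STYLE and SCRIPT elements. The key names and selector syntax follow the crawler-conf.yaml produced by the archetype as I recall it, so treat them as an assumption and check them against the file generated for your StormCrawler version; the body-wrapper id is taken from the question.

```yaml
config:
  # Zones to keep: the text is restricted to the first pattern that matches.
  # Selectors are in Jsoup style rather than the XPath used by ContentFilter.
  textextractor.include.pattern:
    - DIV[id="body-wrapper"]
    - DIV[itemprop="articleBody"]
    - ARTICLE

  # Exclusion zones: dropped from the included zone,
  # or from the whole document if no include pattern matched.
  textextractor.exclude.tags:
    - STYLE
    - SCRIPT
```

With this in the main configuration, the ContentFilter entry would no longer be needed in parsefilters.json.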
