Is it possible to find nodes with the same DOM structure?
I have crawled a lot of HTML pages (with similar content) from many sites using Scrapy, but the DOM structures differ.
For example, one of the sites uses the following structure:
<div class="post">
    <section class='content'>
        Content1
    </section>
    <section class="panel">
    </section>
</div>
<div class="post">
    <section class='content'>
        Content2
    </section>
    <section class="panel">
    </section>
</div>
And I want to extract the data Content1 and Content2.
While another site may use a structure like this:
<article class="entry">
    <section class='title'>
        Content3
    </section>
</article>
<article class="entry">
    <section class='title'>
        Content4
    </section>
</article>
And I want to extract the data Content3 and Content4.
The easiest solution is to mark the required data's XPath one by one for every site, but that would be a tedious job.
So I wonder if the structure can be extracted automatically. In fact, I just need to locate the repeated root node (div.post and article.entry in the example above); then I can extract the data with certain rules.
Is this possible?
BTW, I am not exactly sure of the name for this kind of algorithm, so the tags on this post may be wrong; feel free to modify them if so.
You have to know at least some common patterns to be able to formulate deterministic extraction rules. The solution below is very primitive and by no means optimal, but it might help you:
# -*- coding: utf-8 -*-
import re

import bs4
from bs4 import element
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        min_occurs = 5
        max_occurs = 1000
        min_depth = 7
        max_depth = 7
        pattern = re.compile('^/html/body/.*/(span|div)$')
        extract_content = lambda e: e.css('::text').extract_first()
        #extract_content = lambda e: ' '.join(e.css('*::text').extract())

        doc = bs4.BeautifulSoup(response.body, 'html.parser')
        paths = {}
        self._walk(doc, '', paths)
        paths = self._filter(paths, pattern, min_depth, max_depth,
                             min_occurs, max_occurs)

        for path in paths.keys():
            for e in response.xpath(path):
                yield {'content': extract_content(e)}

    def _walk(self, doc, parent, paths):
        # Count how many times each element path occurs in the document.
        for tag in doc.children:
            if isinstance(tag, element.Tag):
                path = parent + '/' + tag.name
                paths[path] = paths.get(path, 0) + 1
                self._walk(tag, path, paths)

    def _filter(self, paths, pattern, min_depth, max_depth, min_occurs, max_occurs):
        # Keep only the paths that match the pattern and fall within the
        # depth and occurrence bounds.
        return dict((path, count) for path, count in paths.items()
                    if pattern.match(path) and
                       min_depth <= path.count('/') <= max_depth and
                       min_occurs <= count <= max_occurs)
It works like this:
For building the dictionary of paths, I just walk through the document using BeautifulSoup and count the occurrences of each element path. This can later be used in the filtering step to keep only the most repeated paths.
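To show the path-counting idea in isolation, here is a standalone sketch that does the same counting using only the standard library's html.parser (the spider itself uses BeautifulSoup; the PathCounter class and the toy HTML are made up for illustration):

```python
from html.parser import HTMLParser

class PathCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []   # names of currently open tags, root to current
        self.paths = {}   # element path -> occurrence count

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        path = '/' + '/'.join(self.stack)
        self.paths[path] = self.paths.get(path, 0) + 1

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

parser = PathCounter()
parser.feed('<html><body>'
            '<div class="post"><section>Content1</section></div>'
            '<div class="post"><section>Content2</section></div>'
            '</body></html>')
# parser.paths == {'/html': 1, '/html/body': 1,
#                  '/html/body/div': 2, '/html/body/div/section': 2}
```

The repeated container shows up as the paths with count 2, which is exactly what the later filtering step exploits.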
Next I filter out the paths based on some basic rules. For a path to be kept, it has to:
occur at least min_occurs and at most max_occurs times;
have a depth of at least min_depth and at most max_depth;
match the pattern.
Other rules can be added in similar fashion.
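The filtering rules boil down to a single dictionary comprehension. Here is a self-contained sketch applied to a small hand-built dictionary of path counts, with the pattern and thresholds chosen for the toy data rather than taken from the spider's defaults:

```python
import re

# Hand-built path counts, as the walking step might produce them.
paths = {'/html': 1, '/html/body': 1,
         '/html/body/div': 2, '/html/body/div/section': 2}

# Pattern adapted for this toy data (the spider uses span|div).
pattern = re.compile(r'^/html/body/.*/(section|div)$')

kept = {p: n for p, n in paths.items()
        if pattern.match(p)
        and 3 <= p.count('/') <= 5    # depth bounds
        and 2 <= n <= 1000}           # occurrence bounds
# kept == {'/html/body/div/section': 2}
```

Only the repeated, sufficiently deep path survives; the shallow one-off paths ('/html', '/html/body') are discarded.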
The last part loops through the paths that are left after filtering and extracts content from the elements using some common logic defined by extract_content.
If your web pages are rather simple and you can infer such rules, it might work. Otherwise, you would have to look at some kind of machine learning task, I guess.