
Is it possible to find the nodes with the same DOM structure?

I have crawled a lot of HTML pages (with similar content) from many sites using Scrapy, but the DOM structures differ.

For example, one of the sites uses the following structure:

<div class="post">
    <section class='content'>
        Content1
    </section>

    <section class="panel">
    </section>
</div>
<div class="post">
    <section class='content'>
        Content2
    </section>

    <section class="panel">
    </section>
</div>

And I want to extract the data Content1 and Content2.

While another site may use a structure like this:

<article class="entry">
    <section class='title'>
        Content3
    </section>
</article>
<article class="entry">
    <section class='title'>
        Content4
    </section>
</article>

And I want to extract the data Content3 and Content4.

The easiest solution would be to mark the required data's XPath one by one for every site, but that would be a tedious job.

So I wonder if the structure can be extracted automatically. In fact, I just need to locate the repeated root node (div.post and article.entry in the examples above); then I can extract the data with a few fixed rules.

Is this possible?

BTW, I am not exactly sure of the name of this kind of algorithm, so the tags on this post may be wrong; feel free to fix them if so.

You have to know at least some common patterns to be able to formulate deterministic extraction rules. The solution below is very primitive and by no means optimal, but it might help you:

# -*- coding: utf-8 -*-
import re

import bs4
from bs4 import element
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Tunable heuristics: how often a path must repeat and how deep it may be.
        min_occurs = 5
        max_occurs = 1000
        min_depth = 7
        max_depth = 7
        pattern = re.compile('^/html/body/.*/(span|div)$')
        extract_content = lambda e: e.css('::text').extract_first()
        #extract_content = lambda e: ' '.join(e.css('*::text').extract())

        doc = bs4.BeautifulSoup(response.body, 'html.parser')

        # 1. Count how many times each element path occurs in the document.
        paths = {}
        self._walk(doc, '', paths)
        # 2. Keep only the paths that satisfy the general rules above.
        paths = self._filter(paths, pattern, min_depth, max_depth,
                             min_occurs, max_occurs)

        # 3. Extract content from every element matched by a surviving path.
        for path in paths.keys():
            for e in response.xpath(path):
                yield {'content': extract_content(e)}

    def _walk(self, doc, parent, paths):
        # Depth-first traversal recording each element path and its count.
        for tag in doc.children:
            if isinstance(tag, element.Tag):
                path = parent + '/' + tag.name
                paths[path] = paths.get(path, 0) + 1
                self._walk(tag, path, paths)

    def _filter(self, paths, pattern, min_depth, max_depth, min_occurs, max_occurs):
        # A path survives if it matches the pattern, has an allowed depth,
        # and occurs between min_occurs and max_occurs times.
        return dict((path, count) for path, count in paths.items()
                        if pattern.match(path) and
                                min_depth <= path.count('/') <= max_depth and
                                min_occurs <= count <= max_occurs)
It works like this:

  1. Explore the HTML document and build a dictionary of all element paths in the document together with their occurrence counts.
  2. Filter those paths based on the general rules you infer from your web pages.
  3. Extract content from the filtered paths using some common extraction logic.

For building the dictionary of paths I just walk through the document using BeautifulSoup and count the occurrences of each element path. This is later used in the filtering step to keep only the most-repeated paths.
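The path-counting step can be illustrated on a small, well-formed snippet; here the standard library's ElementTree stands in for BeautifulSoup purely to keep the sketch dependency-free:

```python
import xml.etree.ElementTree as ET

def walk(elem, parent, paths):
    # Record each element path and count its occurrences, depth-first.
    for child in elem:
        path = parent + '/' + child.tag
        paths[path] = paths.get(path, 0) + 1
        walk(child, path, paths)

html = ('<html><body>'
        '<div><section>Content1</section></div>'
        '<div><section>Content2</section></div>'
        '</body></html>')
root = ET.fromstring(html)
paths = {'/html': 1}
walk(root, '/html', paths)
print(paths)
# {'/html': 1, '/html/body': 1, '/html/body/div': 2, '/html/body/div/section': 2}
```

The repeated paths (count 2) are exactly the candidate "root nodes" the question asks about.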

Next I filter the paths based on some basic rules. For a path to be kept, it has to:

  • Occur at least min_occurs and at most max_occurs times.
  • Have a depth of at least min_depth and at most max_depth.
  • Match the pattern.

Other rules can be added in a similar fashion.
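As a self-contained sketch of those filtering rules (the depth and occurrence bounds here are made up for the example; path depth is simply the number of slashes):

```python
import re

def filter_paths(paths, pattern, min_depth, max_depth, min_occurs, max_occurs):
    # Keep a path only if it matches the pattern, lies within the depth
    # bounds, and repeats between min_occurs and max_occurs times.
    return {path: count for path, count in paths.items()
            if pattern.match(path)
            and min_depth <= path.count('/') <= max_depth
            and min_occurs <= count <= max_occurs}

counts = {
    '/html/body': 1,
    '/html/body/div': 2,
    '/html/body/div/section': 2,
    '/html/body/script': 7,
}
kept = filter_paths(counts, re.compile(r'^/html/body/.*/(section|div)$'),
                    min_depth=4, max_depth=4, min_occurs=2, max_occurs=1000)
print(kept)
# {'/html/body/div/section': 2}
```

Note how the occurrence bound drops the one-off `/html/body`, while the pattern and depth bounds drop `/html/body/script` even though it repeats.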

The last part loops through the paths left after filtering and extracts content from the matched elements using the common logic defined in extract_content.
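A minimal illustration of that final step, again using ElementTree's limited XPath support in place of a Scrapy response (a Scrapy response.xpath() call would accept the absolute path directly):

```python
import xml.etree.ElementTree as ET

html = ('<html><body>'
        '<div><section>Content1</section></div>'
        '<div><section>Content2</section></div>'
        '</body></html>')
root = ET.fromstring(html)

# A surviving absolute path such as '/html/body/div/section' becomes
# 'body/div/section' relative to the <html> root element.
path = '/html/body/div/section'
relative = path[len('/html/'):]
contents = [e.text for e in root.findall(relative)]
print(contents)
# ['Content1', 'Content2']
```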

If your web pages are rather simple and you can infer such rules, this might work. Otherwise, you would have to look at some kind of machine-learning approach, I guess.
