Is it possible to find nodes with the same DOM structure?
I have crawled a lot of HTML pages (with similar content) from many sites using Scrapy, but the DOM structures differ.
For example, one of the sites uses the following structure:
<div class="post">
    <section class='content'>
        Content1
    </section>
    <section class="panel">
    </section>
</div>
<div class="post">
    <section class='content'>
        Content2
    </section>
    <section class="panel">
    </section>
</div>
And I want to extract the data Content1 and Content2.
While another site may use a structure like this:
<article class="entry">
    <section class='title'>
        Content3
    </section>
</article>
<article class="entry">
    <section class='title'>
        Content4
    </section>
</article>
And I want to extract the data Content3 and Content4.
The easiest solution is to mark the required data's XPath one by one for every site, but that would be a tedious job.
So I wonder if the structure can be extracted automatically. In fact, I just need to locate the repeated root node (div.post and article.entry in the example above); then I can extract the data with certain rules.
Is this possible?
BTW, I am not exactly sure of the name for this kind of algorithm, so the tags on this post may be wrong; feel free to modify them if so.
You have to know at least some common patterns to be able to formulate deterministic extraction rules. The solution below is very primitive and by no means optimal, but it might help you:
# -*- coding: utf-8 -*-
import re

import bs4
from bs4 import element
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        min_occurs = 5
        max_occurs = 1000
        min_depth = 7
        max_depth = 7
        pattern = re.compile('^/html/body/.*/(span|div)$')
        extract_content = lambda e: e.css('::text').extract_first()
        #extract_content = lambda e: ' '.join(e.css('*::text').extract())

        doc = bs4.BeautifulSoup(response.body, 'html.parser')
        paths = {}
        self._walk(doc, '', paths)
        paths = self._filter(paths, pattern, min_depth, max_depth,
                             min_occurs, max_occurs)

        for path in paths.keys():
            for e in response.xpath(path):
                yield {'content': extract_content(e)}

    def _walk(self, doc, parent, paths):
        # Count how many times each element path occurs in the document.
        for tag in doc.children:
            if isinstance(tag, element.Tag):
                path = parent + '/' + tag.name
                paths[path] = paths.get(path, 0) + 1
                self._walk(tag, path, paths)

    def _filter(self, paths, pattern, min_depth, max_depth, min_occurs, max_occurs):
        # Keep only the paths that match the pattern and fall within the
        # depth and occurrence bounds.
        return dict((path, count) for path, count in paths.items()
                    if pattern.match(path) and
                       min_depth <= path.count('/') <= max_depth and
                       min_occurs <= count <= max_occurs)
It works like this:
For building the dictionary of paths, I just walk through the document using BeautifulSoup and count the occurrences of each element path. This can later be used in the filtering step to keep only the most repeated paths.
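To show the path-counting idea in isolation, here is a standalone sketch that does the same counting using only the standard library's html.parser (the spider itself uses BeautifulSoup; the PathCounter class and the toy HTML are made up for illustration):

```python
from html.parser import HTMLParser

class PathCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []   # names of currently open tags, root to current
        self.paths = {}   # element path -> occurrence count

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        path = '/' + '/'.join(self.stack)
        self.paths[path] = self.paths.get(path, 0) + 1

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

parser = PathCounter()
parser.feed('<html><body>'
            '<div class="post"><section>Content1</section></div>'
            '<div class="post"><section>Content2</section></div>'
            '</body></html>')
# parser.paths == {'/html': 1, '/html/body': 1,
#                  '/html/body/div': 2, '/html/body/div/section': 2}
```

The repeated container shows up as the paths with count 2, which is exactly what the later filtering step exploits.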
Next I filter out the paths based on some basic rules. For a path to be kept, it has to:
occur at least min_occurs and at most max_occurs times;
have a depth of at least min_depth and at most max_depth;
match the pattern.
Other rules can be added in similar fashion.
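The filtering rules boil down to a single dictionary comprehension. Here is a self-contained sketch applied to a small hand-built dictionary of path counts, with the pattern and thresholds chosen for the toy data rather than taken from the spider's defaults:

```python
import re

# Hand-built path counts, as the walking step might produce them.
paths = {'/html': 1, '/html/body': 1,
         '/html/body/div': 2, '/html/body/div/section': 2}

# Pattern adapted for this toy data (the spider uses span|div).
pattern = re.compile(r'^/html/body/.*/(section|div)$')

kept = {p: n for p, n in paths.items()
        if pattern.match(p)
        and 3 <= p.count('/') <= 5    # depth bounds
        and 2 <= n <= 1000}           # occurrence bounds
# kept == {'/html/body/div/section': 2}
```

Only the repeated, sufficiently deep path survives; the shallow one-off paths ('/html', '/html/body') are discarded.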
The last part loops through the paths that are left after filtering and extracts content from the elements using some common logic defined by extract_content.
If your web pages are rather simple and you can infer such rules, it might work. Otherwise, you would have to look at some kind of machine learning task, I guess.