Separating an HTML file by keyword for scraping

Question

I am programming in Python with Scrapy and have a huge HTML file with a structure similar to the one shown below:

<span>keyword</span>
<title>Title 1</title>
<span>Date 1</span>
<div>Content 1</div>

<span>keyword</span>
<title>Title 2</title>
<span>Date 2</span>
<div>Content 2</div>

...

<span>keyword</span>
<title>Title N</title>
<span>Date N</span>
<div>Content N</div>

My goal is to get the title, date, and div content for each section. However, the sections are not wrapped in separate div or other container elements; they simply follow one another up to the N-th section.

I could collect all the titles, dates, and divs as three separate lists, each expected to have length N. However, that makes debugging difficult: if N reaches 10,000 and len(title) == len(date) == len(div) is False, it will be hard to find where things went wrong (e.g. some titles were placed in a different tag instead of <title>).

One thing I noticed is the keyword located between sections. With the help of that keyword, is it possible to split the entire HTML into N parts and then iterate over them to get item[i] = ["Title_i", "Date_i", "DIV_i"] for each section?

This way, missing data would show up as, for example, item[i] = ["", "Date_i", "DIV_i"] and would be much easier to locate.

Answer 1

Carl, you could try splitting the HTML file content into smaller parts by keyword.

This requires knowing the full set (or dictionary) of all possible keywords.
Some keywords might also appear inside a Content part, so it is better not to split on the bare keyword value or on a plain keyword expression, but on the most distinctive pattern available, such as keyword\\s*<title> and similar keyword expressions. That way the parts will most likely be split correctly.
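
The idea above can be sketched with the standard-library re module. This is only a minimal illustration, not a definitive implementation: SAMPLE_HTML, SECTION_DELIMITER, first_match, split_sections, and parse_section are hypothetical names, and the sketch assumes the keyword always sits in its own <span>, as in the question's sample. The second sample section deliberately puts its title in the wrong tag to show how a missing field surfaces.

```python
import re

# Hypothetical sample mirroring the question's structure. Section 2 has its
# title in <h1> instead of <title>, and its content repeats the keyword.
SAMPLE_HTML = """
<span>keyword</span>
<title>Title 1</title>
<span>Date 1</span>
<div>Content 1</div>

<span>keyword</span>
<h1>Title 2</h1>
<span>Date 2</span>
<div>Content 2 mentions keyword too</div>
"""

# Assumption: the section marker is the keyword wrapped in its own <span>.
# A bare "keyword" inside a content div does not match this pattern.
SECTION_DELIMITER = re.compile(r"<span>\s*keyword\s*</span>")


def first_match(pattern, text):
    """Return the first captured group, stripped, or "" if nothing matches."""
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else ""


def split_sections(html):
    """Split the document on the delimiter; drop the text before the first one."""
    return SECTION_DELIMITER.split(html)[1:]


def parse_section(chunk):
    """Build [title, date, content] for one chunk, using "" for missing fields."""
    title = first_match(r"<title>(.*?)</title>", chunk)
    date = first_match(r"<span>(.*?)</span>", chunk)
    content = first_match(r"<div>(.*?)</div>", chunk)
    return [title, date, content]


items = [parse_section(chunk) for chunk in split_sections(SAMPLE_HTML)]

# Report the sections that need attention instead of comparing list lengths.
for index, item in enumerate(items, start=1):
    if "" in item:
        print(f"section {index} has missing fields: {item}")
```

Splitting on the span-wrapped keyword rather than on keyword\\s*<title> is a deliberate trade-off in this sketch: the stricter pattern is safer against keywords repeated in the content, but a section whose title is missing or mis-tagged would then be merged into the previous chunk instead of showing up with an empty title. The non-greedy <div> pattern also stops at the first closing tag, so nested divs would need a real parser; each chunk could be handed to scrapy.Selector(text=chunk) instead of the regexes.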

Answer 1: score 0, accepted, 2016-10-06 06:27:36
