Separating HTML file with keyword for scraping
I am programming in Python with Scrapy and have a huge HTML file with a structure similar to the one shown below:
<span>keyword</span>
<title>Title 1</title>
<span>Date 1</span>
<div>Content 1</div>
<span>keyword</span>
<title>Title 2</title>
<span>Date 2</span>
<div>Content 2</div>
...
<span>keyword</span>
<title>Title N</title>
<span>Date N</span>
<div>Content N</div>
My goal is to get the title, the date, and the contents of the div for each section, but the sections themselves are not wrapped in separate div or other elements; they simply follow one another until the N-th section.
While I could collect title[1:N], date[1:N], and div[1:N] as three lists with len() = N, doing so makes debugging hard: if N goes to 10,000 and len(title) == len(date) == len(div) is False, it will be hard to find where things went wrong (e.g. some titles are put in <strong> instead of <title>).
One thing I noticed is the keyword located between each section. With the help of that keyword, is it possible to separate the entire HTML into N parts and get item[i] = ["Title_i", "Date_i", "DIV_i"] for each section through iteration?
This way, missing data will be represented as item[i] = ["", Date_i, Div_i] and will be much easier to locate.
Carl, you may try to split the HTML file content into sections by keyword. However, the keyword may also be repeated inside the Content part, so you'd better split neither on the bare keyword value nor on the <span>keyword</span> expression, but on the more distinctive <span>keyword</span>\s*<title> and <span>keyword</span><strong> expressions. That way the parts are very likely to be split correctly.
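The splitting idea above can be sketched roughly as follows. This is only a minimal illustration using Python's standard re module, not Scrapy-specific code. The names SECTION_START, first_match, split_sections, and the sample string are hypothetical, and the sketch assumes each section has at most one title, that the date is the first <span> after the delimiter, and that the content <div> contains no nested <div>.

```python
import re

# Split point: the keyword span only when it is directly followed by a
# <title> or <strong> tag, so a stray keyword inside a <div> is ignored.
SECTION_START = re.compile(r"<span>keyword</span>\s*(?=<title>|<strong>)")


def first_match(pattern, text):
    """Return the first captured group of pattern in text, or "" if absent."""
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else ""


def split_sections(html):
    """Return one [title, date, content] list per keyword-delimited section."""
    # Whatever precedes the first delimiter is not a section, so drop it.
    chunks = SECTION_START.split(html)[1:]
    items = []
    for chunk in chunks:
        title = first_match(r"<title>(.*?)</title>", chunk)
        date = first_match(r"<span>(.*?)</span>", chunk)
        content = first_match(r"<div>(.*?)</div>", chunk)
        items.append([title, date, content])
    return items


# Hypothetical sample: section 2 puts its title in <strong>, and the
# content of section 1 repeats the keyword.
sample = """
<span>keyword</span>
<title>Title 1</title>
<span>Date 1</span>
<div>Content 1 mentions keyword and <span>keyword</span> too</div>
<span>keyword</span>
<strong>Title 2</strong>
<span>Date 2</span>
<div>Content 2</div>
"""

items = split_sections(sample)
for index, item in enumerate(items):
    if "" in item:
        print(f"section {index} has a missing field: {item}")
```

With this layout, a section whose title sits in <strong> still becomes its own item, with an empty string in the title slot, so the index of the faulty section can be read directly from the output. If the date <span> itself were missing, the date pattern could pick up a <span> from inside the content, so a real page may need stricter patterns or an HTML parser applied to each chunk.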