
Separating an HTML file by keyword for scraping

I am programming in Python with Scrapy and have a huge HTML file with a structure similar to the one shown below:

<span>keyword</span>
<title>Title 1</title>
<span>Date 1</span>
<div>Content 1</div>

<span>keyword</span>
<title>Title 2</title>
<span>Date 2</span>
<div>Content 2</div>

...

<span>keyword</span>
<title>Title N</title>
<span>Date N</span>
<div>Content N</div>

My goal is to get the title, date, and the content inside the div for each section, but the sections themselves are not wrapped in separate div s or any other elements; they simply follow one after another until the N-th section.

While I could extract all the title[1:N], date[1:N], and div[1:N] as three lists of length N, doing so makes debugging hard: if N grows to 10,000 and len(title) == len(date) == len(div) turns out to be False, it is difficult to find where things went wrong (e.g. some titles are placed in <strong> instead of <title>).
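
For context, a minimal sketch of that flat-list approach, assuming a Scrapy response object and purely illustrative XPath expressions (in the real page the date selector would also have to exclude the keyword span s):

    # Flat-list approach: extract each field independently, then zip them together.
    # If a single title/date/div is missing, the lists silently go out of sync.
    titles = response.xpath("//title/text()").getall()
    dates = response.xpath("//span[text()!='keyword']/text()").getall()
    contents = response.xpath("//div/text()").getall()

    if not (len(titles) == len(dates) == len(contents)):
        # With N around 10,000 there is no easy way to tell which section broke the alignment.
        raise ValueError(f"misaligned lists: {len(titles)} titles, "
                         f"{len(dates)} dates, {len(contents)} divs")

    items = list(zip(titles, dates, contents))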

One thing I noticed is the keyword located between the sections. With the help of that keyword, is it possible to split the entire HTML into N parts, and then get item[i] = ["Title_i", "Date_i", "DIV_i"] for each section by iterating over them?

This way missing data would be represented as item[i] = ["", Date_i, Div_i] and would be much easier to locate.

Carl, you could try splitting the HTML file content into separate parts by keyword.

  1. You should first know the full set/dictionary of all possible keywords.
  2. Some keywords might also appear inside a Content part... so it is better to split not on the bare keyword values, nor on the <span>keyword</span> expression alone, but on the more distinctive <span>keyword</span>\s*<title> and <span>keyword</span>\s*<strong> patterns. That way the parts are split correctly with high probability; see the sketch after this list.
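
A minimal sketch of that idea, assuming the marker is literally <span>keyword</span> and that each section's fields can be read with the illustrative XPath expressions below (adjust both to the real markup). parsel is the selector library that ships with Scrapy:

    import re
    from parsel import Selector

    # Split on the keyword marker only when it is followed by a <title> or <strong> tag,
    # which is far less likely to collide with keyword text inside a Content div.
    # The lookahead keeps the <title>/<strong> tag inside the chunk that follows.
    SECTION_RE = re.compile(r"<span>keyword</span>\s*(?=<(?:title|strong)>)", re.IGNORECASE)

    def split_sections(html: str):
        # The first chunk is whatever precedes the first marker; drop empty chunks.
        return [c for c in SECTION_RE.split(html) if c.strip()]

    def parse_section(chunk: str):
        sel = Selector(text=chunk)
        # Missing fields come back as "" so a broken section is easy to spot.
        title = (sel.xpath("//title/text()").get(default="")
                 or sel.xpath("//strong/text()").get(default=""))
        date = sel.xpath("//span/text()").get(default="")
        content = sel.xpath("//div/text()").get(default="")
        return [title, date, content]

    items = [parse_section(chunk) for chunk in split_sections(html)]

With this per-section structure, a section whose title landed in <strong> (or is missing entirely) shows up as ["", Date_i, Div_i] at a known index instead of silently shifting every later list entry.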
