I am programming in Python with Scrapy and have a huge HTML
file with a structure similar to the one shown below:
<span>keyword</span>
<title>Title 1</title>
<span>Date 1</span>
<div>Content 1</div>
<span>keyword</span>
<title>Title 2</title>
<span>Date 2</span>
<div>Content 2</div>
...
<span>keyword</span>
<title>Title N</title>
<span>Date N</span>
<div>Content N</div>
My goal is to get the title, the date, and the contents inside the div for each section, but the sections themselves are not wrapped in separate divs or any other elements; they simply follow one another up to the N-th section.
I could extract all of title[1:N], date[1:N], and div[1:N] as three lists with len() == N, but doing so makes debugging hard: if N grows to 10,000 and len(title) == len(date) == len(div) turns out to be False, it will be difficult to find where things went wrong (e.g. some titles were put in <strong> instead of <title>).
One thing I noticed is the keyword located between each pair of sections. With the help of that keyword, is it possible to split the entire HTML
into N parts, and then get item[i] = ["Title_i", "Date_i", "Div_i"]
for each section by iteration?
That way, missing data would be represented as item[i] = ["", "Date_i", "Div_i"]
and would be much easier to locate.
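For illustration, the flat-list approach described above might look like the following sketch (a hypothetical regex-based extraction on a made-up sample, not the actual spider code). The second section is deliberately malformed to show the failure mode: the lengths diverge, but nothing indicates which section is broken.

```python
import re

# Hypothetical snippet mirroring the structure above; the second
# section is deliberately broken (<strong> instead of <title>).
html = (
    "<span>keyword</span><title>Title 1</title>"
    "<span>Date 1</span><div>Content 1</div>"
    "<span>keyword</span><strong>Title 2</strong>"
    "<span>Date 2</span><div>Content 2</div>"
)

# Collect each field into its own flat list.
titles = re.findall(r"<title>(.*?)</title>", html)
dates = [s for s in re.findall(r"<span>(.*?)</span>", html) if s != "keyword"]
divs = re.findall(r"<div>(.*?)</div>", html)

# The lengths no longer agree, but nothing says WHICH section is broken.
print(len(titles), len(dates), len(divs))  # → 1 2 2
```

With 10,000 sections, locating the one bad record from three mismatched lists like this is exactly the debugging problem described above.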
Carl, you may try to split the HTML file content into sections by that keyword. But the keyword value may also appear inside a Content
part, so you'd better split not on the bare keyword value, nor on the <span>keyword</span>
expression alone, but on the more unique patterns <span>keyword</span>\s*<title>
and <span>keyword</span>\s*<strong>.
That way the split will be correct with high probability.
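A minimal sketch of that idea in plain Python (the sample string and field regexes are assumptions based on the structure shown in the question; in a Scrapy spider you would split response.text the same way):

```python
import re

# Hypothetical sample mirroring the question's structure.
html = (
    "<span>keyword</span><title>Title 1</title>"
    "<span>Date 1</span><div>Content 1</div>"
    "<span>keyword</span><title>Title 2</title>"
    "<span>Date 2</span><div>Content 2</div>"
)

# Split on a zero-width lookahead (Python 3.7+) so each section keeps
# its own marker. The combined pattern <span>keyword</span>\s*<title>
# (or <strong> for malformed sections) is far less likely to occur
# inside a content block than the bare keyword.
marker = r"(?=<span>keyword</span>\s*(?:<title>|<strong>))"
sections = [s for s in re.split(marker, html) if s.strip()]

items = []
for s in sections:
    # Search each field independently; a missing field becomes "",
    # so a broken section is easy to spot by its position in items.
    title = re.search(r"<title>(.*?)</title>", s, re.S)
    date = re.search(r"</title>\s*<span>(.*?)</span>", s, re.S)
    div = re.search(r"<div>(.*?)</div>", s, re.S)
    items.append([m.group(1) if m else "" for m in (title, date, div)])

print(items)
# → [['Title 1', 'Date 1', 'Content 1'], ['Title 2', 'Date 2', 'Content 2']]
```

If a title was wrapped in <strong> instead of <title>, that section still becomes its own item with "" in the title slot, so the faulty record can be located by index instead of comparing three list lengths.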