
Separating an HTML file by keyword for scraping

I am programming in Python with Scrapy and have a huge HTML file with a structure similar to the one shown below:

<span>keyword</span>
<title>Title 1</title>
<span>Date 1</span>
<div>Content 1</div>

<span>keyword</span>
<title>Title 2</title>
<span>Date 2</span>
<div>Content 2</div>

...

<span>keyword</span>
<title>Title N</title>
<span>Date N</span>
<div>Content N</div>

My goal is to get the title, date, and the content inside the div for each section, but the sections themselves are not wrapped in separate div s or any other elements; they simply follow one after another until the N-th section.

While I could extract all the title[1:N], date[1:N], and div[1:N] as three lists of length N, doing so makes debugging hard: if N grows to 10,000 and len(title) == len(date) == len(div) turns out to be False, it is difficult to find where things went wrong (e.g. some titles are placed in <strong> instead of <title>).
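
For context, a minimal sketch of that flat-list approach, assuming a Scrapy response object and purely illustrative XPath expressions (in the real page the date selector would also have to exclude the keyword span s):

    # Flat-list approach: extract each field independently, then zip them together.
    # If a single title/date/div is missing, the lists silently go out of sync.
    titles = response.xpath("//title/text()").getall()
    dates = response.xpath("//span[text()!='keyword']/text()").getall()
    contents = response.xpath("//div/text()").getall()

    if not (len(titles) == len(dates) == len(contents)):
        # With N around 10,000 there is no easy way to tell which section broke the alignment.
        raise ValueError(f"misaligned lists: {len(titles)} titles, "
                         f"{len(dates)} dates, {len(contents)} divs")

    items = list(zip(titles, dates, contents))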

One thing I noticed is the keyword located between the sections. With the help of that keyword, is it possible to split the entire HTML into N parts, and then get item[i] = ["Title_i", "Date_i", "DIV_i"] for each section by iterating over them?

This way missing data would be represented as item[i] = ["", Date_i, Div_i] and would be much easier to locate.

Carl, you could try splitting the HTML file content into separate parts by keyword.

  1. You should first know the full set/dictionary of all possible keywords.
  2. Some keywords might also appear inside a Content part... so it is better to split not on the bare keyword values, nor on the <span>keyword</span> expression alone, but on the more distinctive <span>keyword</span>\s*<title> and <span>keyword</span>\s*<strong> patterns. That way the parts are split correctly with high probability; see the sketch after this list.
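
A minimal sketch of that idea, assuming the marker is literally <span>keyword</span> and that each section's fields can be read with the illustrative XPath expressions below (adjust both to the real markup). parsel is the selector library that ships with Scrapy:

    import re
    from parsel import Selector

    # Split on the keyword marker only when it is followed by a <title> or <strong> tag,
    # which is far less likely to collide with keyword text inside a Content div.
    # The lookahead keeps the <title>/<strong> tag inside the chunk that follows.
    SECTION_RE = re.compile(r"<span>keyword</span>\s*(?=<(?:title|strong)>)", re.IGNORECASE)

    def split_sections(html: str):
        # The first chunk is whatever precedes the first marker; drop empty chunks.
        return [c for c in SECTION_RE.split(html) if c.strip()]

    def parse_section(chunk: str):
        sel = Selector(text=chunk)
        # Missing fields come back as "" so a broken section is easy to spot.
        title = (sel.xpath("//title/text()").get(default="")
                 or sel.xpath("//strong/text()").get(default=""))
        date = sel.xpath("//span/text()").get(default="")
        content = sel.xpath("//div/text()").get(default="")
        return [title, date, content]

    items = [parse_section(chunk) for chunk in split_sections(html)]

With this per-section structure, a section whose title landed in <strong> (or is missing entirely) shows up as ["", Date_i, Div_i] at a known index instead of silently shifting every later list entry.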
