
How to get the inner html of an element using scrapy

This is my HTML document

<div class='my-class'>
    <p>some text</p>
</div>

I want to get the inner HTML of the div.my-class element, which is:

<p>some text</p>

The inner HTML is not always a <p>; it could be some other element.

Here is what I have tried, but neither attempt gives the desired output:

res = response.css('div.my-class').get()

# result
<div class='my-class'>
 <p>some text</p>
</div>

# -------------------------------------------

res = response.css('div.my-class::text').get()

# result
some text

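Here is a way to get the child elements of the element with class my-class (note that get() returns only the first match):

from scrapy.selector import Selector

html = "<div class='my-class'><p>some text</p></div>"
response = Selector(text=html, type="html")
# /* selects every element that is a direct child of the matched element
print(response.xpath('//*[@class="my-class"]/*').get())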

The following CSS selector gets the expected output (* matches all descendant elements, and get() returns the first one):

res = response.css('div.my-class *').get()

# result
<p>some text</p>

Note that if there are multiple child elements, you need to use getall() to get the entire inner HTML. For example, given the following input:

<div class='my-class'>
    <h1>header</h1>
    <p>
        outer paragraph
        <p>
            inner paragraph
            <label>label</label>
        </p>
    </p>
    
</div>

Then you can get all the inner elements, and join them into a single string variable:

# get all immediate children as a list of strings
res_array = response.css('div.my-class > *').getall()

# join the list elements into res
res = " ".join(res_array)

*Note: if you don't include > before *, the selector matches every descendant rather than only the immediate children, which means the inner elements appear more than once in the list (once inside their parent's markup and once on their own).


 