How to remove HTML tags from a free-text column using Python

I have a free-text column in a pandas DataFrame whose values contain HTML tags.

 ID Free text field
    1   <p><span style="background-color: rgb(255, 255, 255); color: rgb(37, 36, 35); font-family: 
        Arial; font-size: 10.5pt;">TExt1:</span></p><p><span style="background-color: rgb(255, 255, 
        255); color: rgb(37, 36, 35); font-family: Arial; font-size: 10.5pt;">Score: 5</span></p><p> 
        <span style="background-color: rgb(255, 255, 255); color: rgb(37, 36, 35); font-family: Arial; 
         font-size: 10.5pt;">B - </span><span style="background-color: rgb(255, 255, 255); color: 
         rgb(36, 36, 36); font-family: Arial; font-size: 10.5pt;">TExt2</span></p><p><span 
         style="background-color: rgb(255, 255, 255); color: rgb(37, 36, 35); font-family: Arial; 
         font-size: 10.5pt;">Text6</span></p><p><span style="background-color: rgb(255, 255, 255); 
         color: rgb(37, 36, 35); font-family: Arial; font-size: 10.5pt;">Text3</span></p><p><span 
         style="background-color: rgb(255, 255, 255); color: rgb(37, 36, 35); font-family: Arial; 
         font-size: 10.5pt;">Text4</span></p>
    2   <p>Text10</p>
    3   <p>Sky is blue</p>
    4   <p>Text3</p><p><br></p><p>Text19</p>
    5   <p> Complaint1</p><p><br></p><p>Text1</p><p>hospo 2</p><p>Tes45</p><p><br></p><p>test</p>
    6   <p>Test44</p>
    7   <p>Test54</p>

Is there any way I could remove those HTML tags?

Any help would be appreciated.

Thanks

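Try using Beautiful Soup. Its stripped_strings generator yields each piece of text between the tags with the surrounding whitespace removed: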

from bs4 import BeautifulSoup

df['free text'].apply(
    lambda x: list(BeautifulSoup(x, "html.parser").stripped_strings)
)

0                                     [Text10]
1                                [Sky is blue]
2                              [Text3, Text19]
3    [Complaint1, Text1, hospo 2, Tes45, test]
4                                     [Test44]
5                                     [Test54]
Name: free text, dtype: object
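
If you would rather have one plain string per row than a list, the pieces can be joined. The following is only a sketch along the same lines: the helper name strip_html, the new column name "clean text" and the small sample frame are made up for illustration, and it assumes missing values should be passed through unchanged.

```python
import pandas as pd
from bs4 import BeautifulSoup


def strip_html(value, sep=" "):
    """Return the visible text of an HTML fragment as a single string."""
    # Pass non-string values (e.g. None/NaN for missing cells) through as-is.
    if not isinstance(value, str):
        return value
    soup = BeautifulSoup(value, "html.parser")
    # stripped_strings yields each text node with surrounding whitespace
    # removed and skips nodes that are empty after stripping.
    return sep.join(soup.stripped_strings)


# Hypothetical sample frame mirroring two rows from the question.
df = pd.DataFrame(
    {
        "ID": [4, 5],
        "free text": [
            "<p>Text3</p><p><br></p><p>Text19</p>",
            "<p> Complaint1</p><p><br></p><p>Text1</p><p>hospo 2</p>",
        ],
    }
)

# "clean text" is an illustrative column name, not one from the question.
df["clean text"] = df["free text"].apply(strip_html)
print(df["clean text"].tolist())
# ['Text3 Text19', 'Complaint1 Text1 hospo 2']
```

Note that joining with a space also puts a space between text from adjacent inline tags (for example two neighbouring span elements), so pass a different sep, such as "\n", if that matters for your data.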

 