
How to extract column data from text in Python (regex)

Let's say we have text in which column headers are stored in the following form:

{|
|+ The table's caption
! scope="col" width="20"style="background-color:#cfcfcf;"align="center" | Column header 1
! scope="col" width="20"style="background-color:#ff55ff;"align="center" | Column header 2
! scope="col" | Column header 3
|-
! scope="row" | Row header 1
| Cell 2 || Cell 3
|-
! scope="row" | Row header A
| Cell B
| Cell C
|}

How can I extract all the column headers ([Column header 1, Column header 2, Column header 3]) from the text in Python? I tried:

re.findall('*! scope="col" |', text, re.IGNORECASE)

But it's not doing the job.

https://regex101.com/r/PLKREz/6

How can I do it in Python?

You can capture the text after the last | on each line that contains scope="col":

import re

data = """
{|
|+ The table's caption
! scope="col" width="20"style="background-color:#cfcfcf;"align="center" | Column header 1
! scope="col" width="20"style="background-color:#ff55ff;"align="center" | Column header 2
! scope="col" | Column header 3
|-
! scope="row" | Row header 1
| Cell 2 || Cell 3
|-
! scope="row" | Row header A
| Cell B
| Cell C
|}"""

print(re.findall(r'scope="col".*?\| ([^|]+)$', data, re.MULTILINE))

Prints:

['Column header 1', 'Column header 2', 'Column header 3']
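
As a hedged follow-up, here is a minimal sketch of two things: why the attempt from the question does not work, and a stricter variant of the pattern above. The attempt starts with `*`, which has nothing to repeat, so `re` rejects the pattern before any matching happens. In the pattern above, `[^|]+` can also run across newlines, which happens to be harmless for this input. The variant below is an assumption of mine, not part of the original answer: it anchors each match to a line starting with `!`, keeps the capture on a single line, and trims surrounding spaces. The names `COL_HEADER` and `column_headers` are hypothetical.

```python
import re

text = """\
{|
|+ The table's caption
! scope="col" width="20"style="background-color:#cfcfcf;"align="center" | Column header 1
! scope="col" width="20"style="background-color:#ff55ff;"align="center" | Column header 2
! scope="col" | Column header 3
|-
! scope="row" | Row header 1
| Cell 2 || Cell 3
|}"""

# The pattern from the question does not compile: the leading "*" is a
# quantifier with nothing before it to repeat, so re raises re.error.
try:
    re.findall('*! scope="col" |', text, re.IGNORECASE)
    attempt_error = None
except re.error as exc:
    attempt_error = str(exc)

# Hypothetical stricter pattern (an assumption, not the answer's exact regex):
#   ^![ \t]*scope="col"   line must start with "!" followed by scope="col"
#   [^\n]*\|              skip any attributes up to the last "|" on that line
#   [ \t]*([^|\n]+?)[ \t]*$   capture the header text, trimmed, up to end of line
COL_HEADER = re.compile(
    r'^![ \t]*scope="col"[^\n]*\|[ \t]*([^|\n]+?)[ \t]*$',
    re.MULTILINE,
)

def column_headers(wikitext):
    """Return the header text of every scope="col" line, in document order."""
    return COL_HEADER.findall(wikitext)

print(attempt_error)
print(column_headers(text))  # → ['Column header 1', 'Column header 2', 'Column header 3']
```

Excluding `\n` from the character classes keeps a match from spilling onto the next line when a header line is followed by text that contains no `|`. This sketch only handles headers written one per line, as in the question; it does not try to cover other wikitext table layouts.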
