
How to extract certain substring from a multi line string in Python?

I have a string that looks like this:

answer = """
models sold in last 4 weeks
+---------------+
|      pcid     |
+---------------+
|     22bv03    |
|     3eer3d    |
|  fes44h2j555j |
| 4mee33ikj5sq1 |
|  99dkk3bvr32a |
| cv44trmq011sa |
|    lo33xc1a   |
+---------------+***For more information, please visit the company page.
"""

I need to extract just the table from the string, so that the end result looks like this:

+---------------+
|      pcid     |
+---------------+
|     22bv03    |
|     3eer3d    |
|  fes44h2j555j |
| 4mee33ikj5sq1 |
|  99dkk3bvr32a |
| cv44trmq011sa |
|    lo33xc1a   |
+---------------+

Now I tried doing something like this

answer.split("***")[0].split("\n")[1]

But this returns only the header line rather than the whole table.

How can I extract only the table from the string? Is there a regex that can be applied here?

I might try:

import re

answer = re.sub(r'^.*?(?=\+-)|\*\*\*.*$', '', answer, flags=re.DOTALL)
print(answer)

This prints:

+---------------+
|      pcid     |
+---------------+
|     22bv03    |
|     3eer3d    |
|  fes44h2j555j |
| 4mee33ikj5sq1 |
|  99dkk3bvr32a |
| cv44trmq011sa |
|    lo33xc1a   |
+---------------+

The regex uses an alternation to trim the answer string at both the beginning and the end. The first part:

^.*?(?=\+-)

removes all content from the start of the string up to, but not including, the start of the table (+-). The second part:

\*\*\*.*$

removes all content from the start of the footnote (***) until the end of the string.
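
To see what each half of the alternation does, the two parts can also be applied one after the other. This is only a sketch on a shortened sample string with the same shape as the one in the question; the names no_prefix and table are made up for the illustration.

```python
import re

# Shortened sample with the same shape as the string in the question.
answer = """
models sold in last 4 weeks
+------+
| pcid |
+------+
| 22bv |
+------+***For more information, please visit the company page.
"""

# Part 1: lazily match from the start of the string up to (not including) "+-".
no_prefix = re.sub(r'^.*?(?=\+-)', '', answer, flags=re.DOTALL)

# Part 2: match from "***" to the end of the string, trailing newline included.
table = re.sub(r'\*\*\*.*$', '', no_prefix, flags=re.DOTALL)

print(table)
```

Because re.DOTALL lets the dot cross line breaks, each part can span several lines, which is what allows a single pattern to trim both ends.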

It looks as though you wanted to match from the first occurrence of a fixed delimiter to the last occurrence of the same delimiter.

In this case, you do not have to use a regex:

sep = '+---------------+'
start = answer.find(sep)
end = answer.rfind(sep)
print(answer[start:end+len(sep)])

This yields:

+---------------+
|      pcid     |
+---------------+
|     22bv03    |
|     3eer3d    |
|  fes44h2j555j |
| 4mee33ikj5sq1 |
|  99dkk3bvr32a |
| cv44trmq011sa |
|    lo33xc1a   |
+---------------+
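
One thing to keep in mind: str.find and str.rfind return -1 when the separator is absent, so the bare slice would then quietly return the wrong text. A minimal sketch of a guarded version follows; the function name extract_table is hypothetical and not part of the original answer.

```python
def extract_table(text, sep):
    """Return the text from the first to the last occurrence of sep, or None."""
    start = text.find(sep)
    if start == -1:
        # Separator not present at all: there is no table to return.
        return None
    end = text.rfind(sep)
    # Include the last separator itself in the slice.
    return text[start:end + len(sep)]


sample = "title\n+--+\n|ab|\n+--+***footnote\n"
print(extract_table(sample, "+--+"))
print(extract_table("no table here", "+--+"))
```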

With a regex, you can directly match from the first to the last occurrence of the separator:

import re
answer = """
models sold in last 4 weeks
+---------------+
|      pcid     |
+---------------+
|     22bv03    |
|     3eer3d    |
|  fes44h2j555j |
| 4mee33ikj5sq1 |
|  99dkk3bvr32a |
| cv44trmq011sa |
|    lo33xc1a   |
+---------------+***For more information, please visit the company page.
"""
sep = '+---------------+'
m = re.search(r'(?sm)^{0}.*{0}'.format(re.escape(sep)), answer)
if m:
    print(m.group())


Regex details

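  • (?sm) - inline flags: s (re.DOTALL) lets the dot match line breaks, and m (re.MULTILINE) lets ^ match at the start of each line
  • ^ - start of a line
  • \+---------------\+ - the separator pattern
  • .* - any zero or more characters, as many as possible (greedy), so the match runs up to the last separator
  • \+---------------\+ - the separator pattern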
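
If the width of the table is not known in advance, the same idea might be applied with a border pattern instead of a fixed separator. The sketch below assumes that every border line has the form "+", one or more dashes, "+", and that no cell contains such a sequence; the border variable is an illustration only, and the sample string is a shortened version of the one in the question.

```python
import re

# Shortened sample with the same shape as the string in the question.
answer = """
models sold in last 4 weeks
+------+
| pcid |
+------+
| 22bv |
+------+***For more information, please visit the company page.
"""

# Assumed border shape: a plus sign, one or more dashes, a plus sign.
border = r'\+-+\+'

# Same structure as before: first border at a line start, greedy .*, last border.
m = re.search(r'(?sm)^{0}.*{0}'.format(border), answer)
table = m.group() if m else None
print(table)
```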

I approached it as follows:

Step 1: Identify the index range by running the code below

print(answer.index("ks"))

print(answer.index("***"))

From these two indices you can work out the index range of the table, i.e. [28:226]. Comment out this code once you have found the range.

Step 2:

print(answer[28:226])
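
A hard-coded range like this fits only this exact string, since the numbers shift as soon as the text changes. A variant that looks the positions up at run time could look like the sketch below; it assumes the table always starts at the first "+-" and ends right before "***", and it uses a shortened sample string. Note that str.index raises ValueError when a marker is missing.

```python
# Shortened sample with the same shape as the string in the question.
answer = """
models sold in last 4 weeks
+------+
| pcid |
+------+
| 22bv |
+------+***For more information, please visit the company page.
"""

start = answer.index("+-")   # position of the top-left corner of the table
end = answer.index("***")    # position where the footnote begins
table = answer[start:end]
print(table)
```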
