How to extract certain substring from a multi line string in Python?

Question

I have a string that looks like this:

answer = """
models sold in last 4 weeks
+---------------+
|      pcid     |
+---------------+
|     22bv03    |
|     3eer3d    |
|  fes44h2j555j |
| 4mee33ikj5sq1 |
|  99dkk3bvr32a |
| cv44trmq011sa |
|    lo33xc1a   |
+---------------+***For more information, please visit the company page.
"""

Now I need to extract just the table from the string such that the end result looks like

+---------------+
|      pcid     |
+---------------+
|     22bv03    |
|     3eer3d    |
|  fes44h2j555j |
| 4mee33ikj5sq1 |
|  99dkk3bvr32a |
| cv44trmq011sa |
|    lo33xc1a   |
+---------------+

Now I tried doing something like this

answer.split("***")[0].split("\n")[1]

But doing so, I only get the header instead of the expected table.

How can I extract only the table from the string? Is there a regex that can be applied here?

Answer 1

I might try:

import re
answer = re.sub(r'^.*?(?=\+-)|\*\*\*.*$', '', answer, flags=re.DOTALL)
print(answer)

This prints:

+---------------+
|      pcid     |
+---------------+
|     22bv03    |
|     3eer3d    |
|  fes44h2j555j |
| 4mee33ikj5sq1 |
|  99dkk3bvr32a |
| cv44trmq011sa |
|    lo33xc1a   |
+---------------+

The regex uses an alternation to trim the answer string at both the beginning and the end. First:

^.*?(?=\+-)

removes all content from the start of the string up to, but not including, the start of the table ( +- ). The second part:

\*\*\*.*$

removes all content from the start of the footnote ( *** ) until the end of the string.
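
To make the two halves of the alternation easier to follow, here is a minimal sketch that applies them as two separate substitutions. The variable names without_prefix and table are only illustrative, and the sample string is copied from the question.

```python
import re

# Sample input copied from the question.
answer = """
models sold in last 4 weeks
+---------------+
|      pcid     |
+---------------+
|     22bv03    |
|     3eer3d    |
|  fes44h2j555j |
| 4mee33ikj5sq1 |
|  99dkk3bvr32a |
| cv44trmq011sa |
|    lo33xc1a   |
+---------------+***For more information, please visit the company page.
"""

# Step 1: lazily consume everything from the start of the string up to,
# but not including, the first "+-" (the lookahead does not consume it).
without_prefix = re.sub(r'^.*?(?=\+-)', '', answer, flags=re.DOTALL)

# Step 2: remove the footnote, from "***" through the end of the string.
table = re.sub(r'\*\*\*.*$', '', without_prefix, flags=re.DOTALL)

print(table)
```

With re.DOTALL the dot also matches newlines, which is what lets each pattern span several lines.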

Answer 2

It looks as though you wanted to match from the first occurrence of a fixed delimiter to the last occurrence of the same delimiter.

In this case, you do not have to use a regex:

sep = '+---------------+'
start = answer.find(sep)
end = answer.rfind(sep)
print(answer[start:end+len(sep)])

See the Python demo, yielding:

+---------------+
|      pcid     |
+---------------+
|     22bv03    |
|     3eer3d    |
|  fes44h2j555j |
| 4mee33ikj5sq1 |
|  99dkk3bvr32a |
| cv44trmq011sa |
|    lo33xc1a   |
+---------------+
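
One caveat with the find/rfind approach: both methods return -1 when the separator is missing, so the slice would quietly produce a wrong result instead of failing. A minimal sketch of a guarded version follows; extract_between is a hypothetical helper name, not something from the standard library.

```python
def extract_between(text, sep):
    """Return text from the first to the last occurrence of sep, inclusive.

    Hypothetical helper: returns None when sep occurs fewer than two times.
    """
    start = text.find(sep)
    end = text.rfind(sep)
    # find() returns -1 if sep is absent; start == end means it occurs once.
    if start == -1 or start == end:
        return None
    return text[start:end + len(sep)]


# Sample input copied from the question.
answer = """
models sold in last 4 weeks
+---------------+
|      pcid     |
+---------------+
|     22bv03    |
|     3eer3d    |
|  fes44h2j555j |
| 4mee33ikj5sq1 |
|  99dkk3bvr32a |
| cv44trmq011sa |
|    lo33xc1a   |
+---------------+***For more information, please visit the company page.
"""

table = extract_between(answer, '+---------------+')
if table is not None:
    print(table)
```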

With regex, you may directly match from the first to the last occurrence of the separator:

import re
answer = """
models sold in last 4 weeks
+---------------+
|      pcid     |
+---------------+
|     22bv03    |
|     3eer3d    |
|  fes44h2j555j |
| 4mee33ikj5sq1 |
|  99dkk3bvr32a |
| cv44trmq011sa |
|    lo33xc1a   |
+---------------+***For more information, please visit the company page.
"""
sep = '+---------------+'
m = re.search(r'(?sm)^{0}.*{0}'.format(re.escape(sep)), answer)
if m:
    print(m.group())

See another regex demo

Regex details

(?sm) - dot now matches line breaks and ^ matches start of a line
^ - start of a line
\+---------------\+ - a separator pattern
.* - any 0+ chars as many as possible
\+---------------\+ - separator pattern
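
If the table width is not known in advance, the same idea can be sketched without hard-coding the separator. This variant assumes every border line has the form "+", one or more dashes, "+"; that shape is an assumption about the input, and TABLE_RE is only an illustrative name.

```python
import re

# Sample input copied from the question.
answer = """
models sold in last 4 weeks
+---------------+
|      pcid     |
+---------------+
|     22bv03    |
|     3eer3d    |
|  fes44h2j555j |
| 4mee33ikj5sq1 |
|  99dkk3bvr32a |
| cv44trmq011sa |
|    lo33xc1a   |
+---------------+***For more information, please visit the company page.
"""

# (?sm): dot matches newlines, ^ matches at the start of each line.
# ^\+-+\+  : a border line of any width, at the start of a line
# .*       : greedy, so the match extends to the last border in the string
# \+-+\+   : the closing border
TABLE_RE = re.compile(r'(?sm)^\+-+\+.*\+-+\+')

m = TABLE_RE.search(answer)
if m:
    print(m.group())
```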

Answer 3

I tried this as follows

Step 1: Identify the index range by running the code below

print(answer.index("ks"))

print(answer.index("***"))

This gives the index range of the table, i.e. [28:226] (the slice starts just after "ks", which is two characters long). Comment out this code once you have found the range.

Step 2:

print(answer[28:226])
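
Hard-coded numbers such as 28 and 226 only fit this exact string. A minimal sketch that computes both bounds at run time is shown here, assuming the first "+" in the string is the start of the table and "***" marks the footnote (both are assumptions about the input). Note that str.index raises ValueError when the marker is missing.

```python
# Sample input copied from the question.
answer = """
models sold in last 4 weeks
+---------------+
|      pcid     |
+---------------+
|     22bv03    |
|     3eer3d    |
|  fes44h2j555j |
| 4mee33ikj5sq1 |
|  99dkk3bvr32a |
| cv44trmq011sa |
|    lo33xc1a   |
+---------------+***For more information, please visit the company page.
"""

# Assumption: no "+" appears in the text before the table.
start = answer.index("+")    # first character of the top border
# Assumption: "***" always introduces the footnote.
end = answer.index("***")    # first character of the footnote

print(answer[start:end])
```

Starting at the first "+" rather than just after "ks" also avoids the blank line that the fixed slice prints before the table.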

3 answers

solution1
1 ACCEPTED 2019-10-17 07:07:17

solution2
1 2019-10-17 07:20:42

solution3
0 2019-10-17 16:34:08