I have a string that looks like this:
answer = """
models sold in last 4 weeks
+---------------+
| pcid |
+---------------+
| 22bv03 |
| 3eer3d |
| fes44h2j555j |
| 4mee33ikj5sq1 |
| 99dkk3bvr32a |
| cv44trmq011sa |
| lo33xc1a |
+---------------+***For more information, please visit the company page.
"""
Now I need to extract just the table from the string, so that the end result looks like this:
+---------------+
| pcid |
+---------------+
| 22bv03 |
| 3eer3d |
| fes44h2j555j |
| 4mee33ikj5sq1 |
| 99dkk3bvr32a |
| cv44trmq011sa |
| lo33xc1a |
+---------------+
I tried something like this:
answer.split("***")[0].split("\n")[1]
But doing so, I only get the header instead of the expected table.
How do I make sure that I extract only the table from the string? Is there a regex that can be applied here?
I might try:
import re
answer = re.sub(r'^.*?(?=\+-)|\*\*\*.*$', '', answer, flags=re.DOTALL)
print(answer)
This prints:
+---------------+
| pcid |
+---------------+
| 22bv03 |
| 3eer3d |
| fes44h2j555j |
| 4mee33ikj5sq1 |
| 99dkk3bvr32a |
| cv44trmq011sa |
| lo33xc1a |
+---------------+
The regex uses an alternation to trim the answer string at both the beginning and the end. Because of re.DOTALL, the dot also matches newlines, so each part can span several lines. The first part:
^.*?(?=\+-)
removes all content from the start of the string up to, but not including, the start of the table (+-). The second part:
\*\*\*.*$
removes all content from the start of the footnote (***) until the end of the string.
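To see what each half of the alternation contributes, here is a minimal sketch that applies the two parts separately and then together. The shortened `sample` string is an assumption made for illustration; it is not the string from the question.

```python
import re

# Hypothetical, shortened stand-in for the string in the question.
sample = "\nmodels sold\n+----+\n| ab |\n+----+***For more information.\n"

# First alternative alone: drops everything before the first "+-".
head_removed = re.sub(r'^.*?(?=\+-)', '', sample, flags=re.DOTALL)

# Second alternative alone: drops "***" and everything after it.
tail_removed = re.sub(r'\*\*\*.*$', '', sample, flags=re.DOTALL)

# Both alternatives combined, as in the answer above.
both = re.sub(r'^.*?(?=\+-)|\*\*\*.*$', '', sample, flags=re.DOTALL)

print(repr(head_removed))
print(repr(tail_removed))
print(repr(both))
```

Printing with `repr` makes the leading and trailing newlines visible, which helps when checking that nothing is left over around the table.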
It looks as though you want to match from the first occurrence of a fixed delimiter to the last occurrence of the same delimiter.
In this case, you do not need a regex:
sep = '+---------------+'
start = answer.find(sep)
end = answer.rfind(sep)
print(answer[start:end+len(sep)])
See the Python demo, yielding:
+---------------+
| pcid |
+---------------+
| 22bv03 |
| 3eer3d |
| fes44h2j555j |
| 4mee33ikj5sq1 |
| 99dkk3bvr32a |
| cv44trmq011sa |
| lo33xc1a |
+---------------+
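Note that str.find and str.rfind return -1 when the separator is missing, in which case the slice above would silently return the wrong text. A minimal sketch of a guarded version follows; the helper name `extract_between` is hypothetical, and returning None when the separator occurs fewer than twice is an assumed design choice, not something the question requires.

```python
def extract_between(text, sep):
    """Return the span from the first to the last occurrence of sep, or None."""
    start = text.find(sep)
    end = text.rfind(sep)
    # find/rfind return -1 when sep is absent; start == end means sep
    # occurs only once, so there is no enclosed table to return.
    if start == -1 or start == end:
        return None
    return text[start:end + len(sep)]


# Shortened, hypothetical sample input.
sample = "title\n+--+\n| a |\n+--+***footnote"
print(extract_between(sample, "+--+"))
print(extract_between("no table here", "+--+"))
```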
With a regex, you can directly match from the first to the last occurrence of the separator:
import re
answer = """
models sold in last 4 weeks
+---------------+
| pcid |
+---------------+
| 22bv03 |
| 3eer3d |
| fes44h2j555j |
| 4mee33ikj5sq1 |
| 99dkk3bvr32a |
| cv44trmq011sa |
| lo33xc1a |
+---------------+***For more information, please visit the company page.
"""
sep = '+---------------+'
m = re.search(r'(?sm)^{0}.*{0}'.format(re.escape(sep)), answer)
if m:
    print(m.group())
Regex details
(?sm) - dot now matches line breaks and ^ matches the start of a line
^ - start of a line
\+---------------\+ - a separator pattern
.* - any 0+ chars, as many as possible
\+---------------\+ - a separator pattern
I tried this as follows
Step 1: Identify the index range by running the code below:
print(answer.index("ks"))
print(answer.index("***"))
This gives the index range of the table, i.e. [28:226]. Comment out this code once you have found the range.
Step 2:
print(answer[28:226])
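Hard-coded indices stop working as soon as the string changes, so the bounds can also be computed at run time. This is a minimal sketch under two assumptions: the table starts at the first "+" in the string and ends right before the "***" footnote. The shortened `answer` below is illustrative only.

```python
# Shortened, hypothetical version of the string from the question.
answer = """
models sold in last 4 weeks
+------+
| pcid |
+------+
| ab12 |
+------+***For more information, please visit the company page.
"""

start = answer.index("+")    # first character of the table's top border
end = answer.index("***")    # first character of the footnote
table = answer[start:end]
print(table)
```

Unlike str.find, str.index raises ValueError when the substring is not found, so a missing table or footnote is reported immediately instead of producing a wrong slice.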