Extract text from a specific pattern in a text file using Python
I have a text file from which I am trying to extract the titles into an Excel column. However, the required titles sit inside a specific pattern:
COM *******************
COM * Title 1*
COM *******************
COM ***************************
COM * Sub 1 *
COM ***************************
{
...TEXT DETAILS...
}
COM ***************************
COM * Sub 2 *
COM ***************************
{
...TEXT DETAILS...
}
COM *******************
COM * Title 2*
COM *******************
COM ***************************
COM * T2 Sub 1 *
COM ***************************
{
...TEXT DETAILS...
}
COM ***************************
COM * T2 Sub 2 *
COM ***************************
{
...TEXT DETAILS...
}
The required output of the string extraction (the titles) is:
['Title 1', 'Sub 1',..,'T2 Sub 2']
or, as an Excel column:
CATEGORY
Title 1
Sub 1
Sub 2
Title 2
T2 Sub 1
T2 Sub 2
It is actually the 'COM *****' pattern, with the title on the middle line, that I am unable to handle. I recently extracted the required strings based on a string pattern, which I think was similar to my current problem.
For that case the input text file was in this format:
CTG 'GEN:LT'
{
TEXT DETAILS....
}
CTG 'GEN:FR'
{
TEXT DETAILS....
}
CTG 'GEN:G_L02'
{
TEXT DETAILS....
}
CTG 'GEN:ER'
{
TEXT DETAILS....
}
CTG 'GEN:C1'
{
TEXT DETAILS....
}
My goal was to extract the string after CTG, which is enclosed in ' '. My idea was to detect the CTG string and print the string next to it. Here is how I implemented it:
import re

def getCtgName(text):
    matches = re.findall(r"'(.+?)'", text)
    return matches

mylines = []                                 # Declare an empty list.
with open('filepath.txt', 'rt') as myfile:   # Open .txt for reading text.
    for myline in myfile:                    # For each line in the file,
        mylines.append(myline.rstrip('\n'))  # strip newline and add to list.

columns = []
substr = "CTG"                               # substring to search for.
for line in mylines:                         # string to be searched
    if substr in line:
        columns.append(getCtgName(line)[0])
print(columns)
And I got this output:
['GEN:LT', 'GEN:FR',..., 'GEN:C1']
I believe similar logic can be implemented for extracting the titles between those comment (COM****) lines. Any help with the code, logic, or resources will be appreciated. Thank you!
I think you can simplify this code into one regex pattern, using lookbehind and lookahead. These two techniques allow you to specify a certain part that has to come before or after the match, but which isn't included in the match itself. The syntax is (?<=text) for lookbehind and (?=text) for lookahead.
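As a small illustration of the two assertions on their own, here is a minimal sketch. The `sample` string is made up for the demonstration and is not taken from the question's file.

```python
import re

# Hypothetical sample string, used only to demonstrate the syntax.
sample = "key=<alpha> other=<beta>"

# (?<=<) requires a "<" immediately before the match,
# (?=>) requires a ">" immediately after it.
# Neither bracket becomes part of the returned text.
inside = re.findall(r"(?<=<)\w+(?=>)", sample)
print(inside)  # ['alpha', 'beta']
```

Note that Python's `re` module only accepts lookbehind patterns of a fixed width, which matters when the text in front of the match can vary in length.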
So, the part that comes before a title is COM ***************************\nCOM * and the part that comes after it is *\nCOM ***************************. When we put this in regex syntax, the pattern is:
(?<=COM \*{27}\nCOM \*)[^\n]+(?=\*\nCOM \*{27})
In Python code, that becomes:
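import re

with open('filepath.txt', 'rt') as myfile:
    txt = myfile.read()

# The lookbehind and lookahead each expect a border of exactly 27 asterisks.
pattern = r"(?<=COM \*{27}\nCOM \*)[^\n]+(?=\*\nCOM \*{27})"
# Each match keeps the spaces around the title; call .strip() on it if needed.
titles = re.findall(pattern, txt)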
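The border width is fixed in the pattern above, while the `Title` borders in the question appear shorter than the `Sub` borders. Since Python's `re` rejects variable-width lookbehind, one possible alternative is a capture group combined with `re.MULTILINE`. The following is only a sketch under that assumption; `HEADING`, `extract_titles`, and the sample text are hypothetical names and data, not part of the original answer.

```python
import re

# Border line ("COM" + any number of asterisks), then the heading line,
# then another border line that is checked but not consumed.
HEADING = re.compile(
    r"^COM \*+[ \t]*\n"                         # top border
    r"COM \*[ \t]*([^*\n]+?)[ \t]*\*[ \t]*\n"   # heading line; group 1 is the title
    r"(?=COM \*+[ \t]*$)",                      # bottom border (lookahead only)
    re.MULTILINE,
)

def extract_titles(text):
    """Return the headings found between two COM border lines."""
    return HEADING.findall(text)

# Hypothetical sample with two different border widths.
sample_text = "\n".join([
    "COM " + "*" * 19, "COM * Title 1*", "COM " + "*" * 19,
    "COM " + "*" * 27, "COM * Sub 1 *", "COM " + "*" * 27,
    "{", "...TEXT DETAILS...", "}",
])
print(extract_titles(sample_text))  # ['Title 1', 'Sub 1']
```

Because the title is taken from a capture group with the surrounding spaces matched outside it, no extra `.strip()` call is needed here.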
Another way of doing this would be to use your code first and then delete all occurrences of the asterisk-only border lines (such as "***************************") from the result.
An implementation:
mylines = []                                 # Declare an empty list.
with open('filepath.txt', 'rt') as myfile:   # Open .txt for reading text.
    for myline in myfile:                    # For each line in the file,
        mylines.append(myline.rstrip('\n'))  # strip newline and add to list.

titles = []
substr = "COM"                               # substring to search for.
for line in mylines:                         # string to be searched
    if substr in line:
        # Keep whatever follows "COM", e.g. "* Sub 1 *" or a run of asterisks.
        titles.append(line.split(substr, 1)[1].strip())

# Delete the border entries (asterisks only), then trim the "*" and spaces around each title.
titles = [t.strip("* ") for t in titles if t.strip("*")]
print(titles)
Simply use the following regex instead of your regex in the function getCtgName, assuming that the titles and subtitles will not contain *:
matches = re.findall(r"COM\s*\*([^*]+)", text)
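For this regex to fit into the question's line-by-line loop, the loop would presumably need to look at COM lines rather than CTG lines, and it should not index `[0]` blindly, because the border lines yield an empty match list. A minimal sketch under those assumptions follows; `get_titles` and the sample `lines` are hypothetical.

```python
import re

def get_titles(lines):
    """Collect the heading text from COM lines, skipping the border lines."""
    titles = []
    for line in lines:
        # Border lines have no non-asterisk text right after "COM *",
        # so findall returns an empty list for them.
        matches = re.findall(r"COM\s*\*([^*]+)", line)
        if matches:
            # The captured text still carries its surrounding spaces.
            titles.append(matches[0].strip())
    return titles

# Hypothetical sample lines in the question's format.
lines = ["COM *****", "COM * Title 1*", "COM *****", "{", "...TEXT DETAILS...", "}"]
print(get_titles(lines))  # ['Title 1']
```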
I am assuming that titles won't contain * characters.
import re

headings = []
# Assuming that each line of the text file has already been read and stored in a list named 'strings'
for string in strings:
    # Raw string, so that \* reaches the regex engine as an escaped asterisk.
    sub_string = re.search(r'COM \*([^*]+)\*', string)
    if sub_string:
        headings.append(sub_string.group(1).strip())
Input:
strings = [
'COM *******************',
'COM * Title 1*',
'COM *******************',
'COM ***************************',
'COM * Sub 1 *',
'COM ***************************',
'{',
'...TEXT DETAILS...',
'}',
'COM ***************************',
'COM * Sub 2 *',
'COM ***************************',
'{',
'...TEXT DETAILS...',
'}',
'COM *******************',
'COM * Title 2*',
'COM *******************',
'COM ***************************',
'COM * T2 Sub 1 *',
'COM ***************************',
'{',
'...TEXT DETAILS...',
'}',
'COM ***************************',
'COM * T2 Sub 2 *',
'COM ***************************',
'{',
'...TEXT DETAILS...',
'}',
]
Output:
['Title 1', 'Sub 1', 'Sub 2', 'Title 2', 'T2 Sub 1', 'T2 Sub 2']
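The question also asks for the titles in an Excel column headed CATEGORY. None of the answers cover that step, so the following is only a sketch of one option: writing a CSV file with the standard `csv` module, which Excel can open. The function name `write_category_column` and the output path are assumptions.

```python
import csv

# The list produced by the extraction step above.
headings = ['Title 1', 'Sub 1', 'Sub 2', 'Title 2', 'T2 Sub 1', 'T2 Sub 2']

def write_category_column(headings, path):
    """Write the headings as a single CSV column under a CATEGORY header."""
    # newline="" lets the csv module control the line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["CATEGORY"])
        writer.writerows([h] for h in headings)
```

Calling `write_category_column(headings, "titles.csv")` would then create a one-column file with CATEGORY in the first row and one title per row below it.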