Extract all words between two keywords in .txt file

I would like to extract all words between specific keywords in a .txt file. The starting keyword is PROC SQL; (I need this to be case insensitive) and the ending keyword could be either RUN;, quit; or QUIT;. This is my sample .txt file.

Thus far, this is my code:

import re

with open('lan sample text file1.txt') as file:
    text = file.read()
    regex = re.compile(r'(PROC SQL;|proc sql;(.*?)RUN;|quit;|QUIT;)')
    k = regex.findall(text)
    print(k)

Output:

[('quit;', ''), ('quit;', ''), ('PROC SQL;', '')]

However, my intended output is to get the words in between and inclusive of the keywords:

proc sql; ("TRUuuuth");
hhhjhfjs as fdsjfsj:
select * from djfkjd to jfkjs
(
SELECT abc AS abc1, abc_2_ AS efg, abc_fg, fkdkfj_vv, jjsflkl_ff, fjkdsf_jfkj
    FROM &xxx..xxx_xxx_xxE
where ((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.) and 
      (xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.))
 );

1)

jjjjjj;

  select xx("xE'", PUT(xx.xxxx.),"'") jdfjhf:jhfjj from xxxx_x_xx_L ;
quit; 

PROC SQL; ("CUuuiiiiuth");
hhhjhfjs as fdsjfsj:
select * from djfkjd to jfkjs
(SELECT abc AS abc1, abc_2_ AS efg, abc_fg, fkdkfj_vv, jjsflkl_ff, fjkdsf_jfkj
    FROM &xxx..xxx_xxx_xxE
where ((xxx(xx_ix as format 'xxxx-xx') gff &jfjfsj_jfjfj.) and 
      (xxx(xx_ix as format 'xxxx-xx') lec &jgjsd_vnv.))(( ))
 );

2)(

RUN;

Any advice or different ways to go about this would be greatly appreciated!

Output after implementing user @Finefoot's code: [screenshot not included]

However, is there a way to separate the lines so that the output looks something like this instead?

[screenshot not included]

This works for me:

with open('lan sample text file1.txt') as file:
    inside_block = False
    text_to_return = ""
    for line in file:
        if inside_block:
            # An ending keyword at the start of the line closes the block.
            if line[0:5].lower() == "quit;" or line[0:4].upper() == "RUN;":
                inside_block = False
            text_to_return += line
        elif line[0:9].upper() == "PROC SQL;":
            # A starting keyword at the start of the line opens a block.
            inside_block = True
            text_to_return += line

# Write the collected blocks once the input file has been read.
with open("output.txt", "w") as output_file:
    output_file.write(text_to_return)

Is this an acceptable solution to you?

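A solution if you don't want to use regular expressions:

starts = ["PROC SQL;"]
ends = ["RUN;", "QUIT;"]

with open('/tmp/some_file.txt') as f:
    content = f.read()

# Search a lowercased copy (case-insensitive), but slice the original text.
lowered = content.lower()
for s in starts:
    start = lowered.find(s.lower())
    while start != -1:
        # Find the nearest ending keyword that follows this starting keyword.
        found = [(lowered.find(e.lower(), start + len(s)), e) for e in ends]
        found = [(pos, e) for pos, e in found if pos != -1]
        if not found:
            break
        pos, e = min(found)
        end = pos + len(e)
        print(content[start:end])
        # Continue with the next block, if any.
        start = lowered.find(s.lower(), end)

Does it help you?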

In your pattern (PROC SQL;|proc sql;(.*?)RUN;|quit;|QUIT;) there is a typo, I think: you're missing a closing parenthesis ) after proc sql; and before (.*?), as well as an opening parenthesis ( afterwards. However, that's not all; you still won't get your desired result with the typo fixed.
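To see where the empty strings in the original output come from, it may help to remember that | has the lowest precedence in a pattern, so the original regex is read as four separate alternatives and only the second one contains the (.*?) group. The sketch below uses a made-up one-line sample string (an assumption for illustration, not the asker's file):

```python
import re

# Hypothetical one-line sample, not the asker's file.
sample = "PROC SQL; select 1; quit; proc sql; select 2; RUN;"

# The original pattern is split by "|" into four alternatives:
#   PROC SQL;  |  proc sql;(.*?)RUN;  |  quit;  |  QUIT;
original = re.compile(r'(PROC SQL;|proc sql;(.*?)RUN;|quit;|QUIT;)')

# With two groups, findall returns (group 1, group 2) tuples.
# Group 2 is empty whenever an alternative other than the second one matched.
original_matches = original.findall(sample)
for whole, inner in original_matches:
    print(repr(whole), repr(inner))
```

Only the lowercase "proc sql; ... RUN;" alternative fills the second group here, and it can only do so because the sample has no newline between the keywords.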

Have a look at the Python docs for re:

. (Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

As your input does contain newlines which you'd like . to match, you need to use the re.DOTALL flag. While we're on the topic of flags: you might also want to use the re.IGNORECASE flag if you really don't care about case sensitivity of your keywords.
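A small sketch of the effect of both flags, assuming a made-up three-line sample string rather than the real file:

```python
import re

# Hypothetical three-line sample; the real file has many more lines.
sample = "proc sql;\nselect 1;\nquit;"
pattern = r"proc sql;(.*?)quit;"

# Default mode: "." does not match "\n", so the keywords are never connected.
without_dotall = re.findall(pattern, sample)

# DOTALL: "." also matches "\n", so the text between the keywords is captured.
with_dotall = re.findall(pattern, sample, flags=re.DOTALL)

# IGNORECASE: the same lowercase pattern also matches uppercase keywords.
upper_case = re.findall(pattern, sample.upper(), flags=re.DOTALL | re.IGNORECASE)

print(without_dotall)
print(with_dotall)
print(upper_case)
```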

Also, I guess you don't want your keywords like PROC SQL; in your result, so you can use (?:...), which is the non-capturing version of regular parentheses.

The final regex pattern:

re.findall(r"(?:PROC SQL;)(.*?)(?:RUN;|QUIT;)", text, flags=re.IGNORECASE|re.DOTALL)
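Since the question asks for output that is inclusive of the keywords, here is a hedged sketch that wraps the pattern above in a helper with a switch for keeping or dropping the keywords. The function name extract_sql_blocks, its include_keywords parameter, and the sample string are assumptions made for illustration:

```python
import re

def extract_sql_blocks(text, include_keywords=False):
    """Return every PROC SQL; ... RUN;/QUIT; block found in `text`."""
    if include_keywords:
        # No capturing group: findall returns the whole match, keywords included.
        pattern = r"PROC SQL;.*?(?:RUN;|QUIT;)"
    else:
        # One capturing group: findall returns only the text between the keywords.
        pattern = r"(?:PROC SQL;)(.*?)(?:RUN;|QUIT;)"
    return re.findall(pattern, text, flags=re.IGNORECASE | re.DOTALL)

# Hypothetical sample with two blocks and both kinds of ending keyword.
sample = "proc sql;\nselect a;\nquit;\n\nPROC SQL;\nselect b;\nRUN;\n"

inner = extract_sql_blocks(sample)
whole = extract_sql_blocks(sample, include_keywords=True)
print(inner)
print(whole)
```

Note that a lazy .*? stops at the first RUN; or QUIT; it meets, wherever that appears in a line, so an ending keyword inside a string literal or comment would cut a block short.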

Update:

In your updated code in the Jupyter cell above, the results of re.findall are saved as the variable regex. It's a list of strings that match the pattern. If you call print(regex), you will print the list (which shows its elements, the strings, with \n escapes). If you don't want \n, you could print the elements (the strings themselves) instead: print(*regex). The default separator between two elements is a single space character, though, so you might want to set sep to something else, like multiple newlines, print(*regex, sep="\n"*5), or a separating line of -----, like print(*regex, sep="\n"+"-"*44+"\n"). Which of these suits you best for presenting your results is something you'll have to decide.
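The difference between the two ways of printing can be sketched as follows. The sample string and the names blocks, separator and report are assumptions for illustration (the answer's own variable is called regex):

```python
import re

# Hypothetical sample with two blocks.
sample = "proc sql;\nselect a;\nquit;\n\nPROC SQL;\nselect b;\nRUN;\n"
blocks = re.findall(r"(?:PROC SQL;)(.*?)(?:RUN;|QUIT;)", sample,
                    flags=re.IGNORECASE | re.DOTALL)

# Printing the list shows its repr, with newlines as \n escapes.
print(blocks)

# Unpacking prints the strings themselves, separated by a dashed line.
separator = "\n" + "-" * 44 + "\n"
print(*blocks, sep=separator)

# The same text as a single string, e.g. for writing to a file.
report = separator.join(blocks)
```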

Also, if the pattern doesn't seem too confusing already, you might want to use "inline modifiers" instead of the flags argument. That's (?i:...) for case-insensitive matching and (?s:...) instead of the DOTALL flag:

re.findall(r"(?i:PROC SQL;)((?s:.*?))(?i:RUN;|QUIT;)", text)
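As a quick check, the two spellings can be compared on a made-up sample (the sample string and variable names are assumptions). Scoped inline modifiers of the form (?i:...) require Python 3.6 or later:

```python
import re

# Hypothetical sample with mixed-case keywords.
sample = "proc sql;\nselect a;\nQuit;\nPROC SQL;\nselect b;\nrun;\n"

# Flags passed as an argument apply to the whole pattern.
with_flags = re.findall(r"(?:PROC SQL;)(.*?)(?:RUN;|QUIT;)", sample,
                        flags=re.IGNORECASE | re.DOTALL)

# Inline modifiers apply only to the group they are written in.
with_inline = re.findall(r"(?i:PROC SQL;)((?s:.*?))(?i:RUN;|QUIT;)", sample)

same = with_flags == with_inline
print(same)
```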

You can get a more efficient match by matching the keywords and then matching all the lines that do not start with quit; or RUN;, which prevents the unnecessary backtracking caused by .*?

If you want the keywords included in the match, you can omit the capturing groups.

You could use the re.IGNORECASE flag to get a case-insensitive match, and use re.MULTILINE because the pattern contains the anchor ^, which should assert the start of a line rather than only the start of the string.

^PROC SQL;.*\n(?:(?!RUN;|QUIT;).*\n)*(?:RUN|QUIT);
  • ^ Start of line
  • PROC SQL; Match literally
  • .*\n Match 0+ times any char except a newline, then match a newline (or use \r?\n)
  • (?: Non capturing group
    • (?!RUN;|QUIT;) Assert that what is directly to the right is not RUN; or QUIT;
    • .*\n Match 0+ times any char except a newline, then match a newline
  • )* Close group and repeat 0+ times
  • (?:RUN|QUIT); Match either RUN; or QUIT;

Regex demo | Python demo

For example:

import re

with open('lan sample text file1.txt') as file:
    text = file.read()
    # No capturing groups, so findall returns the whole blocks, keywords included.
    regex = re.compile(r'^PROC SQL;.*\n(?:(?!RUN;|QUIT;).*\n)*(?:RUN|QUIT);', re.MULTILINE | re.IGNORECASE)
    k = regex.findall(text)
    print(k)
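Besides avoiding backtracking, the line-based pattern only accepts an ending keyword at the start of a line. The sketch below contrasts it with a lazy .*? pattern on a made-up sample in which run; appears inside a quoted string (the sample and variable names are assumptions for illustration):

```python
import re

# Hypothetical sample: "run;" occurs mid-line, inside a quoted string.
sample = "proc sql;\nselect 'run;' as x;\nquit;\n"

# Line-based pattern: the block only ends at a line starting with RUN; or QUIT;
line_based = re.findall(
    r'^PROC SQL;.*\n(?:(?!RUN;|QUIT;).*\n)*(?:RUN|QUIT);',
    sample,
    flags=re.MULTILINE | re.IGNORECASE,
)

# Lazy pattern: the block ends at the first RUN; or QUIT; anywhere in the text.
lazy = re.findall(
    r'PROC SQL;.*?(?:RUN;|QUIT;)',
    sample,
    flags=re.DOTALL | re.IGNORECASE,
)

print(line_based)
print(lazy)
```

On this sample the lazy pattern stops at the quoted run;, while the line-based pattern continues to the quit; line.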
