
How can I extract text from a file using regex?

Hi, I'm looking for a way to extract part of a text file with Python using a regex. Here is my code:

    import re

    texfile = open("texte.txt", "r")
    for line in texfile:
        if re.match("^text(.*)", line):
            print(line, end="")

I'm searching for the text that follows the word "text", up to the end of the paragraph or until it reaches whitespace, but my code returns only the words that follow "text" on a single line.

For example:

bla bla hhhhhhhh text bla blajjjjjjjjjjjjjjjjjjjjj
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
ffff

Must return:

bla blajjjjjjjjjjjjjjjjjjjjj
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
ffff

Thank you. I tried all your code samples, but none of them works the way I want. To keep it simple, I now want to extract the text that follows a certain "text" until it reaches a blank line:

          text
    sssssssssssssssss
     ssssssss
    kkkk
    lllmmm

    kkkk

   ;must return 
    sssssssssssssssss
    ssssssss
    kkkk
    lllmmm
    ;because of the blank line

If you want to detect a part of a file that extends over several lines, and the file isn't gigantic, examining one line at a time is not a particularly good method, because it limits the power of regexes. When the file can be read entirely into RAM, it is better to analyze it with a regex that explores the text as a single whole.
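As a minimal sketch of the whole-text approach applied to the "until a blank line" requirement from the question: the pattern and the helper name `extract_blocks` below are illustrative assumptions, not code from the original answer, and the `sample` string stands in for the contents of texte.txt. Note that the bare keyword would also match inside longer words such as "context".

```python
import re

# Hypothetical sample standing in for the contents of texte.txt.
sample = "intro\n      text\nsssssssssssssssss\n ssssssss\nkkkk\nlllmmm\n\nkkkk\n"

# After the keyword, lazily capture everything up to (not including) the first
# blank line or the end of the string. re.DOTALL lets '.' cross newlines.
pattern = re.compile(r"text[ \t]*\n?(.*?)(?=\n[ \t]*\n|\Z)", re.DOTALL)

def extract_blocks(content):
    """Return the text following each occurrence of 'text' up to a blank line."""
    return [m.group(1) for m in pattern.finditer(content)]

blocks = extract_blocks(sample)
print(blocks)
```

With a real file, `content` would come from something like `open("texte.txt").read()`, provided the file fits in memory.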

Note that '^' means "beginning of the string" if the flag re.MULTILINE isn't used; if this flag is used, it also matches at the beginning of each line, that is, right after every newline.
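A small illustration of that difference, using a made-up two-line string (the variable names are only for this sketch):

```python
import re

sample = "first line\ntext on the second line\n"

# Without re.MULTILINE, '^' only matches at the very start of the string.
without_flag = re.findall(r"^text", sample)
# With re.MULTILINE, '^' also matches right after each newline.
with_flag = re.findall(r"^text", sample, re.MULTILINE)

print(without_flag)  # []
print(with_flag)     # ['text']
```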

By the way, if you use match(), you don't need to add "^" at the start of the pattern, since match() only tries to match at the very beginning of the string.
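This is probably also why the code in the question finds nothing on lines where "text" is not the first word. A short sketch with a made-up line, comparing match() with search():

```python
import re

line = "bla bla hhhhhhhh text bla bla"

# match() only succeeds if the pattern matches at position 0 ...
at_start = re.match(r"text(.*)", line)
# ... whereas search() scans the string for the first match anywhere.
anywhere = re.search(r"text(.*)", line)

print(at_start)           # None
print(anywhere.group(1))  # ' bla bla'
```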

So, here's a way to analyze the whole text as you seem to want it (I use splitlines(True) to obtain a list of lines that keep their line endings, as the lines read from a file would):

import re

ss = """   first line
    bli bli hhhhhhhh TEXT bla blajjjjjjjjj
hhhhhhhh  VVVVV
ZZZZZZ
    tttt
bolo bolo TEXTrumunu and badad
yyyyyyyyyyyyyyyy
kkkkkkkkkkk
jjjjjjjjjjjjjjj
   nnnn    uytr
      poiurrr
ahahahah bobobo
  ppppp TEXT aaaabbbbb cccccg    
      kmsms
TEXT fedex redex bidex
pududadi
A

no-whitespace-before-that
   hhrhezipo"""

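# After TEXT: the rest of the line without trailing spaces or '\r', then
# optionally the following lines up to the first space character.
regx = re.compile('TEXT *(.+(?<! )(?<!\r)(?:\n[^ ]+(?<!\n))?)')

for fnd in regx.findall(ss):
    print('\n'.join(map(repr, fnd.splitlines(True))))
    print('---------------------------------')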

Result:

'bla blajjjjjjjjj\n'
'hhhhhhhh'
---------------------------------
'rumunu and badad\n'
'yyyyyyyyyyyyyyyy\n'
'kkkkkkkkkkk\n'
'jjjjjjjjjjjjjjj'
---------------------------------
'aaaabbbbb cccccg'
---------------------------------
'fedex redex bidex\n'
'pududadi\n'
'A\n'
'\n'
'no-whitespace-before-that'
---------------------------------


If the file is gigantic and can't be loaded into RAM in a single chunk, you can do:

import re

ss = """   first line
    bli bli hhhhhhhh TEXT bla blajjjjjjjjj
hhhhhhhh  VVVVV
ZZZZZZ
    tttt
bolo bolo TEXTrumunu and badad
yyyyyyyyyyyyyyyy
kkkkkkkkkkk
jjjjjjjjjjjjjjj
   nnnn    uytr
      poiurrr
ahahahah bobobo
  ppppp TEXT aaaabbbbb cccccg    
      kmsms
TEXT fedex redex bidex
pududadi
A

no-whitespace-before-that
   hhrhezipo"""

rigx = re.compile('TEXT *(.+\n?)')
li = []  # lines collected for the current match
for line in ss.splitlines(True):
    mat = rigx.search(line)
    if mat:
        # Start collecting with whatever follows TEXT on this line.
        li.append(mat.group(1))
    elif ' ' in line and li:
        # A line containing a space ends the current match.
        if not line.startswith(' '):
            li.append(line.split(' ')[0])
        li[-1] = li[-1].rstrip(' \r\n')
        print('\n'.join(map(repr, li)))
        print('=====================')
        li = []
    elif li:
        li.append(line)

This code gives the same result as the previous one. As you can see, it's less simple, because very big files are more awkward to handle.
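For the simplified requirement in the question (everything after the keyword until a blank line), the line-by-line idea can be sketched as a generator. The function name `blocks_after_keyword` and the use of `io.StringIO` as a stand-in for the real file are assumptions made for this illustration, and the sketch strips surrounding whitespace from each collected line.

```python
import io

def blocks_after_keyword(lines, keyword="text"):
    """Yield the lines that follow `keyword` up to the next blank line."""
    block = None  # None means "not currently collecting"
    for line in lines:
        if block is None:
            _, found, after = line.partition(keyword)
            if found:
                # Keep what follows the keyword on the same line, if anything.
                block = [after.strip()] if after.strip() else []
        elif line.strip():
            block.append(line.strip())
        else:
            # A blank line closes the current block.
            yield block
            block = None
    if block is not None:
        yield block  # the input ended without a blank line

# io.StringIO stands in for open("texte.txt"); both yield one line at a time.
fake_file = io.StringIO("      text\nsssssssssssssssss\n ssssssss\nkkkk\nlllmmm\n\nkkkk\n")
result = list(blocks_after_keyword(fake_file))
print(result)
```

Because only the current block is held in memory, this should cope with files that are too large to read in one go.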

This worked for me in Python 3:

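import re

with open("texte.txt", "r") as texfile:
    for line in texfile:
        # Lazily capture everything before the first "text" on the line.
        x = re.search("(.*?)(text)", line)
        if x:
            print(x.group(1))
        else:
            # No "text" on this line: print the line unchanged.
            print(line)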

If you're not forced to use regexes, you could use this:

Load the file as a list:

with open("texte.txt", "r") as fileInput:
    listLines = fileInput.readlines()

Get the index of each line that contains your keyword. If the keyword occurs more than once you might not get the expected result, but that's an easy fix:

listIndex = [i for i, item in enumerate(listLines) if "text" in item]

These are the non-blank lines that follow your keyword, obtained by slicing the list:

lines = [line.rstrip("\n") for line in listLines[listIndex[0] + 1:] if line.strip()]

You might also want to get any text that follows your keyword on the same line:

lineMatched = listLines[listIndex[0]].split("text")[1].strip()

And print the result:

print("\n".join([lineMatched] + lines if lineMatched else lines))
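One possible version of the "easy fix" for several occurrences of the keyword, which also stops each result at the first blank line as the question asks. The function name `text_after_each_keyword` and the sample list are hypothetical; treat this as a sketch under those assumptions rather than the answer's own code.

```python
def text_after_each_keyword(list_lines, keyword="text"):
    """For every line containing `keyword`, collect what follows up to a blank line."""
    results = []
    for index, line in enumerate(list_lines):
        if keyword not in line:
            continue
        # Remainder of the keyword line, then the following non-blank lines.
        collected = [line.split(keyword, 1)[1].strip()]
        for following in list_lines[index + 1:]:
            if not following.strip():
                break  # a blank line ends this block
            collected.append(following.strip())
        # Drop the empty entry left when nothing follows the keyword on its line.
        results.append([item for item in collected if item])
    return results

# Hypothetical stand-in for the list returned by fileInput.readlines().
sample_lines = ["text first\n", "aaa\n", "\n", "skip\n", "   text\n", "bbb\n", "ccc\n"]
found = text_after_each_keyword(sample_lines)
print(found)
```

If a collected line itself contains the keyword, it ends up in the earlier block and also starts a block of its own, which may or may not be what you want.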


 