
Python - search for string, copy until end of doc

I am using Python to open EML files one at a time, process them, then move them to another folder. An EML file contains an email message, including the headers.

The first 35-40 lines of the EML are header info, followed by the actual email message. Since the number of header lines varies, I can't just convert my EML file to a list and tell it:

print(emllist[37:])

However, the last line of the headers always begins the same way: with X-OriginalArrivalTime.

My goal is to parse my EML file, find the line that X-OriginalArrivalTime is on, and then split the EML into 2 strings, one containing the header info and one containing the message.

I have been rereading the Python re documentation, but I can't seem to come up with a good way to attack this.

Any help is greatly appreciated.

Thanks,

lou

You can probably avoid regex. How about:

msg = data.split('X-OriginalArrivalTime', 1)[1].split('\n', 1)[1]
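
The one-liner above keeps only the message. If the headers are needed as a second string, a rough sketch along the same lines could use str.partition instead. The sample text and the variable names below are made up for illustration, and the sketch assumes the marker does not appear earlier in the file:

```python
# Hypothetical sample standing in for the contents of an EML file.
data = (
    "From: lou@example.com\n"
    "Subject: test\n"
    "X-OriginalArrivalTime: 01 Jan 2010 12:00:00.0000 (UTC)\n"
    "\n"
    "Hello,\n"
    "this is the body.\n"
)

marker = "X-OriginalArrivalTime"

# Split once at the marker, then once more at the end of the marker's line.
before, found, after = data.partition(marker)
if not found:
    raise ValueError("End of header not found")
rest_of_line, newline, message = after.partition("\n")

# Put the marker line back together so the header string is complete.
header = before + found + rest_of_line + newline
```

Because nothing is discarded, header + message gives back the original text.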

The re module is not very good at counting lines. What's more, you probably don't need it just to check how a line starts. The following function takes the filename of the EML file as input and returns a tuple containing two strings: the header and the message.

def process_eml(filename):
    with open(filename) as fp:
        lines = fp.readlines()

    # Find the last header line, which starts with X-OriginalArrivalTime
    for i, line in enumerate(lines):
        if line.startswith("X-OriginalArrivalTime"):
            break
    else:
        raise Exception("End of header not found")

    # readlines() keeps the trailing newlines, so join with an empty string
    header = ''.join(lines[:i+1])
    message = ''.join(lines[i+1:])  # Message starts at i + 1

    return header, message

After running:

match = re.search(r'(.*^X-OriginalArrivalTime[^\n]*\n+)(.*)$',
                  open('foo.eml').read(),
                  re.DOTALL | re.MULTILINE)

match.group(1) should contain the headers and match.group(2) the email message's body. The re.DOTALL flag causes . to match newlines.

I am not sure if it works with EML files, but Python has a module (email) for working with email files.
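
As a rough illustration of that idea, the sketch below parses a message with the standard library's email package. The sample text is invented, and the header string is rebuilt from the parsed name/value pairs, so it may not be byte-for-byte identical to the original header block:

```python
import email

# Hypothetical sample standing in for the contents of an EML file.
raw = (
    "From: lou@example.com\n"
    "Subject: test\n"
    "X-OriginalArrivalTime: 01 Jan 2010 12:00:00.0000 (UTC)\n"
    "\n"
    "Hello,\n"
    "this is the body.\n"
)

msg = email.message_from_string(raw)

# Headers are available as (name, value) pairs, in their original order.
header = "".join("%s: %s\n" % (name, value) for name, value in msg.items())

# For a simple, non-multipart message get_payload() returns the body as a string.
message = msg.get_payload()
```

To read from disk instead, email.message_from_file() accepts an open file object. Multipart messages would need extra handling, since get_payload() then returns a list of parts.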

If that does not work, isn't it true that the headers are separated from the message by an empty line?

lines = fp.readlines()
header_end = lines.index('\n') # first empty line, I think it is the end of header.
headers = lines[:header_end]
message = lines[header_end:]
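
The snippet above leaves fp undefined and returns lists of lines. A self-contained sketch of the same idea, producing two strings, might look like the following. The sample text is invented, io.StringIO stands in for an open EML file, and treating a file with no blank line as "all headers" is an assumption:

```python
import io

# Hypothetical sample standing in for the contents of an EML file.
raw = (
    "From: lou@example.com\n"
    "Subject: test\n"
    "X-OriginalArrivalTime: 01 Jan 2010 12:00:00.0000 (UTC)\n"
    "\n"
    "Hello,\n"
    "this is the body.\n"
)

fp = io.StringIO(raw)  # stands in for open("some.eml")
lines = fp.readlines()

try:
    header_end = lines.index("\n")  # first empty line
except ValueError:
    header_end = len(lines)  # assumption: no blank line means headers only

# readlines() keeps the newlines, so join with an empty string.
headers = "".join(lines[:header_end])
message = "".join(lines[header_end + 1:])  # skip the blank separator line
```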

It's true that it would be nice to avoid a regex. But since you want to put the headers and the message into two different strings, I don't think split(), which discards the separator it splits on, or partition(), which returns a tuple of 3 elements, fits the purpose, so a regex is still worthwhile:

import re

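# Raw strings keep the backslashes intact for the re module.
# Note: the sample below uses 'X-OriginalArrivalTime.'; real headers use a colon.
regx = re.compile(r'(.+?X-OriginalArrivalTime\.[^\n]*[\r\n]+)'
                  r'(.+)\Z',
                  re.DOTALL)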

ss = ('blahblah blah\r\n'
      'totoro tootrototo \r\n'
      'erteruuty\r\n'
      'X-OriginalArrivalTime. 12h58 Huntington Point\r\n'
      'body begins here\r\n'
      'sdkjhqsdlfkghqdlfghqdfg\r\n'
      '23135468796786876544\r\n'
      'ldkshfqskdjf end of file\r\n')


header,message = regx.match(ss).groups()

print('header :\n' + repr(header))
print()
print('message :\n' + repr(message))

Result:

header :
'blahblah blah\r\ntotoro tootrototo \r\nerteruuty\r\nX-OriginalArrivalTime. 12h58 Huntington Point\r\n'

message :
'body begins here\r\nsdkjhqsdlfkghqdlfghqdfg\r\n23135468796786876544\r\nldkshfqskdjf end of file\r\n'
