Python multiline regex + multi entries reading a file in one go
//Last modified: Sat, Apr 16, 2011 09:55:04 AM
//Codeset: ISO-8859-1
fileInfo "version" "20x64";
createNode newnode -n "a_SET";
addAttr -ci true -k true -sn "connections" -ln "connections" -dt "string";
setAttr -l on -k off ".tx";
setAttr -l on -k off ".ty";
setAttr -l on -k off ".sz";
setAttr -l on -k on ".test1" -type "string" "blabla";
setAttr -l on -k on ".test2" -type "string" "blablabla";
createNode newnode -n "b_SET";
addAttr -ci true -k true -sn "connections" -ln "connections" -dt "string";
setAttr -l on -k off ".tx";
setAttr -l on -k off ".ty";
setAttr -l on -k off ".sz";
setAttr -l on -k on ".test1" -type "string" "hmm";
setAttr -l on -k on ".test2" -type "string" "ehmehm";
In Python, I need to read the newnode names, for instance "a_SET" and "b_SET", together with their corresponding attribute values, so that I end up with {"a_SET": {"test1": "blabla", "test2": "blablabla"}} and the same for b_SET. There could be an unknown number of sets, such as c_SET, d_SET, etc.
I've tried looping through the lines and matching each one:
import re
sets = []
for line in fileopened:
    # the sample file uses "createNode newnode", so match that literal
    setmatch = re.match(r'^(createNode newnode -n ")(.*)(_SET)(.*)', line)
    if setmatch:
        sets.append(setmatch.group(2))
As soon as I find a match here, I would loop through the following lines to collect the attributes (test1, test2) for that set, until I reach a new set (for instance c_SET) or EOF.
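The loop described above could be sketched roughly as follows. This is only a minimal illustration: `parse_sets`, `NODE_RE` and `ATTR_RE` are hypothetical names, and the two patterns assume that every attribute of interest looks like the `-type "string"` lines in the sample file.

```python
import re

# Hypothetical patterns, modelled on the sample file (assumptions, not a spec).
NODE_RE = re.compile(r'^\s*createNode newnode -n "(\w+_SET)";')
ATTR_RE = re.compile(r'^\s*setAttr .*?"\.(\w+)" -type "string" "([^"]*)";')


def parse_sets(lines):
    """Collect {set_name: {attr: value}} in a single pass over the lines."""
    result = {}
    current = None  # dict of the set being read, None before the first header
    for line in lines:
        node = NODE_RE.match(line)
        if node:
            # a new createNode header starts a new set
            current = result.setdefault(node.group(1), {})
            continue
        attr = ATTR_RE.match(line)
        if attr and current is not None:
            current[attr.group(1)] = attr.group(2)
    return result


sample = '''createNode newnode -n "a_SET";
    setAttr -l on -k off ".tx";
    setAttr -l on -k on ".test1" -type "string" "blabla";
    setAttr -l on -k on ".test2" -type "string" "blablabla";
createNode newnode -n "b_SET";
    setAttr -l on -k on ".test1" -type "string" "hmm";
'''
parsed = parse_sets(sample.splitlines())
print(parsed)
```

Lines such as `setAttr -l on -k off ".tx";` are skipped because they carry no `-type "string"` value.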
What would be the best way to grab all of that information in one go with re.MULTILINE?
You can use a regexp positive lookahead to split the groups:
(yourGroupSeparator)(.*?)(?=yourGroupSeparator|\Z)
In your example:
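import re
with open("e:/temp/test.txt") as f:
    lines = f.read()
# re.DOTALL lets "." cross newlines; the lookahead ends each block at the next createNode or at the end of the text
matches = re.findall(r'createNode newnode -n ("\w+_SET");(.*?)(?=createNode|\Z)', lines, re.DOTALL)
for m in matches:
    print("%s: %s" % (m[0], m[1]))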
"""
Result:
>>>
"a_SET":
addAttr -ci true -k true -sn "connections" -ln "connections" -dt "string";
setAttr -l on -k off ".tx";
setAttr -l on -k off ".ty";
setAttr -l on -k off ".sz";
setAttr -l on -k on ".test1" -type "string" "blabla";
setAttr -l on -k on ".test2" -type "string" "blablabla";
"b_SET":
addAttr -ci true -k true -sn "connections" -ln "connections" -dt "string";
setAttr -l on -k off ".tx";
setAttr -l on -k off ".ty";
setAttr -l on -k off ".sz";
setAttr -l on -k on ".test1" -type "string" "hmm";
setAttr -l on -k on ".test2" -type "string" "ehmehm";
"""
If you want the results in a dict, you can use:
result = {}
for k, v in matches:
    result[k] = v  # or maybe v.split() or v.split(";")
after the findall call.
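To make the lookahead split and the dict step concrete, here is a small self-contained sketch. The inline `text` is a shortened stand-in for the sample file, and splitting each block on ";" is just one of the options hinted at in the comment above, not the only sensible choice.

```python
import re

# Shortened stand-in for the sample file (an assumption for the demo).
text = '''fileInfo "version" "20x64";
createNode newnode -n "a_SET";
setAttr -l on -k on ".test1" -type "string" "blabla";
createNode newnode -n "b_SET";
setAttr -l on -k on ".test1" -type "string" "hmm";
setAttr -l on -k on ".test2" -type "string" "ehmehm";
'''

# Each match is (set name, raw block text up to the next createNode or end of text).
blocks = re.findall(r'createNode newnode -n "(\w+_SET)";(.*?)(?=createNode|\Z)',
                    text, re.DOTALL)

result = {}
for name, body in blocks:
    # one entry per statement; drop the empty fragment left after the final ";"
    result[name] = [stmt.strip() for stmt in body.split(";") if stmt.strip()]

print(result)
```

Note that the capture group sits inside the quotes here, so the keys come out as a_SET rather than "a_SET".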
I got this:
import re
filename = 'tr.txt'
with open(filename, 'r') as f:
    ch = f.read()
# pat: one match per createNode block; pit: first and last quoted string on a setAttr line
pat = re.compile(r'createNode newnode -n ("\w+?_SET");(.*?)(?=createNode|\Z)', re.DOTALL)
pit = re.compile(r'^ *setAttr.+?("[^"\n]+").+("[^"\n]+");(?:\n|\Z)', re.MULTILINE)
dic = dict((mat.group(1), dict(pit.findall(mat.group(2)))) for mat in pat.finditer(ch))
print(dic)
Result:
{'"b_SET"': {'".test2"': '"ehmehm"', '".test1"': '"hmm"'}, '"a_SET"': {'".test2"': '"blablabla"', '".test1"': '"blabla"'}}
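The keys and values above still carry their double quotes and the leading dot. If the plain {"a_SET": {"test1": "blabla", ...}} shape from the question is wanted, one possible post-processing step is sketched below; `unquote` is a hypothetical helper, and the `dic` literal is simply copied from the result above.

```python
# Output copied from the result shown above.
dic = {'"b_SET"': {'".test2"': '"ehmehm"', '".test1"': '"hmm"'},
       '"a_SET"': {'".test2"': '"blablabla"', '".test1"': '"blabla"'}}


def unquote(token, drop_dot=False):
    """Remove the surrounding double quotes and, optionally, one leading dot."""
    token = token.strip('"')
    if drop_dot and token.startswith('.'):
        token = token[1:]
    return token


clean = {
    unquote(name): {unquote(attr, drop_dot=True): unquote(value)
                    for attr, value in attrs.items()}
    for name, attrs in dic.items()
}
print(clean)
```

Moving the capturing parentheses inside the quotes in the patterns would avoid the clean-up altogether; the helper is only convenient when the patterns are left untouched.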
Question: what if the strings themselves must contain the character '"'? How would it be represented?
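The thread does not say how an embedded quote is written in this file format. Purely as an assumption, if it were escaped with a backslash (\"), a common regex idiom for such strings is "any character except a quote or backslash, or a backslash followed by any character". A minimal sketch, with `QUOTED` as a hypothetical name:

```python
import re

# Assumption: an embedded quote is written as \" inside the string.
# [^"\\] = any ordinary character; \\. = a backslash plus the escaped character.
QUOTED = re.compile(r'"((?:[^"\\]|\\.)*)"')

line = r'setAttr -l on -k on ".test1" -type "string" "say \"hi\" now";'
strings = QUOTED.findall(line)
print(strings)
```

If the real format escapes quotes differently, the character class would need to change accordingly.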
I had some difficulty finding the solution because I did not take the easy route.
Here's a new pattern that catches the FIRST string "..." and the LAST string "..." found after a " setAttr" and before the next " setAttr". So several "..." strings can be present, not only 3. You didn't ask for this, but I thought it might turn out to be needed.
I also managed to allow newlines inside the strings to be caught ("....\n......"), not only around them. For that, I had to come up with something that was new to me: (?:\n(?! *setAttr)|[^"\n]), which means: every character except '"' and ordinary newlines \n is accepted, plus only those newlines that are not followed by a line matching ' *setAttr'.
As for (?:\n(?! *setAttr)|.), it means: newlines that are not followed by a line matching ' *setAttr', plus every non-newline character.
Hence any other special character, such as a tab, is automatically accepted in the matches.
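The behaviour of that building block can be checked in isolation. The sketch below wraps it in quotes and tries it on two tiny made-up inputs; `CHUNK` and `value` are hypothetical names used only for this demo.

```python
import re

# One character of a quoted value: either a newline that is NOT followed by
# a "setAttr" line, or any character other than '"' and newline.
CHUNK = r'(?:\n(?! *setAttr)|[^"\n])'
value = re.compile('"(' + CHUNK + '+)"')

# A value that continues on the next line is captured whole.
text = ('setAttr ".test1" "hmm bl\n'
        '   abla";\n'
        'setAttr ".test2" "ehm";\n')
found = value.findall(text)
print(found)

# A quoted span that would run into a following setAttr line is rejected.
rejected = value.search('"abc\nsetAttr x"')
print(rejected)
```

The second case is what keeps a match from leaking out of one setAttr statement into the next.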
ch = '''//Last modified: Sat, Apr 16, 2011 09:55:04 AM
//Codeset: ISO-8859-1
fileInfo "version" "20x64";
createNode newnode -n "a_SET";
addAttr -ci true -k true -sn "connections" -ln "connections" -dt "string";
setAttr -l on -k off ".tx";
setAttr -l on -k off ".ty";
setAttr -l on -k off ".sz";
setAttr -l on -k on ".test1" -type "string" "blabla";
setAttr -l on -k on ".test2" -type "string" "blablabla";
createNode newnode -n "b_SET";
addAttr -ci true -k true -sn "connections" -ln "connections" -dt "string";
setAttr -l on -k off ".tx";
setAttr -l on -k off ".ty";
setAttr -l on -k off ".sz";
setAttr -l on -k on ".test1" -type "string" (
"hmm bl
abla\tbla" );
setAttr -l on -k on ".tes\nt\t2" -type "string" "ehm\tehm";
setAttr -l on -k on ".test3" -type "string" "too
much" "pff" """ "feretini" "gol\nolo";
'''
import re
# pat: one match per createNode block
pat = re.compile(r'createNode newnode -n ("\w+?_SET");(.*?)(?=createNode|\Z)', re.DOTALL)
# pot: first and last quoted string of each setAttr statement, newlines allowed inside
pot = re.compile(r'^ *setAttr.+?'
                 r'"((?:\n(?! *setAttr)|[^"\n])+)"'
                 r'(?:\n(?! *setAttr)|.)+'
                 r'"((?:\n(?! *setAttr)|[^"\n])+)"'
                 r'.*;(?:\n|\Z)', re.MULTILINE)
dic = dict((mat.group(1), dict(pot.findall(mat.group(2)))) for mat in pat.finditer(ch))
for x in dic:
    print('%s\n%s\n' % (x, dic[x]))
Result:
"b_SET"
{'.test3': 'gol\nolo', '.test1': 'hmm bl\n abla\tbla', '.tes\nt\t2': 'ehm\tehm'}
"a_SET"
{'.test1': 'blabla', '.test2': 'blablabla'}
Another possible option:
createNode newnode -n "b_SET";
addAttr -ci true -k true -sn "connections" -ln "connections" -dt "string";
setAttr -l on -k off ".tx";
setAttr -l on -k off ".ty";
setAttr -l on -k off ".sz";
setAttr -l on -k on ".test1" -type "string" (
"hmm blablabla" );
setAttr -l on -k on ".test2" -type "string" "ehmehm";
As you can see, the ".test1" value is now split across lines by a \n line separator. How would you get around that using eyquem's approach?
pit = re.compile(r'^ *setAttr.+?("[^"\n]+").+("[^"\n]+");(?:\n|\Z)', re.MULTILINE)
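As far as I can tell, the single-line pit pattern cannot match a value that starts on the next line, because "." never crosses a newline there. The later pot pattern from eyquem's answer, which allows newlines that are not followed by a setAttr line, appears to handle this input. A quick check on the data from this follow-up (only the inline `ch` string is made up, copied from the lines above):

```python
import re

ch = '''createNode newnode -n "b_SET";
addAttr -ci true -k true -sn "connections" -ln "connections" -dt "string";
setAttr -l on -k off ".tx";
setAttr -l on -k off ".ty";
setAttr -l on -k off ".sz";
setAttr -l on -k on ".test1" -type "string" (
"hmm blablabla" );
setAttr -l on -k on ".test2" -type "string" "ehmehm";
'''

pat = re.compile(r'createNode newnode -n ("\w+?_SET");(.*?)(?=createNode|\Z)', re.DOTALL)
pot = re.compile(r'^ *setAttr.+?'
                 r'"((?:\n(?! *setAttr)|[^"\n])+)"'
                 r'(?:\n(?! *setAttr)|.)+'
                 r'"((?:\n(?! *setAttr)|[^"\n])+)"'
                 r'.*;(?:\n|\Z)', re.MULTILINE)

# first quoted string becomes the key, last quoted string the value
dic = {mat.group(1): dict(pot.findall(mat.group(2))) for mat in pat.finditer(ch)}
print(dic)
```

The ".tx", ".ty" and ".sz" statements are skipped because they contain only one quoted string.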