Split string with caret character in Python

I have a huge text file; each line looks like this:

Some sort of general menu^a_sub_menu_title^^pagNumber

Notice that the first part (the "general menu") contains spaces, the second part (a subtitle) has its words separated by the "_" character, and the last part is a number (a page number). I want to split each line into its 3 (obvious) parts, because I want to create some sort of directory in Python.

I was trying with the re module, but since the caret character has a special meaning there, I couldn't figure out how to do it.

Could someone please help me?

>>> "Some sort of general menu^a_sub_menu_title^^pagNumber".split("^")
['Some sort of general menu', 'a_sub_menu_title', '', 'pagNumber']

If you only want three pieces, you can filter out the empty string with a list comprehension:

line = 'Some sort of general menu^a_sub_menu_title^^pagNumber'
pieces = [x for x in line.split('^') if x]
# pieces => ['Some sort of general menu', 'a_sub_menu_title', 'pagNumber']

What you need to do is "escape" the special character, like r'\^'. But in this case there is something better than regular expressions:

line = "Some sort of general menu^a_sub_menu_title^^pagNumber"
(menu, title, dummy, page) = line.split('^')

That gives you the components in a much more straightforward fashion.
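Since the goal is some sort of directory, here is a minimal sketch of how the unpacking above could feed a dictionary keyed by menu name. The function name build_directory, the sample lines, and the page numbers 12 and 15 are made up for illustration; it also assumes every non-blank line has exactly four caret-separated fields, otherwise the unpacking raises ValueError.

```python
from collections import defaultdict


def build_directory(lines):
    """Group (title, page) pairs under their menu name (hypothetical helper)."""
    directory = defaultdict(list)
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Four fields: the third is the empty string between the two carets.
        menu, title, _unused, page = line.split("^")
        directory[menu].append((title, page))
    return dict(directory)


# Made-up sample data in the format described in the question.
sample = [
    "Some sort of general menu^a_sub_menu_title^^12\n",
    "Some sort of general menu^another_title^^15\n",
]
directory = build_directory(sample)
```

For a real file you would pass the open file object instead of the sample list, since iterating over a file yields one line at a time.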

You could just say string.split("^") to divide the string into a list containing each segment. The only caveat is that two consecutive carets produce an empty string in the result. You could protect against this either by collapsing consecutive carets into a single one before splitting, or by detecting empty strings in the resulting list.
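Both ways of guarding against the empty string can be sketched as follows, using the sample line from the question. The variable names collapsed and filtered are only illustrative.

```python
import re

line = "Some sort of general menu^a_sub_menu_title^^pagNumber"

# Option 1: collapse each run of carets into a single caret, then split.
collapsed = re.sub(r"\^+", "^", line).split("^")

# Option 2: split first, then drop the empty strings from the result.
filtered = [part for part in line.split("^") if part]
```

Both give the same three pieces for this line. Note that either approach also discards fields that are legitimately empty, so they only fit if an empty field never carries meaning in your file.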

For more information see http://docs.python.org/library/stdtypes.html

Does that help?

It's also possible that your file uses a format that's compatible with the csv module. You could look into that, especially if the format allows quoting, because then line.split would break. If the format doesn't use quoting and it's just delimiters and text, line.split is probably the best option.
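If the file does turn out to be csv-compatible, a rough sketch with the csv module might look like this. The second sample line, with a quoted field that contains a caret, is invented to show the quoting case; whether your real file uses double quotes this way is an assumption.

```python
import csv
import io

# Made-up sample: the second line has a caret inside a double-quoted field.
data = io.StringIO(
    'Some sort of general menu^a_sub_menu_title^^pagNumber\n'
    '"Menu with ^ inside"^another_title^^7\n'
)

# delimiter="^" tells the reader to split on carets; quoted fields stay whole.
rows = list(csv.reader(data, delimiter="^"))
```

Here the quoted caret stays inside the first field of the second row, whereas a plain line.split("^") would cut that field in two. The empty field between the two carets is still present in each row.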

Also, for the re module, any special character can be escaped with \ , like r'\^' . Before jumping to re, I'd suggest 1) learning how to write regular expressions, and 2) first looking for a solution to your problem that doesn't need them: «Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.»
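For completeness, this is roughly what the escaped caret looks like in practice with re.split. Adding + after it, so that a run of carets counts as one separator, is my own choice here and not something the question requires.

```python
import re

line = "Some sort of general menu^a_sub_menu_title^^pagNumber"

# r"\^" matches a literal caret; the + treats "^^" as a single separator.
parts = re.split(r"\^+", line)

# re.escape produces the escaped form for you, handy for arbitrary delimiters.
escaped = re.escape("^")
same_parts = re.split(escaped + "+", line)
```

Both calls return the three pieces without the empty string in between, but for a fixed single-character delimiter like this the plain str.split version is still easier to read.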
