
Slice substrings from long string to a list in python

In Python I have a long string (from which I removed all line breaks), for example:

stringA = 'abcdefkey:12/eas9ghijklkey:43/e3mnop'

What I want to do is search this string for all occurrences of "key:" and then extract the values that follow each "key:". A further complication is that I don't know how long these values are (e.g. key:12/eas9 and key:43/e3). All I know is that each value ends with a digit, whereas the rest of the string does not contain any digits.

This is why my idea was to slice from the index of each key: plus, say, the next 10 characters (e.g. key:12/eas9g) and then work backward, dropping trailing characters until isdigit() is true for the last one.
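The idea described above could be sketched roughly as follows. The function name `extract_values` and the `window` parameter are hypothetical, and the sketch assumes the window never reaches into the digits of the next value:

```python
def extract_values(text, key="key:", window=10):
    """Collect the value after each `key`, trimmed back to its last digit."""
    values = []
    start = text.find(key)
    while start != -1:
        value_start = start + len(key)
        # Take a fixed-size window of characters after the key.
        chunk = text[value_start:value_start + window]
        # Work backward: drop trailing characters until the last one is a digit.
        while chunk and not chunk[-1].isdigit():
            chunk = chunk[:-1]
        values.append(chunk)
        start = text.find(key, value_start)
    return values


stringA = 'abcdefkey:12/eas9ghijklkey:43/e3mnop'
print(extract_values(stringA))  # ['12/eas9', '43/e3']
```

If the window is wide enough to overlap the next key, the trimmed chunk will include part of that next value, so the window size matters here.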

I tried to split my initial string (which did contain line breaks):

import re

stringA_split = re.split("\n", stringA)

for linex in stringA_split:
    index_start = linex.rfind("key:")
    index_end = index_start + 8
    print(linex[index_start:index_end])
    # then work backward

However, relying on the line breaks does not help in any way, as they are meaningless artifacts of a pdf-to-txt conversion.

How would I solve this (e.g. starting by getting all indices of "key:" and slicing the string into a list)?
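As a starting point for the "get all indices, then slice" step, here is a minimal sketch using `re.finditer`. The variable names `indices`, `bounds` and `segments` are only illustrative:

```python
import re

stringA = 'abcdefkey:12/eas9ghijklkey:43/e3mnop'

# Start index of every literal "key:" occurrence.
indices = [m.start() for m in re.finditer(re.escape("key:"), stringA)]

# Slice from each occurrence up to the next one (or to the end of the string).
bounds = indices + [len(stringA)]
segments = [stringA[a:b] for a, b in zip(bounds, bounds[1:])]

print(indices)   # [6, 23]
print(segments)  # ['key:12/eas9ghijkl', 'key:43/e3mnop']
```

Each segment still carries the trailing non-digit text, so the values would need to be trimmed back to their last digit afterwards.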

I'm not 100% sure I understand your definition of a value, but I think this will get you what you described:

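import re

stringA = 'abcdefkey:12/eas9ghijklkey:43/e3mnop'
# Every piece after the first starts right after a "key:" marker.
for v in stringA.split('key:'):
    # Digits, a slash, then anything up to the last digit in the piece.
    ma = re.match(r'(\d+/.*\d+)', v)
    if ma:
        print(ma.group(1))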

This returns:

12/eas9
43/e3

You can apply a single regular expression that collects all the keys into a list of tuples:

import re
# Raw string so the backslashes reach the regex engine unchanged.
p = re.compile(r'key:(\d+)/([^\d]+\d)')
ret = p.findall(stringA)

After execution, you have:

>>> ret
[('12', 'eas9'), ('43', 'e3')]
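
If the values are wanted as single strings rather than tuples, the two captured parts can be joined back together. A small sketch, where `pairs` and `values` are illustrative names and `\D` is used as shorthand for `[^\d]`:

```python
import re

stringA = 'abcdefkey:12/eas9ghijklkey:43/e3mnop'
pattern = re.compile(r'key:(\d+)/(\D+\d)')

# findall returns one tuple per match, one element per capturing group.
pairs = pattern.findall(stringA)
# Re-insert the slash that sat between the two groups.
values = ['/'.join(pair) for pair in pairs]

print(pairs)   # [('12', 'eas9'), ('43', 'e3')]
print(values)  # ['12/eas9', '43/e3']
```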

>>> import re
>>> re.findall(r'key:(\d+[^\d]+[\d])', stringA)
['12/eas9', '43/e3']

\d+ # One or more digits.

[^\d]+ # One or more characters that are not digits (equivalent to \D+).

[\d] # The final digit.

(\d+[^\d]+[\d]) # The group capturing the expression above.

'key:(\d+[^\d]+[\d])' # 'key:' followed by the group expression.
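The breakdown above can also be kept next to the pattern itself by compiling it with `re.VERBOSE`, which ignores whitespace and allows `#` comments inside the pattern. This is just one possible way to write it:

```python
import re

stringA = 'abcdefkey:12/eas9ghijklkey:43/e3mnop'

pattern = re.compile(r"""
    key:        # the literal marker
    (           # start of the captured value
      \d+       # one or more digits
      \D+       # one or more non-digits (same as [^\d]+)
      \d        # the final digit
    )           # end of the captured value
""", re.VERBOSE)

result = pattern.findall(stringA)
print(result)  # ['12/eas9', '43/e3']
```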

If you want key: in your result:

>>> re.findall(r'(key:\d+[^\d]+[\d])', stringA)
['key:12/eas9', 'key:43/e3']

Edit: a better answer was posted above. I misread the original question when proposing the reversal here, which really wasn't necessary. Good luck!

If you know that the format is always key:, what if you reversed the string and matched a regex for :yek? You'd isolate all the keys and could then reverse them back:

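import re
# \w is alphanumeric, you may want to add some symbols
# (as written it stops at '/', so it yields ['key:43', 'key:12'])
rex = re.compile(r"\w*:yek")

word = 'abcdefkey:12/eas9ghijklkey:43/e3mnop'
# Search the reversed string, then flip each match back.
matches = re.findall(rex, word[::-1])
matches = [match[::-1] for match in matches]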
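
For completeness, here is one way the reversed search might be tightened so it captures the whole value. It assumes, as the question states, that digits occur only inside the values; the pattern is an illustration rather than part of the original answer:

```python
import re

word = 'abcdefkey:12/eas9ghijklkey:43/e3mnop'

# On the reversed string a value reads: digit, non-digits, digits, then ":yek".
rex = re.compile(r"\d\D*\d+:yek")

# Flip each match back to its original orientation.
matches = [m[::-1] for m in rex.findall(word[::-1])]
# Matches were found right to left, so restore left-to-right order.
matches.reverse()

print(matches)  # ['key:12/eas9', 'key:43/e3']
```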

