How to extract the common part before a particular symbol and find a particular word

If I have a dictionary:

mydict = {"g18_84pp_2A_MVP1_GoodiesT0-HKJ-DFG_MIX-CMVP1_Y1000-MIX.txt" : 0,
          "g18_84pp_2A_MVP2_GoodiesT0-HKJ-DFG_MIX-CMVP2_Y1000-MIX.txt" : 1,
          "g18_84pp_2A_MVP3_GoodiesT0-HKJ-DFG_MIX-CMVP3_Y1000-MIX.txt" : 2,
          "g18_84pp_2A_MVP4_GoodiesT0-HKJ-DFG_MIX-CMVP4_Y1000-MIX.txt" : 3,
          "g18_84pp_2A_MVP5_GoodiesT0-HKJ-DFG_MIX-CMVP5_Y1000-MIX.txt" : 4,
          "g18_84pp_2A_MVP6_GoodiesT0-HKJ-DFG_MIX-CMVP6_Y1000-MIX.txt" : 5,
          "h18_84pp_3A_MVP1_GoodiesT1-HKJ-DFG-CMVP1_Y1000-FIX.txt" : 6,
          "g18_84pp_2A_MVP7_GoodiesT0-HKJ-DFG_MIX-CMVP7_Y1000-MIX.txt" : 7,
          "h18_84pp_3A_MVP2_GoodiesT1-HKJ-DFG-CMVP2_Y1000-FIX.txt" : 8,
          "h18_84pp_3A_MVP3_GoodiesT1-HKJ-DFG-CMVP3_Y1000-FIX.txt" : 9,
          "p18_84pp_2B_MVP1_GoodiesT2-HKJ-DFG-CMVP3_Y1000-FIX.txt" : 10}
  1. I want to extract the common part before the first - , for example g18_84pp_2A_MVP_GoodiesT0 (with the number after MVP dropped).

  2. I also want to append _MIX to g18_84pp_2A_MVP_GoodiesT0 when the word MIX is found in the first group. Assuming I can classify the keys of mydict into groups depending on whether they contain MIX or FIX, the final output dictionary would be:

OutputNameDict= {"g18_84pp_2A_MVP_GoodiesT0_MIX" : 0,
                  "h18_84pp_3A_MVP_GoodiesT1_FIX" : 1,
                  "p18_84pp_2B_MVP_FIX": 2}

Is there a function I can use to find the common part? How do I pick out the text before or after a particular symbol such as - , and how do I find particular words such as MIX or FIX?

You can use split to get the common part:

s = "g18_84pp_2A_MVP1_GoodiesT0-HKJ-DFG_MIX-CMVP1_Y1000-MIX.txt"
n = s.split('-')[0]

In fact, split will give you a list of each token delimited by '-' , so s.split('-') yields:

['g18_84pp_2A_MVP1_GoodiesT0', 'HKJ', 'DFG_MIX', 'CMVP1_Y1000', 'MIX.txt']
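If only the part before the first dash is needed, the other tokens do not have to be built at all. A minimal sketch, using the `maxsplit` argument of `str.split` and the built-in `str.partition` (the variable names are only illustrative):

```python
s = "g18_84pp_2A_MVP1_GoodiesT0-HKJ-DFG_MIX-CMVP1_Y1000-MIX.txt"

# maxsplit=1 stops after the first dash, so at most two pieces are produced.
head = s.split('-', 1)[0]

# partition returns a 3-tuple (before, separator, after) around the first dash.
before, sep, after = s.partition('-')

print(head)    # g18_84pp_2A_MVP1_GoodiesT0
print(before)  # g18_84pp_2A_MVP1_GoodiesT0
print(after)   # HKJ-DFG_MIX-CMVP1_Y1000-MIX.txt
```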

To see if MIX or FIX is in a string, you can use in :

if 'MIX' in s:
    print("then MIX is in the string s")
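One caveat: `in` matches anywhere in the string, so a name that has DFG_MIX in the middle but ends in FIX.txt would test true for both words. If the classification should depend only on the final token, a sketch like the following may be safer. It assumes the tag is always the last dash-separated token before the extension; `suffix_tag` is a hypothetical helper name.

```python
import os

def suffix_tag(filename):
    """Return the last dash-separated token without the file extension."""
    stem = os.path.splitext(filename)[0]   # drop ".txt"
    return stem.rsplit('-', 1)[-1]         # text after the last dash

name = "h18_84pp_3A_MVP1_GoodiesT1-HKJ-DFG_MIX-CMVP1_Y1000-FIX.txt"
print('MIX' in name)     # True, because of DFG_MIX in the middle
print(suffix_tag(name))  # FIX
```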

If you want to get rid of the numbers after 'MVP' , you can use the re module:

import re
s = 'g18_84pp_2A_MVP1_GoodiesT0'
s = re.sub('MVP[0-9]*','MVP',s)

Here is a sample function to get a list of the common parts:

def foo(mydict):
    return [re.sub('MVP[0-9]*', 'MVP', k.split('-')[0]) for k in mydict]
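The list returned by foo still contains duplicates and carries no MIX/FIX suffix. One way to get from there to a numbered dictionary like the one in the question is sketched below. It assumes the tag is the last dash-separated token before the extension and that indices should follow sorted order; `build_output` is a hypothetical name, and `sample` is a shortened copy of the question's data.

```python
import re

def build_output(mydict):
    """Map each distinct '<common part>_<tag>' label to a running index."""
    labels = set()
    for key in mydict:
        common = re.sub(r'MVP[0-9]*', 'MVP', key.split('-')[0])
        tag = key.rsplit('.', 1)[0].rsplit('-', 1)[-1]  # last token, e.g. MIX or FIX
        labels.add(common + '_' + tag)
    # Number the distinct labels in sorted order.
    return {label: i for i, label in enumerate(sorted(labels))}

sample = {
    "g18_84pp_2A_MVP1_GoodiesT0-HKJ-DFG_MIX-CMVP1_Y1000-MIX.txt": 0,
    "g18_84pp_2A_MVP2_GoodiesT0-HKJ-DFG_MIX-CMVP2_Y1000-MIX.txt": 1,
    "h18_84pp_3A_MVP1_GoodiesT1-HKJ-DFG-CMVP1_Y1000-FIX.txt": 6,
    "p18_84pp_2B_MVP1_GoodiesT2-HKJ-DFG-CMVP3_Y1000-FIX.txt": 10,
}
print(build_output(sample))
```

Note that this sketch keeps the GoodiesT part for every group, so the third label comes out as p18_84pp_2B_MVP_GoodiesT2_FIX rather than the p18_84pp_2B_MVP_FIX shown in the question.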

You can use the index() method to find the position of the first dash, and then slice the string from that point. For instance:

mydict = {"g18_84pp_2A_MVP1_GoodiesT0-HKJ-DFG_MIX-CMVP1_Y1000-MIX.txt" : 0,
          "g18_84pp_2A_MVP2_GoodiesT0-HKJ-DFG_MIX-CMVP2_Y1000-MIX.txt" : 1,
          "g18_84pp_2A_MVP3_GoodiesT0-HKJ-DFG_MIX-CMVP3_Y1000-MIX.txt" : 2,
          "g18_84pp_2A_MVP4_GoodiesT0-HKJ-DFG_MIX-CMVP4_Y1000-MIX.txt" : 3,
          "g18_84pp_2A_MVP5_GoodiesT0-HKJ-DFG_MIX-CMVP5_Y1000-MIX.txt" : 4,
          "g18_84pp_2A_MVP6_GoodiesT0-HKJ-DFG_MIX-CMVP6_Y1000-MIX.txt" : 5,
          "g18_84pp_2A_MVP7_GoodiesT0-HKJ-DFG_MIX-CMVP7_Y1000-MIX.txt" : 6,
          "h18_84pp_3A_MVP1_GoodiesT1-HKJ-DFG_MIX-CMVP1_Y1000-FIX.txt" : 7,
          "h18_84pp_3A_MVP2_GoodiesT1-HKJ-DFG_MIX-CMVP2_Y1000-FIX.txt" : 8,
          "h18_84pp_3A_MVP2_GoodiesT1-HKJ-DFG_MIX-CMVP3_Y1000-FIX.txt" : 9}

for value in sorted(mydict):
    index = value.index('-')
    extracted = value[index+1:-4]  # Everything after the first - , with .txt removed from the end
    print(extracted[-3:])  # The last 3 letters of the string

This prints the following:

MIX
MIX
MIX
MIX
MIX
MIX
MIX
FIX
FIX
FIX

Then if statements can be used to do what you would like.

If you want to extract just the common part:

index = value.index('-')
extracted = value[:index] # Will get g18_84pp_2A_MVP1_GoodiesT0
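Keep in mind that index() raises a ValueError when the string contains no dash. If some names might lack one, str.find returns -1 instead and can be checked explicitly. A small sketch, where `before_first_dash` is a hypothetical helper and the second file name is made up for illustration:

```python
def before_first_dash(name):
    """Return the text before the first '-', or the whole name if there is no dash."""
    pos = name.find('-')  # find returns -1 when the character is absent
    return name if pos == -1 else name[:pos]

print(before_first_dash("g18_84pp_2A_MVP1_GoodiesT0-HKJ-DFG_MIX-CMVP1_Y1000-MIX.txt"))
print(before_first_dash("no_dash_here.txt"))
```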

Then figure out which ending to use. If you know that every key in mydict ends in MIX.txt or FIX.txt, you can do this:

for value in sorted(mydict):
    ending = value[-7:-4]  # The 3 letters just before .txt
    index = value.index('-')
    extracted = value[:index]
    print("%s_%s" % (extracted, ending))

which prints:

g18_84pp_2A_MVP1_GoodiesT0_MIX
g18_84pp_2A_MVP2_GoodiesT0_MIX
g18_84pp_2A_MVP3_GoodiesT0_MIX
g18_84pp_2A_MVP4_GoodiesT0_MIX
g18_84pp_2A_MVP5_GoodiesT0_MIX
g18_84pp_2A_MVP6_GoodiesT0_MIX
g18_84pp_2A_MVP7_GoodiesT0_MIX
h18_84pp_3A_MVP1_GoodiesT1_FIX
h18_84pp_3A_MVP2_GoodiesT1_FIX
h18_84pp_3A_MVP2_GoodiesT1_FIX

Then you can add each result to the output dictionary.
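Because two keys can produce the same label (the two h18 MVP2 entries do), the dictionary step needs to avoid overwriting an index that was already assigned. A minimal sketch using dict.setdefault, assuming indices should follow the order of first appearance; `extracted_dict` is an illustrative name and `mydict` here is a shortened copy of the data above:

```python
mydict = {
    "h18_84pp_3A_MVP1_GoodiesT1-HKJ-DFG_MIX-CMVP1_Y1000-FIX.txt": 7,
    "h18_84pp_3A_MVP2_GoodiesT1-HKJ-DFG_MIX-CMVP2_Y1000-FIX.txt": 8,
    "h18_84pp_3A_MVP2_GoodiesT1-HKJ-DFG_MIX-CMVP3_Y1000-FIX.txt": 9,
}

extracted_dict = {}
for value in sorted(mydict):
    label = "%s_%s" % (value[:value.index('-')], value[-7:-4])
    # setdefault stores the index only if the label is new,
    # so a repeated label keeps the index it was given first.
    extracted_dict.setdefault(label, len(extracted_dict))

print(extracted_dict)
```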

Thanks for the answers. My complete code is as follows. Any suggestions for optimizing it?

import re

mydict = {"g18_84pp_2A_MVP1_GoodiesT0-HKJ-DFG_MIX-CMVP1_Y1000-MIX.txt" : 0,
          "g18_84pp_2A_MVP2_GoodiesT0-HKJ-DFG_MIX-CMVP2_Y1000-MIX.txt" : 1,
          "g18_84pp_2A_MVP3_GoodiesT0-HKJ-DFG_MIX-CMVP3_Y1000-MIX.txt" : 2,
          "g18_84pp_2A_MVP4_GoodiesT0-HKJ-DFG_MIX-CMVP4_Y1000-MIX.txt" : 3,
          "g18_84pp_2A_MVP5_GoodiesT0-HKJ-DFG_MIX-CMVP5_Y1000-MIX.txt" : 4,
          "g18_84pp_2A_MVP6_GoodiesT0-HKJ-DFG_MIX-CMVP6_Y1000-MIX.txt" : 5,    
          "h18_84pp_3A_MVP1_GoodiesT1-HKJ-DFG-CMVP1_Y1000-FIX.txt" : 6,    
          "g18_84pp_2A_MVP7_GoodiesT0-HKJ-DFG_MIX-CMVP7_Y1000-MIX.txt" : 7,
          "h18_84pp_3A_MVP2_GoodiesT1-HKJ-DFG-CMVP2_Y1000-FIX.txt" : 8,
          "h18_84pp_3A_MVP3_GoodiesT1-HKJ-DFG-CMVP3_Y1000-FIX.txt" : 9,
          "p18_84pp_2B_MVP1_GoodiesT2-HKJ-DFG-CMVP3_Y1000-FIX.txt" : 10}

ExtractDict = {}
start = 0
for stringList in sorted(mydict):  # iterating over a dict yields its keys
    stringList = stringList.split('.')[0]  # drop the .txt extension
    underscore = stringList.split('_')
    Area = re.split('[0-9]+', underscore[3])[0]  # MVP etc.
    CaseNameString = underscore[0] + "_" + underscore[1] + "_" + underscore[2] + "_" + Area  # g18_84pp_2A_MVP etc.
    postfix = stringList.split('-')[4]  # MIX or FIX
    Newstring = CaseNameString + "_" + postfix
    ExtractDict[Newstring] = start
    start += 1

# Renumber the distinct names consecutively in sorted order
startagain = 0
OutputNameDict = {}
for OutputNameList in sorted(ExtractDict):
    OutputNameDict[OutputNameList] = startagain
    startagain += 1

# OutputNameDict = {'h18_84pp_3A_MVP_FIX': 1, 'p18_84pp_2B_MVP_FIX': 2, 'g18_84pp_2A_MVP_MIX': 0}
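One possible tightening, offered as a sketch rather than a definitive answer: a single regular expression can capture both the prefix and the tag, and a set plus enumerate replaces the two counting loops. The pattern assumes every key looks like `<a>_<b>_<c>_<letters><digits>_...-<TAG>.txt`; keys that do not match are skipped. `PATTERN` and `build_output_names` are hypothetical names, and `sample` is a shortened copy of the data.

```python
import re

# group(1): first three fields plus the letters of the fourth (e.g. g18_84pp_2A_MVP)
# group(2): last dash-separated token before .txt (e.g. MIX)
PATTERN = re.compile(r'^([^_]+_[^_]+_[^_]+_[A-Za-z]+)\d*_.*-([^-.]+)\.txt$')

def build_output_names(mydict):
    labels = set()
    for key in mydict:
        match = PATTERN.match(key)
        if match is None:
            continue  # key does not follow the assumed layout
        labels.add(match.group(1) + '_' + match.group(2))
    # Number the distinct labels in sorted order.
    return {label: i for i, label in enumerate(sorted(labels))}

sample = {
    "g18_84pp_2A_MVP1_GoodiesT0-HKJ-DFG_MIX-CMVP1_Y1000-MIX.txt": 0,
    "h18_84pp_3A_MVP1_GoodiesT1-HKJ-DFG-CMVP1_Y1000-FIX.txt": 6,
    "g18_84pp_2A_MVP7_GoodiesT0-HKJ-DFG_MIX-CMVP7_Y1000-MIX.txt": 7,
    "p18_84pp_2B_MVP1_GoodiesT2-HKJ-DFG-CMVP3_Y1000-FIX.txt": 10,
}
print(build_output_names(sample))
```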
