
Pandas series: Delete everything before a certain character, if "everything" changes every time

I know questions like this one have been asked many times, but I haven't found one that answers mine (maybe I overlooked something, but I did my best ;) ). Here's the problem: I have a pandas Series like this:

ingredssplit
    0                          MAGERMILCH 65%
    1                                  Wasser
    2            Keks gemahlen 6% (WEIZENMEHL
    3                   Traubensaftkonzentrat
    4                                 Palmöl)
    5                                  Stärke
    6                              Maiskeimöl
    7                                  Zucker
    8     Antioxidationsmittel Ascorbinsäure¹
    9                  Thiamin (Vitamin B1). 
    dtype: object

Now I want to remove everything in line 2 before the bracket. But this part changes every time: sometimes it's "Keks gemahlen 6%", sometimes it's something completely different. The only constant in line 2 before the "(" is the "%", so another possibility would be "abc de% (". How can I remove that part? My research led me to regular expressions and then to this code:

for line in ingredssplit:
    print(re.sub())  # which arguments go here?

But now I don't know how to fill in the arguments so that everything before "(WEIZENMEHL" is removed. Maybe there's another way? Also, how do I remove the superscript 1 after "Ascorbinsäure"? Thanks guys, have a nice weekend!

Try str.extract:

# Raw string so that "\(" reaches the regex engine as a literal "(".
df.loc[[2], 'ingredssplit'] = (
    df.loc[[2], 'ingredssplit'].str.extract(r'.*\((.*)')[0]
)
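
The answer above only touches row 2. As a minimal sketch of why, and of how the same pattern could be applied to the whole Series, the example below rebuilds the sample data from the question; the names `s`, `extracted` and `result` are assumptions made for illustration. Rows that contain no "(" do not match and come back as NaN, so they are filled with their original values:

```python
import pandas as pd

# Sample data copied from the question.
s = pd.Series(
    [
        "MAGERMILCH 65%",
        "Wasser",
        "Keks gemahlen 6% (WEIZENMEHL",
        "Traubensaftkonzentrat",
        "Palmöl)",
        "Stärke",
        "Maiskeimöl",
        "Zucker",
        "Antioxidationsmittel Ascorbinsäure¹",
        "Thiamin (Vitamin B1). ",
    ],
    name="ingredssplit",
)

# Capture everything after the last "(". Rows without "(" yield NaN.
extracted = s.str.extract(r".*\((.*)")[0]

# Fall back to the original text where the pattern did not match.
result = extracted.fillna(s)
print(result)
```

Trailing characters such as ")" and "." are left in place here, so a separate cleanup step is still needed.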

Okay, I found a solution. Thanks jcaliz, the '.*\(' part was golden. This is what I did:

    import re

    # Drop everything up to and including the last "(" on each line.
    item1 = []
    for line in ingredssplit:
        line = re.sub(r'.*\(', '', line)
        item1.append(line)

    def remove_punc(string):
        # Characters to strip; "%" is not in the list, so it is kept.
        punc = '''!()-[]{};:'"\\,<>./?@#$^&*_~'''
        for ele in string:
            if ele in punc:
                string = string.replace(ele, "")
        return string

    lis = [remove_punc(i) for i in item1]
    lis = list(filter(None, lis))   # drop empty strings
    lis = [i.strip() for i in lis]  # trim leading and trailing whitespace
    lis

This gives me a list:

['MAGERMILCH 65%',
 'Wasser',
 'WEIZENMEHL',
 'Traubensaftkonzentrat',
 'Palmöl',
 'Stärke',
 'Maiskeimöl',
 'Zucker',
 'Antioxidationsmittel Ascorbinsäure¹',
 'Vitamin B1']

which I can easily transform into a DataFrame, e.g.:

lis=pd.DataFrame(lis)
lis
                 0

0   MAGERMILCH 65%
1   Wasser
2   WEIZENMEHL
3   Traubensaftkonzentrat
4   Palmöl
5   Stärke
6   Maiskeimöl
7   Zucker
8   Antioxidationsmittel Ascorbinsäure¹
9   Vitamin B1
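
The question also asked how to remove the superscript ¹ after "Ascorbinsäure", and the result above still contains it. A minimal sketch, assuming only superscript digits need to go; the helper name `strip_superscripts` and the shortened sample list are hypothetical:

```python
import re

# Superscript digits: ¹ ² ³ live in Latin-1, ⁰ and ⁴ to ⁹ in U+2070..U+2079.
SUPERSCRIPTS = re.compile(r"[\u00b9\u00b2\u00b3\u2070\u2074-\u2079]")

def strip_superscripts(text):
    """Return text with all superscript digits removed."""
    return SUPERSCRIPTS.sub("", text)

# Shortened sample taken from the list above.
lis = ["Zucker", "Antioxidationsmittel Ascorbinsäure¹", "Vitamin B1"]
cleaned = [strip_superscripts(i) for i in lis]
print(cleaned)
```

If the data is still a pandas Series and only "¹" occurs, `ingredssplit.str.replace('¹', '', regex=False)` should achieve the same thing without a loop.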

Thanks, people! :)
