
How to count the number of rows following a particular value in the same column of a DataFrame

Suppose I have the following DataFrame:

tempDic = {0: {0: 'class([1,0,0,0],"Small-molecule metabolism ").', 1: 'function(tb186,[1,1,1,0],\'bglS\',"beta-glucosidase").', 2: 'function(tb2202,[1,1,1,0],\'cbhK\',"carbohydrate kinase").', 3: 'function(tb727,[1,1,1,0],\'fucA\',"L-fuculose phosphate aldolase").', 4: 'function(tb1731,[1,1,1,0],\'gabD1\',"succinate-semialdehyde dehydrogenase").', 5: 'function(tb234,[1,1,1,0],\'gabD2\',"succinate-semialdehyde dehydrogenase").', 6: 'class([1,1,0,0],"Degradation ").', 7: 'function(tb501,[1,1,1,0],\'galE1\',"UDP-glucose 4-epimerase").', 8: 'function(tb536,[1,1,1,0],\'galE2\',"UDP-glucose 4-epimerase").', 9: 'function(tb620,[1,1,1,0],\'galK\',"galactokinase").', 10: 'function(tb619,[1,1,1,0],\'galT\',"galactose-1-phosphate uridylyltransferase C-term").', 11: 'class([1,1,1,0],"Carbon compounds ").', 12: 'function(tb186,[1,1,1,0],\'bglS\',"beta-glucosidase").', 13: 'function(tb2202,[1,1,1,0],\'cbhK\',"carbohydrate kinase").', 14: 'function(tb727,[1,1,1,0],\'fucA\',"L-fuculose phosphate aldolase").', 15: 'function(tb1731,[1,1,1,0],\'gabD1\',"succinate-semialdehyde dehydrogenase").', 16: 'function(tb234,[1,1,1,0],\'gabD2\',"succinate-semialdehyde dehydrogenase").', 17: 'function(tb501,[1,1,1,0],\'galE1\',"UDP-glucose 4-epimerase").', 18: 'class([1,1,1,0],"xyz ").'}}

import pandas as pd

df = pd.DataFrame(tempDic)
print(df)

 

                                               0
0      class([1,0,0,0],"Small-molecule metabolism ").
1   function(tb186,[1,1,1,0],'bglS',"beta-glucosid...
2   function(tb2202,[1,1,1,0],'cbhK',"carbohydrate...
3   function(tb727,[1,1,1,0],'fucA',"L-fuculose ph...
4   function(tb1731,[1,1,1,0],'gabD1',"succinate-s...
5   function(tb234,[1,1,1,0],'gabD2',"succinate-se...
6                    class([1,1,0,0],"Degradation ").
7   function(tb501,[1,1,1,0],'galE1',"UDP-glucose ...
8   function(tb536,[1,1,1,0],'galE2',"UDP-glucose ...
9   function(tb620,[1,1,1,0],'galK',"galactokinase").
10  function(tb619,[1,1,1,0],'galT',"galactose-1-p...
11              class([1,1,1,0],"Carbon compounds ").
12  function(tb186,[1,1,1,0],'bglS',"beta-glucosid...
13  function(tb2202,[1,1,1,0],'cbhK',"carbohydrate...
14  function(tb727,[1,1,1,0],'fucA',"L-fuculose ph...
15  function(tb1731,[1,1,1,0],'gabD1',"succinate-s...
16  function(tb234,[1,1,1,0],'gabD2',"succinate-se...
17  function(tb501,[1,1,1,0],'galE1',"UDP-glucose ...
18                           class([1,1,1,0],"xyz ").

What I need is a strategy that will give me a result like this:

Class                         Count
Small-molecule metabolism       5
Degradation                     4
Carbon compounds                6
xyz                             0

Each row that starts with "class" contains the name of the class in double quotes, for example, "Small-molecule metabolism" in the first row. This row is then followed by rows starting with "function". I just need to count the rows that start with "function" after each class row and put that count next to the class name. A class that is not followed by any "function" rows should be assigned a count of 0, meaning that the class has zero functions.

Use Series.str.startswith to build a mask for the class rows, extract the text between the double quotes with Series.str.extract, forward fill the missing values, and then use GroupBy.size and subtract 1 (to exclude the class row itself):

df['Class'] = df.loc[df[0].str.startswith('class'), 0].str.extract('"(.+)"', expand=False)

df['Class'] = df['Class'].ffill()

s = df.groupby('Class', sort=False).size().sub(1).reset_index(name='Count')
print (s)
                        Class  Count
0  Small-molecule metabolism       5
1                Degradation       4
2           Carbon compounds       6
3                        xyz       0

Details of the steps:

print(df.loc[df[0].str.startswith('class'), 0])
0     class([1,0,0,0],"Small-molecule metabolism ").
6                   class([1,1,0,0],"Degradation ").
11             class([1,1,1,0],"Carbon compounds ").
18                          class([1,1,1,0],"xyz ").
Name: 0, dtype: object

print (df.loc[df[0].str.startswith('class'), 0].str.extract('"(.+)"', expand=False))    
0     Small-molecule metabolism 
6                   Degradation 
11             Carbon compounds 
18                          xyz 
Name: 0, dtype: object

df['Class'] = df.loc[df[0].str.startswith('class'), 0].str.extract('"(.+)"', expand=False)
print (df['Class'])
0     Small-molecule metabolism 
1                            NaN
2                            NaN
3                            NaN
4                            NaN
5                            NaN
6                   Degradation 
7                            NaN
8                            NaN
9                            NaN
10                           NaN
11             Carbon compounds 
12                           NaN
13                           NaN
14                           NaN
15                           NaN
16                           NaN
17                           NaN
18                          xyz 
Name: Class, dtype: object

df['Class'] = df['Class'].ffill()
print (df['Class'])
0     Small-molecule metabolism 
1     Small-molecule metabolism 
2     Small-molecule metabolism 
3     Small-molecule metabolism 
4     Small-molecule metabolism 
5     Small-molecule metabolism 
6                   Degradation 
7                   Degradation 
8                   Degradation 
9                   Degradation 
10                  Degradation 
11             Carbon compounds 
12             Carbon compounds 
13             Carbon compounds 
14             Carbon compounds 
15             Carbon compounds 
16             Carbon compounds 
17             Carbon compounds 
18                          xyz 
Name: Class, dtype: object

print (df.groupby('Class', sort=False).size())
Class
Small-molecule metabolism     6
Degradation                   5
Carbon compounds              7
xyz                           1
dtype: int64

df1 = df.groupby('Class', sort=False).size().sub(1).reset_index(name='Count')
print (df1)
                        Class  Count
0  Small-molecule metabolism       5
1                Degradation       4
2           Carbon compounds       6
3                        xyz       0
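
The grouping above keys on the extracted class name, so two class rows with the same name would be merged into one group, and the names keep the trailing space that sits inside the quotes. A minimal sketch of a variant, assuming those two points matter for your data: it groups on a running block number built with cumsum and counts only the rows that start with "function". The rows below are shortened from the question, and the repeated "Degradation" row is made up purely for illustration.

```python
import pandas as pd

# Rows shortened from the question; the second "Degradation" class row is
# a hypothetical addition to show that repeated names stay separate.
rows = [
    'class([1,0,0,0],"Small-molecule metabolism ").',
    'function(tb186,[1,1,1,0],\'bglS\',"beta-glucosidase").',
    'function(tb2202,[1,1,1,0],\'cbhK\',"carbohydrate kinase").',
    'class([1,1,0,0],"Degradation ").',
    'function(tb501,[1,1,1,0],\'galE1\',"UDP-glucose 4-epimerase").',
    'class([1,1,1,0],"xyz ").',
    'class([1,1,0,0],"Degradation ").',
    'function(tb620,[1,1,1,0],\'galK\',"galactokinase").',
]
df = pd.DataFrame({0: rows})

is_class = df[0].str.startswith('class')
is_function = df[0].str.startswith('function')

# Running block number: increases by 1 at every class row.
block = is_class.cumsum()

# One name per class row, with surrounding whitespace stripped.
names = df.loc[is_class, 0].str.extract(r'"([^"]*)"', expand=False).str.strip()

# Summing the boolean mask per block counts only the function rows.
# Block 0 (rows before the first class row, if any) is dropped.
counts = is_function.groupby(block).sum().drop(0, errors='ignore')

result = pd.DataFrame({'Class': names.to_numpy(), 'Count': counts.to_numpy()})
print(result)
```

On these rows the counts come out as 2, 1, 0 and 1, with the two "Degradation" blocks reported separately.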

Here you go:

import re
import pandas as pd
tempDic = {0: {0: 'class([1,0,0,0],"Small-molecule metabolism ").', 1: 'function(tb186,[1,1,1,0],\'bglS\',"beta-glucosidase").', 2: 'function(tb2202,[1,1,1,0],\'cbhK\',"carbohydrate kinase").', 3: 'function(tb727,[1,1,1,0],\'fucA\',"L-fuculose phosphate aldolase").', 4: 'function(tb1731,[1,1,1,0],\'gabD1\',"succinate-semialdehyde dehydrogenase").', 5: 'function(tb234,[1,1,1,0],\'gabD2\',"succinate-semialdehyde dehydrogenase").', 6: 'class([1,1,0,0],"Degradation ").', 7: 'function(tb501,[1,1,1,0],\'galE1\',"UDP-glucose 4-epimerase").', 8: 'function(tb536,[1,1,1,0],\'galE2\',"UDP-glucose 4-epimerase").', 9: 'function(tb620,[1,1,1,0],\'galK\',"galactokinase").', 10: 'function(tb619,[1,1,1,0],\'galT\',"galactose-1-phosphate uridylyltransferase C-term").', 11: 'class([1,1,1,0],"Carbon compounds ").', 12: 'function(tb186,[1,1,1,0],\'bglS\',"beta-glucosidase").', 13: 'function(tb2202,[1,1,1,0],\'cbhK\',"carbohydrate kinase").', 14: 'function(tb727,[1,1,1,0],\'fucA\',"L-fuculose phosphate aldolase").', 15: 'function(tb1731,[1,1,1,0],\'gabD1\',"succinate-semialdehyde dehydrogenase").', 16: 'function(tb234,[1,1,1,0],\'gabD2\',"succinate-semialdehyde dehydrogenase").', 17: 'function(tb501,[1,1,1,0],\'galE1\',"UDP-glucose 4-epimerase").', 18: 'class([1,1,1,0],"xyz ").'}}
df = pd.DataFrame(tempDic)
df.columns = ['text']
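is_class = df['text'].str.startswith('class', na=False)  # True for rows starting with 'class'
df['class'] = df.loc[is_class, 'text'].apply(lambda x: re.findall(r"['\"](.*?)['\"]", x)[0])  # extract the value between the quotes on class rows
df['class'] = df['class'].ffill()  # carry each class name down to the rows that follow it
print((~is_class).groupby(df['class'], sort=False).sum())  # count the non-class rows under each class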

Try this:

import re
from itertools import compress
import pandas as pd
tempDic = {0: {0: 'class([1,0,0,0],"Small-molecule metabolism ").', 1: 'function(tb186,[1,1,1,0],\'bglS\',"beta-glucosidase").', 2: 'function(tb2202,[1,1,1,0],\'cbhK\',"carbohydrate kinase").', 3: 'function(tb727,[1,1,1,0],\'fucA\',"L-fuculose phosphate aldolase").', 4: 'function(tb1731,[1,1,1,0],\'gabD1\',"succinate-semialdehyde dehydrogenase").', 5: 'function(tb234,[1,1,1,0],\'gabD2\',"succinate-semialdehyde dehydrogenase").', 6: 'class([1,1,0,0],"Degradation ").', 7: 'function(tb501,[1,1,1,0],\'galE1\',"UDP-glucose 4-epimerase").', 8: 'function(tb536,[1,1,1,0],\'galE2\',"UDP-glucose 4-epimerase").', 9: 'function(tb620,[1,1,1,0],\'galK\',"galactokinase").', 10: 'function(tb619,[1,1,1,0],\'galT\',"galactose-1-phosphate uridylyltransferase C-term").', 11: 'class([1,1,1,0],"Carbon compounds ").', 12: 'function(tb186,[1,1,1,0],\'bglS\',"beta-glucosidase").', 13: 'function(tb2202,[1,1,1,0],\'cbhK\',"carbohydrate kinase").', 14: 'function(tb727,[1,1,1,0],\'fucA\',"L-fuculose phosphate aldolase").', 15: 'function(tb1731,[1,1,1,0],\'gabD1\',"succinate-semialdehyde dehydrogenase").', 16: 'function(tb234,[1,1,1,0],\'gabD2\',"succinate-semialdehyde dehydrogenase").', 17: 'function(tb501,[1,1,1,0],\'galE1\',"UDP-glucose 4-epimerase").', 18: 'class([1,1,1,0],"xyz ").'}}

df = pd.DataFrame(tempDic)


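is_class = df[0].str.startswith('class')  # True for rows starting with 'class'

df_final = pd.DataFrame()

# Class name: the text between the double quotes on each class row
df_final['class'] = [re.findall('"([^"]*)"', s)[0] for s in compress(df[0], is_class)]
# Gap between consecutive class rows minus 1 (assumes the default 0..n-1 index);
# len(df) is appended so that rows after the last class row are counted as well
df_final['count'] = pd.Series(list(df[is_class].index) + [len(df)]).diff().dropna().reset_index(drop=True).sub(1).astype(int)

Output:

print(df_final)
                        class  count
0  Small-molecule metabolism       5
1                Degradation       4
2           Carbon compounds       6
3                        xyz       0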
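
As a cross-check on the pandas results, the same counting rule can be written as a single pass in plain Python. This is only a sketch: the helper name count_functions is made up for illustration, and it assumes every class row carries its name in double quotes. The sample rows are shortened from the question.

```python
import re

def count_functions(lines):
    """Return [(class_name, number_of_function_rows), ...] in order of appearance."""
    result = []
    for line in lines:
        if line.startswith('class'):
            # Start a new class entry; the name is the first double-quoted text.
            match = re.search(r'"([^"]*)"', line)
            name = match.group(1).strip() if match else ''
            result.append([name, 0])
        elif line.startswith('function') and result:
            # Credit the function row to the most recent class row.
            result[-1][1] += 1
    return [tuple(item) for item in result]

lines = [
    'class([1,0,0,0],"Small-molecule metabolism ").',
    'function(tb186,[1,1,1,0],\'bglS\',"beta-glucosidase").',
    'function(tb2202,[1,1,1,0],\'cbhK\',"carbohydrate kinase").',
    'class([1,1,1,0],"xyz ").',
]
print(count_functions(lines))
```

Passing the DataFrame column directly, as in count_functions(df[0]), should also work, since iterating over a Series yields its values.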
