
Finding specific strings within a column and finding the max corresponding to that string

I was wondering:

1.) How do I find a specific string in a column?
2.) Given that string, how would I find its corresponding max?
3.) How do I count the number of strings for each row in that column?

I have a CSV file called sports.csv:

import pandas as pd
import numpy as np

# load the data into a DataFrame
X = pd.read_csv('sports.csv')

The two columns of interest are the Total and Gym columns:

 Total  Gym
40  Football|Baseball|Hockey|Running|Basketball|Swimming|Cycling|Volleyball|Tennis|Ballet
37  Baseball|Tennis
61  Basketball|Baseball|Ballet
12  Swimming|Ballet|Cycling|Basketball|Volleyball|Hockey|Running|Tennis|Baseball|Football
78  Swimming|Basketball
29  Baseball|Tennis|Ballet|Cycling|Basketball|Football|Volleyball|Swimming
31  Tennis
54  Tennis|Football|Ballet|Cycling|Running|Swimming|Baseball|Basketball|Volleyball
33  Baseball|Hockey|Swimming|Cycling
17  Football|Hockey|Volleyball

Notice that each value in the Gym column holds several sports separated by a pipe (|). I'm trying to find all of the gyms that have Baseball and then pick the one with the max Total. However, I'm only interested in gyms that have at least two other sports, i.e. I wouldn't want to consider:

  Total   Gym
  37    Baseball|Tennis

You can easily do this using pandas (here df is the DataFrame loaded from sports.csv).

First, split each string into a list on the pipe (|) delimiter, then keep only the rows whose list has more than 2 entries, since the criterion is Baseball along with at least two other sports. Rows that fail the check are replaced with an empty string.

In [4]: df['Gym'] = df['Gym'].str.split('|').apply(lambda x: ' '.join(x) if len(x) > 2 else '')

In [5]: df
Out[5]: 
   Total                                                Gym
0     40  Football Baseball Hockey Running Basketball Sw...
1     37                                                   
2     61                         Basketball Baseball Ballet
3     12  Swimming Ballet Cycling Basketball Volleyball ...
4     78                                                   
5     29  Baseball Tennis Ballet Cycling Basketball Foot...
6     31                                                   
7     54  Tennis Football Ballet Cycling Running Swimmin...
8     33                   Baseball Hockey Swimming Cycling
9     17                         Football Hockey Volleyball

Next, use str.contains to keep only the rows whose Gym column contains the string Baseball.

In [6]: df = df.loc[df['Gym'].str.contains('Baseball')]

In [7]: df
Out[7]: 
   Total                                                Gym
0     40  Football Baseball Hockey Running Basketball Sw...
2     61                         Basketball Baseball Ballet
3     12  Swimming Ballet Cycling Basketball Volleyball ...
5     29  Baseball Tennis Ballet Cycling Basketball Foot...
7     54  Tennis Football Ballet Cycling Running Swimmin...
8     33                   Baseball Hockey Swimming Cycling

Compute the number of sports in each remaining row.

In [8]: df['Count'] = df['Gym'].str.split().str.len()

Finally, select the row of the DataFrame with the maximum value in the Total column.

In [9]: df.loc[df['Total'].idxmax()]
Out[9]: 
Total                            61
Gym      Basketball Baseball Ballet
Count                             3
Name: 2, dtype: object
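
The steps above overwrite the Gym column and filter df in place. A non-destructive variant might look like the following minimal sketch. It assumes the data matches the sample in the question (rebuilt inline here so the snippet runs on its own), and the names sports, has_baseball and candidates are only illustrative. Testing membership in the split list matches the exact sport name, whereas str.contains matches any substring.

```python
import pandas as pd

# Sample rows copied from the question; normally these would come from pd.read_csv.
df = pd.DataFrame({
    "Total": [40, 37, 61, 12, 78, 29, 31, 54, 33, 17],
    "Gym": [
        "Football|Baseball|Hockey|Running|Basketball|Swimming|Cycling|Volleyball|Tennis|Ballet",
        "Baseball|Tennis",
        "Basketball|Baseball|Ballet",
        "Swimming|Ballet|Cycling|Basketball|Volleyball|Hockey|Running|Tennis|Baseball|Football",
        "Swimming|Basketball",
        "Baseball|Tennis|Ballet|Cycling|Basketball|Football|Volleyball|Swimming",
        "Tennis",
        "Tennis|Football|Ballet|Cycling|Running|Swimming|Baseball|Basketball|Volleyball",
        "Baseball|Hockey|Swimming|Cycling",
        "Football|Hockey|Volleyball",
    ],
})

sports = df["Gym"].str.split("|")                       # one list of sports per row
df["Count"] = sports.str.len()                          # number of sports per row
has_baseball = sports.apply(lambda s: "Baseball" in s)  # exact name, not a substring

# Baseball plus at least two other sports means at least three entries.
candidates = df[has_baseball & (df["Count"] > 2)]
best = candidates.loc[candidates["Total"].idxmax()]
print(best)
```

Because the original Gym strings are left untouched, the same DataFrame can be reused to query other sports.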

You can do it in one pass as you read the file:

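import csv

with open("sports.csv") as f:
    mx, best = float("-inf"), None
    for row in csv.reader(f, delimiter=" ", skipinitialspace=True):
        # replace the pipe-separated field with one list item per sport
        row[1:] = row[1].split("|")
        # keep rows with Baseball plus at least two other sports and a new max total
        if "Baseball" in row and len(row[1:]) > 2 and int(row[0]) > mx:
            mx = int(row[0])
            best = row
    if best:
        # report the count for the best row, not for the last row read
        print(best, mx, len(best[1:]))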

Which would give you:

(['61', 'Basketball', 'Baseball', 'Ballet'], 61, 3)

Another way without splitting would be to count the pipe chars:

import csv

with open("sports.csv") as f:
    mx, best = float("-inf"), None
    for row in csv.reader(f, delimiter=" ", skipinitialspace=True):
        # more than one pipe means at least three sports in the field
        if "Baseball" in row[1] and row[1].count("|") > 1 and int(row[0]) > mx:
            mx = int(row[0])
            best = row
    if best:
        # report the pipe count for the best row, not for the last row read
        print(best, mx, best[1].count("|"))

Note, though, that this version tests for a substring, so any longer name that contains "Baseball" would also match, rather than only the exact word.
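
To illustrate the difference, here is a small sketch that matches the exact sport name by splitting the field first. The sample text, the made-up "Baseballs" entry, and the helper name best_row with its min_sports parameter are assumptions for the example, not part of the original file.

```python
import csv
import io

# A few rows in the question's layout; "Baseballs" is a made-up entry that a
# substring test would wrongly accept.
sample = """Total Gym
61 Basketball|Baseball|Ballet
90 Baseballs|Tennis|Ballet
37 Baseball|Tennis
"""

def best_row(lines, sport, min_sports=3):
    """Return (total, sports) for the highest total whose sports list
    contains `sport` exactly and has at least `min_sports` entries."""
    best = None
    reader = csv.reader(lines, delimiter=" ", skipinitialspace=True)
    next(reader)  # skip the header row
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        total, sports = int(row[0]), row[1].split("|")
        if sport in sports and len(sports) >= min_sports:
            if best is None or total > best[0]:
                best = (total, sports)
    return best

result = best_row(io.StringIO(sample), "Baseball")
print(result)
```

The row with total 90 is skipped because "Baseballs" is not equal to "Baseball", even though `"Baseball" in "Baseballs|Tennis|Ballet"` is true for the raw string.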

Try this:

df3.loc[df3['Gym'].str.contains('Hockey', na=False) & (df3['Gym'].str.count(r'\|') > 1)].sort_values('Total').tail(1)

 Total                                                Gym
0     40  Football|Baseball|Hockey|Running|Basketball|Sw...


df3.loc[df3['Gym'].str.contains('Baseball', na=False) & (df3['Gym'].str.count(r'\|') > 1)].sort_values('Total').tail(1)

   Total                         Gym
2     61  Basketball|Baseball|Ballet
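
The same query can also be written with indicator columns, which gives exact name matching and per-row counts at once. This is only a sketch: it assumes df3 holds the question's sample data (rebuilt inline here), and best_gym and min_sports are hypothetical names introduced for illustration. It relies on Series.str.get_dummies, which builds one 0/1 column per distinct sport.

```python
import pandas as pd

# Sample data from the question, stored under the name used in this answer.
df3 = pd.DataFrame({
    "Total": [40, 37, 61, 12, 78, 29, 31, 54, 33, 17],
    "Gym": [
        "Football|Baseball|Hockey|Running|Basketball|Swimming|Cycling|Volleyball|Tennis|Ballet",
        "Baseball|Tennis",
        "Basketball|Baseball|Ballet",
        "Swimming|Ballet|Cycling|Basketball|Volleyball|Hockey|Running|Tennis|Baseball|Football",
        "Swimming|Basketball",
        "Baseball|Tennis|Ballet|Cycling|Basketball|Football|Volleyball|Swimming",
        "Tennis",
        "Tennis|Football|Ballet|Cycling|Running|Swimming|Baseball|Basketball|Volleyball",
        "Baseball|Hockey|Swimming|Cycling",
        "Football|Hockey|Volleyball",
    ],
})

def best_gym(frame, sport, min_sports=3):
    """Row with the highest Total among gyms offering `sport` and at least
    `min_sports` sports overall; None when no gym qualifies."""
    flags = frame["Gym"].str.get_dummies(sep="|")  # one 0/1 column per sport
    if sport not in flags.columns:
        return None
    mask = (flags[sport] == 1) & (flags.sum(axis=1) >= min_sports)
    if not mask.any():
        return None
    return frame.loc[frame.loc[mask, "Total"].idxmax()]

print(best_gym(df3, "Hockey"))
print(best_gym(df3, "Baseball"))
```

Summing the indicator columns across each row gives the number of sports per gym, which also covers the third part of the question.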
