I was wondering:
1.) How do I find a specific string in a column?
2.) Given that string, how would I find its corresponding max?
3.) How do I count the number of strings for each row in that column?
I have a CSV file called sports.csv:
import pandas as pd
import numpy as np
#loading the data into data frame
X = pd.read_csv('sports.csv')
The two columns of interest are the Total and Gym columns:
Total Gym
40 Football|Baseball|Hockey|Running|Basketball|Swimming|Cycling|Volleyball|Tennis|Ballet
37 Baseball|Tennis
61 Basketball|Baseball|Ballet
12 Swimming|Ballet|Cycling|Basketball|Volleyball|Hockey|Running|Tennis|Baseball|Football
78 Swimming|Basketball
29 Baseball|Tennis|Ballet|Cycling|Basketball|Football|Volleyball|Swimming
31 Tennis
54 Tennis|Football|Ballet|Cycling|Running|Swimming|Baseball|Basketball|Volleyball
33 Baseball|Hockey|Swimming|Cycling
17 Football|Hockey|Volleyball
Notice that each row of the Gym column holds several sports separated by |. I'm trying to find all of the gyms that have Baseball and then pick the one with the max Total. However, I'm only interested in gyms that have at least two other sports, i.e. I wouldn't want to consider:
Total Gym
37 Baseball|Tennis
You can easily do this using pandas.
First, split each string into a list on the pipe (|) delimiter, then keep only the lists with more than two entries, since the criterion is Baseball plus at least two other sports. Rows that fail this check end up as empty strings.
In [4]: df['Gym'] = df['Gym'].str.split('|').apply(lambda x: ' '.join([i for i in x if len(x)>2]))
In [5]: df
Out[5]:
Total Gym
0 40 Football Baseball Hockey Running Basketball Sw...
1 37
2 61 Basketball Baseball Ballet
3 12 Swimming Ballet Cycling Basketball Volleyball ...
4 78
5 29 Baseball Tennis Ballet Cycling Basketball Foot...
6 31
7 54 Tennis Football Ballet Cycling Running Swimmin...
8 33 Baseball Hockey Swimming Cycling
9 17 Football Hockey Volleyball
Use str.contains to search for the string Baseball in the Gym column.
In [6]: df = df.loc[df['Gym'].str.contains('Baseball')]
In [7]: df
Out[7]:
Total Gym
0 40 Football Baseball Hockey Running Basketball Sw...
2 61 Basketball Baseball Ballet
3 12 Swimming Ballet Cycling Basketball Volleyball ...
5 29 Baseball Tennis Ballet Cycling Basketball Foot...
7 54 Tennis Football Ballet Cycling Running Swimmin...
8 33 Baseball Hockey Swimming Cycling
Compute respective string counts.
In [8]: df['Count'] = df['Gym'].str.split().str.len()
Then select the row of the dataframe with the maximum value in the Total column.
In [9]: df.loc[df['Total'].idxmax()]
Out[9]:
Total 61
Gym Basketball Baseball Ballet
Count 3
Name: 2, dtype: object
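The steps above can also be sketched without overwriting the Gym column, so the original pipe-separated strings stay intact. This is a minimal sketch, not the answer's exact method: the frame is built inline from the rows shown in the question rather than read from sports.csv, and the names sports, n_sports, has_baseball and candidates are illustrative.

```python
import pandas as pd

# Sample rows copied from the question (built inline instead of reading sports.csv)
df = pd.DataFrame({
    "Total": [40, 37, 61, 12, 78, 29, 31, 54, 33, 17],
    "Gym": [
        "Football|Baseball|Hockey|Running|Basketball|Swimming|Cycling|Volleyball|Tennis|Ballet",
        "Baseball|Tennis",
        "Basketball|Baseball|Ballet",
        "Swimming|Ballet|Cycling|Basketball|Volleyball|Hockey|Running|Tennis|Baseball|Football",
        "Swimming|Basketball",
        "Baseball|Tennis|Ballet|Cycling|Basketball|Football|Volleyball|Swimming",
        "Tennis",
        "Tennis|Football|Ballet|Cycling|Running|Swimming|Baseball|Basketball|Volleyball",
        "Baseball|Hockey|Swimming|Cycling",
        "Football|Hockey|Volleyball",
    ],
})

sports = df["Gym"].str.split("|")      # one list of sport names per row
n_sports = sports.str.len()            # question 3: number of sports per row
has_baseball = sports.apply(lambda names: "Baseball" in names)  # question 1, exact match

# Baseball plus at least two other sports means three or more entries
candidates = df[has_baseball & (n_sports >= 3)]

# Question 2: the row with the largest Total among the candidates
best = candidates.loc[candidates["Total"].idxmax()]
print(best["Total"], best["Gym"])
```

Because membership is tested on the split list, a sport name only matches as a whole entry, not as part of a longer name.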
You can do it in one pass as you read the file:
import csv

with open("sports.csv") as f:
    mx, best = float("-inf"), None
    for row in csv.reader(f, delimiter=" ", skipinitialspace=True):
        # row[0] is the total; expand the pipe-separated sports into the row
        row[1:] = row[1].split("|")
        if "Baseball" in row and len(row[1:]) > 2 and int(row[0]) > mx:
            mx = int(row[0])
            best = row
    if best:
        # report the sport count of the best row, not of the last row read
        print(best, mx, len(best[1:]))
Which would give you:
(['61', 'Basketball', 'Baseball', 'Ballet'], 61, 3)
Another way without splitting would be to count the pipe chars:
import csv

with open("sports.csv") as f:
    mx, best = float("-inf"), None
    for row in csv.reader(f, delimiter=" ", skipinitialspace=True):
        # more than one pipe means at least three sports in the row
        if "Baseball" in row[1] and row[1].count("|") > 1 and int(row[0]) > mx:
            mx = int(row[0])
            best = row
    if best:
        # report the pipe count of the best row, not of the last row read
        print(best, mx, best[1].count("|"))
Note, though, that this version does a substring test on the whole field, so the search term could match part of a longer sport name rather than an exact entry.
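The substring concern can be illustrated with a small sketch. The row below is hypothetical (Water Polo does not appear in sports.csv), and has_sport is an illustrative helper name, not part of the answer.

```python
def has_sport(gym, sport):
    """Return True only if `sport` is one of the pipe-separated entries in `gym`."""
    return sport in gym.split("|")


gym = "Water Polo|Tennis|Ballet"  # hypothetical row, not from sports.csv

print("Polo" in gym)              # substring test: matches inside "Water Polo"
print(has_sport(gym, "Polo"))     # exact-entry test: no sport is named just "Polo"
print(has_sport(gym, "Tennis"))   # exact-entry test: "Tennis" is a full entry
```

Splitting costs a little more than counting pipe characters, but it rules out partial matches.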
Try this:
df3.loc[df3['Gym'].str.contains('Hockey', na=False) & (df3['Gym'].str.count(r'\|') > 1)].sort_values('Total').tail(1)
Total Gym
0 40 Football|Baseball|Hockey|Running|Basketball|Sw...
df3.loc[df3['Gym'].str.contains('Baseball', na=False) & (df3['Gym'].str.count(r'\|') > 1)].sort_values('Total').tail(1)
Total Gym
2 61 Basketball|Baseball|Ballet
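A variant of the same filter that matches whole sport names instead of substrings can be sketched with Series.str.get_dummies, which builds one 0/1 indicator column per pipe-separated value. This is only a sketch under assumptions: it uses a four-row subset of the question's data, the names flags, has_baseball, enough_sports and result are illustrative, and the row sum equals the sport count only if no sport is repeated within a row.

```python
import pandas as pd

# A four-row subset of the data shown in the question
df = pd.DataFrame({
    "Total": [37, 61, 78, 33],
    "Gym": [
        "Baseball|Tennis",
        "Basketball|Baseball|Ballet",
        "Swimming|Basketball",
        "Baseball|Hockey|Swimming|Cycling",
    ],
})

# One indicator column per distinct sport
flags = df["Gym"].str.get_dummies(sep="|")

has_baseball = flags["Baseball"].astype(bool)  # exact match on the sport name
enough_sports = flags.sum(axis=1) >= 3         # Baseball plus at least two others

# Keep the single row with the largest Total
result = df[has_baseball & enough_sports].nlargest(1, "Total")
print(result)
```

The indicator frame can also be reused for other questions, such as how many gyms offer each sport (flags.sum()).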