If cell has 2 words, extract only 1st word and if cell has 3 words, extract 2 first words - PANDAS/REGEX

Question

In my DataFrame, I have a column named 'team' that holds both the city and the team name. I'd like to extract the city into another column. Here is a sample of the DataFrame:

nba_df['team'].head(11)
    team
0   Toronto Raptors
1   Boston Celtics
2   Philadelphia 76ers
3   Cleveland Cavaliers
4   Indiana Pacers
5   Miami Heat
6   Milwaukee Bucks
7   Washington Wizards
8   Detroit Pistons
9   Charlotte Hornets
10  New York Knicks

I could easily extract the first word using regex:

nba_df['cities'] = nba_df['team'].str.extract(r'^(\w+)', expand=False)
nba_df[['team', 'cities']].head(11)


    team                cities
0   Toronto Raptors     Toronto
1   Boston Celtics      Boston
2   Philadelphia 76ers  Philadelphia
3   Cleveland Cavaliers Cleveland
4   Indiana Pacers      Indiana
5   Miami Heat          Miami
6   Milwaukee Bucks     Milwaukee
7   Washington Wizards  Washington
8   Detroit Pistons     Detroit
9   Charlotte Hornets   Charlotte
10  New York Knicks     New

However, in the 'cities' column, New York Knicks yields only "New", and I'd like to get "New York".

So, using regex, how can I extract only the first word when the cell has 2 words, and the first 2 words when the cell has 3 words?

Answer 1

For your scenario, where you have just 2 or 3 word strings, you can use

^(\S+(?:\s+\S+(?=\s+\S+))?)

Details

^ - start of string
(\S+(?:\s+\S+(?=\s+\S+))?) - Capturing group 1:
- \S+ - one or more non-whitespace chars
- (?:\s+\S+(?=\s+\S+))? - an optional sequence of
  - \s+ - 1+ whitespaces
  - \S+ - 1+ non-whitespaces
  - (?=\s+\S+) - that is immediately followed with 1+ whitespaces and 1+ non-whitespaces.
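
As a minimal sketch of how the pattern above could be applied to the column from the question, the snippet below passes it to `Series.str.extract`. The three sample rows are copied from the question; the `pattern` variable name is only illustrative.

```python
import pandas as pd

# Sample rows copied from the question; nba_df and 'team' are the names used there.
nba_df = pd.DataFrame(
    {"team": ["Toronto Raptors", "Philadelphia 76ers", "New York Knicks"]}
)

# First word, plus a second word only when yet another word follows it.
pattern = r"^(\S+(?:\s+\S+(?=\s+\S+))?)"

# With a single capturing group, expand=False returns a Series.
nba_df["cities"] = nba_df["team"].str.extract(pattern, expand=False)

print(nba_df["cities"].tolist())  # ['Toronto', 'Philadelphia', 'New York']
```

Because the second word is captured only when the lookahead finds a following word, two-word names keep just their first word.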

Here are some other regex options:

- All words but the last: ^(\S+(?:\s+\S+)*)\s+\S+$ or ^(.*\S)\s+\S+$ or ^(.*?)\s+\S+$
- The first word of a two-word string, the first two words of a three-word string, and no match for any other string: ^(\S+(?=\s+\S+$)|\S+\s+\S+(?=\s+\S+$))
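
To make the difference between these options concrete, here is a small sketch that uses only the standard `re` module. It compares the first "all words but the last" pattern with the stricter two-or-three-word pattern. The helper `first_group` and the four-word sample string are made up for illustration.

```python
import re

# "All words but the last" (first variant listed above).
ALL_BUT_LAST = re.compile(r"^(\S+(?:\s+\S+)*)\s+\S+$")
# Matches only two-word and three-word strings.
STRICT = re.compile(r"^(\S+(?=\s+\S+$)|\S+\s+\S+(?=\s+\S+$))")


def first_group(regex, text):
    """Return capturing group 1, or None when the pattern does not match."""
    match = regex.search(text)
    return match.group(1) if match else None


# "one two three four" is a hypothetical input, not taken from the question.
samples = ["Heat", "Miami Heat", "New York Knicks", "one two three four"]

all_but_last = [first_group(ALL_BUT_LAST, s) for s in samples]
strict = [first_group(STRICT, s) for s in samples]

print(all_but_last)  # [None, 'Miami', 'New York', 'one two three']
print(strict)        # [None, 'Miami', 'New York', None]
```

Both patterns only count words, so a name such as "Portland Trail Blazers" would yield "Portland Trail" with either one.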

Answer 2

Don't struggle with regex for this, unless you find it very readable. Instead, starting with the string team_name ... split, slice, and join:

team_words = team_name.split()
team_city = team_words[:-1]
city = ' '.join(team_city)

In one line:

city = ' '.join(team_name.split()[:-1])

Can you readily work that into your DataFrame column operation?
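
One possible way to carry the split, slice, and join steps over to a whole column is to chain the pandas `.str` accessors, sketched below. The sample rows come from the question; the 'cities' column name follows the question's own code.

```python
import pandas as pd

# Sample rows copied from the question.
nba_df = pd.DataFrame({"team": ["Boston Celtics", "Miami Heat", "New York Knicks"]})

# split() -> list of words, str[:-1] -> drop the last word, str.join -> back to a string.
nba_df["cities"] = nba_df["team"].str.split().str[:-1].str.join(" ")

print(nba_df["cities"].tolist())  # ['Boston', 'Miami', 'New York']
```

A row-wise alternative would be `nba_df["team"].apply(lambda name: " ".join(name.split()[:-1]))`, assuming every cell holds a string.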

2 answers

solution1
2 ACCEPTED 2020-10-24 17:25:43

solution2
-1 2020-10-24 17:18:15
