
If a cell has 2 words, extract only the 1st word; if it has 3 words, extract the first 2 words - pandas/regex

In my DataFrame, I have a column named 'team'. It holds the city and the team name. I'd like to extract the city into another column. Here is the DataFrame:

nba_df['team'].head(11)
    team
0   Toronto Raptors
1   Boston Celtics
2   Philadelphia 76ers
3   Cleveland Cavaliers
4   Indiana Pacers
5   Miami Heat
6   Milwaukee Bucks
7   Washington Wizards
8   Detroit Pistons
9   Charlotte Hornets
10  New York Knicks

I could easily extract the first word using regex:

# capture the leading run of word characters (the first word) of each team string
nba_df['cities'] = nba_df.team.str.extract(r'^(\w+)', expand=True)
nba_df[['team', 'cities']].head(11)


    team                cities
0   Toronto Raptors     Toronto
1   Boston Celtics      Boston
2   Philadelphia 76ers  Philadelphia
3   Cleveland Cavaliers Cleveland
4   Indiana Pacers      Indiana
5   Miami Heat          Miami
6   Milwaukee Bucks     Milwaukee
7   Washington Wizards  Washington
8   Detroit Pistons     Detroit
9   Charlotte Hornets   Charlotte
10  New York Knicks     New

However, in the 'cities' column, New York Knicks gives me only "New", and I'd like to get "New York".

So, if a cell has 2 words, how can I extract only the first word, and if a cell has 3 words, how can I extract the first 2 words, using regex?

For your scenario, where the strings have just two or three words, you can use

^(\S+(?:\s+\S+(?=\s+\S+))?)


Details

  • ^ - start of string
  • (\S+(?:\s+\S+(?=\s+\S+))?) - Capturing group 1:
    • \S+ - one or more non-whitespace chars
    • (?:\s+\S+(?=\s+\S+))? - an optional sequence of
      • \s+ - 1+ whitespaces
      • \S+ - 1+ non-whitespaces
      • (?=\s+\S+) - a positive lookahead requiring that 1+ whitespaces and 1+ non-whitespaces follow immediately, without consuming them.
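As a minimal sketch of how this pattern might be plugged into pandas, the snippet below passes it to Series.str.extract. The three-row nba_df is a cut-down stand-in for the DataFrame in the question, and the 'cities' column name follows the question's own code.

```python
import pandas as pd

# Small stand-in for the question's DataFrame (assumed shape: one 'team' column).
nba_df = pd.DataFrame(
    {"team": ["Toronto Raptors", "Philadelphia 76ers", "New York Knicks"]}
)

# First word, plus a second word only when a third word follows it.
pattern = r"^(\S+(?:\s+\S+(?=\s+\S+))?)"

# expand=False returns a Series because the pattern has a single capturing group.
nba_df["cities"] = nba_df["team"].str.extract(pattern, expand=False)
print(nba_df)
```

With two-word strings the optional group fails its lookahead and is skipped, so only the first word is captured; with three-word strings the second word is included.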

Here are some other regex options:

  • All words but the last: ^(\S+(?:\s+\S+)*)\s+\S+$ / ^(.*\S)\s+\S+$ / ^(.*?)\s+\S+$
  • The first word of a two-word string, the first two words of a three-word string, and no match for any other string: ^(\S+(?=\s+\S+$)|\S+\s+\S+(?=\s+\S+$))
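To see how these alternatives differ, here is a rough sketch using Python's re module. The first_group helper and the four-word test string are hypothetical, introduced only for illustration.

```python
import re

# "All words but the last" (first variant from the list above).
all_but_last = re.compile(r"^(\S+(?:\s+\S+)*)\s+\S+$")

# Strict variant: matches only two-word and three-word strings.
strict = re.compile(r"^(\S+(?=\s+\S+$)|\S+\s+\S+(?=\s+\S+$))")


def first_group(pattern, text):
    """Hypothetical helper: return capture group 1, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


for text in ["Miami Heat", "New York Knicks", "one two three four"]:
    print(text, "|", first_group(all_but_last, text), "|", first_group(strict, text))
```

The two patterns agree on two-word and three-word strings and diverge on longer ones: the first keeps everything except the last word, while the strict one does not match at all.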

Don't struggle with regex for this, unless you find it very readable. Instead, starting with the string team_name ... split, slice, and join:

team_words = team_name.split()
team_city = team_words[:-1]
city = ' '.join(team_city)

In one line:

city = ' '.join(team_name.split()[:-1])

Can you readily apply that across your DataFrame column?
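One possible way to apply the split, slice, and join steps to a whole column is through the vectorized .str accessor. This is a sketch under the same assumption as the one-liner above, namely that the team name is always the last word; the sample rows are a stand-in for the question's data.

```python
import pandas as pd

# Small stand-in for the question's DataFrame (assumed shape: one 'team' column).
nba_df = pd.DataFrame(
    {"team": ["Toronto Raptors", "Miami Heat", "New York Knicks"]}
)

# split on whitespace -> drop the last word of each list -> join the rest with spaces
nba_df["cities"] = nba_df["team"].str.split().str[:-1].str.join(" ")
print(nba_df)
```

Note that a team whose name itself has two words (for example Portland Trail Blazers) would come out as "Portland Trail" under this rule, so such rows would need separate handling.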


 