Regex not matching depending on the number of words

Consider a file containing:

Jesus is friends with Chuck Norris
Cindy Crawford is friends with Nicole Kidman
V is friends with Barack Obama
Chuck Norris is friends with Barack Obama
V is friends with François Hollande
Penelope Cruiz is friends with Tom Cruise
Nicole Kidman is friends with Tom Cruise
Katie Holmes is friends with Tom Cruise
Sim is friends with Lara Croft
Sim is friends with Chuck Norris
Lara Croft is friends with V
Yvette Horner is friends with Sim
François Hollande is friends with Barack Obama
Sim is friends with Jesus
Tom Cruise is friends with Barack Obama

I am trying to match all of these lines, which are basically formatted this way:

first_name (last_name?) 'is friends with' first_name (last_name?)

Some lines have two full names, some have a first name and a full name, or a full name and a first name, and so on, always with "is friends with" in the middle of the sentence.

Here is the current regex I am using in Python:

(\w+ \w+) (is friends with) (\w+ \w+)

but this one only matches the "full_name is friends with full_name" lines. I can't seem to find a way to also match the lines that have two first names, or one full name and one first name, etc.

Any ideas please?

You could make the last name optional on both sides by adding an optional non-capturing group (?: ...)? that matches a space followed by \w+ (or use a character class there if you need to match more than \w):

(\w+(?: \w+)?) (is friends with) (\w+(?: \w+)?)

To allow names of more than two words, you could repeat the non-capturing group zero or more times by using an asterisk * instead of the question mark ?.
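As a minimal sketch of the difference between the two quantifiers, the snippet below runs both variants with `re.fullmatch` against a few lines from the question. The names `OPTIONAL` and `REPEATED` and the three-word name are made up for illustration and are not part of the original post.

```python
import re

# "?" allows at most one extra word after the first name.
OPTIONAL = re.compile(r"(\w+(?: \w+)?) (is friends with) (\w+(?: \w+)?)")
# "*" allows any number of extra words.
REPEATED = re.compile(r"(\w+(?: \w+)*) (is friends with) (\w+(?: \w+)*)")

lines = [
    "Jesus is friends with Chuck Norris",
    "V is friends with François Hollande",
    "Lara Croft is friends with V",
    "Sim is friends with Jesus",
]

for line in lines:
    m = OPTIONAL.fullmatch(line)
    # Group 1 and group 3 hold the names, group 2 the fixed phrase.
    print(m.groups() if m else None)

# Hypothetical three-word name, not taken from the original file.
three_words = "Mary Jane Watson is friends with V"
print(OPTIONAL.fullmatch(three_words))           # None: only two words allowed
print(REPEATED.fullmatch(three_words).groups())  # ('Mary Jane Watson', 'is friends with', 'V')
```

For the data shown in the question, where no name has more than two words, both variants should behave the same.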

Just include the space alongside \w in a character class, so that each group captures both single names and full names:

([\w ]+) (is friends with) ([\w ]+)
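A quick sketch of how this character-class pattern behaves, assuming it is applied with `re.fullmatch`. The name `LOOSE` and the double-space input are illustrative only. Because the class treats a space like any other name character, stray spaces can end up inside a captured name.

```python
import re

LOOSE = re.compile(r"([\w ]+) (is friends with) ([\w ]+)")

m = LOOSE.fullmatch("Sim is friends with Lara Croft")
print(m.groups())  # ('Sim', 'is friends with', 'Lara Croft')

# Hypothetical malformed line with two spaces after the first name:
# the extra space is swallowed into group 1.
sloppy = LOOSE.fullmatch("Sim  is friends with V")
print(repr(sloppy.group(1)))  # 'Sim '
```

If the input is known to be clean this is harmless; otherwise calling `.strip()` on the captured groups is one simple workaround.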

You can use the following to match variable length names:

(\w+(?: \w+)*) is friends with (\w+(?: \w+)*)
  • (\w+(?: \w+)*) Group the following into group 1 (the same sub-pattern is repeated as group 2 for the second name)
    • \w+ Matches any word character 1 or more times
    • (?: \w+)* Matches a space followed by one or more word characters, any number of times
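The pattern broken down above could be used to turn the file's lines into pairs of names. The following is only a sketch: the helper `parse_friendships` and the `sample` list are hypothetical, and lines that do not fit the format are silently skipped as an assumed policy.

```python
import re

# One or more space-separated words on each side of the fixed phrase.
FRIENDS = re.compile(r"(\w+(?: \w+)*) is friends with (\w+(?: \w+)*)")

def parse_friendships(lines):
    """Return (name, name) tuples for every line that fits the format."""
    pairs = []
    for line in lines:
        # strip() drops the trailing newline when reading from a file.
        m = FRIENDS.fullmatch(line.strip())
        if m:
            pairs.append(m.groups())
    return pairs

sample = [
    "Cindy Crawford is friends with Nicole Kidman\n",
    "V is friends with Barack Obama\n",
    "not a friendship line\n",  # illustrative bad line, skipped
    "Yvette Horner is friends with Sim\n",
]

for pair in parse_friendships(sample):
    print(pair)
```

With a real file, something like `parse_friendships(open(path, encoding="utf-8"))` would work the same way, since iterating a file object yields its lines.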

Note that in Python 3, \w is Unicode-aware by default for str patterns, so it matches letters like ç. Passing re.ASCII restricts it to [a-zA-Z0-9_].
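To check this on one of the lines from the question, the sketch below matches the same pattern with and without the standard `re.ASCII` flag. The variable names are illustrative.

```python
import re

pattern = r"(\w+(?: \w+)*) is friends with (\w+(?: \w+)*)"
line = "François Hollande is friends with Barack Obama"

# Default for str patterns in Python 3: \w is Unicode-aware, so "ç" counts.
unicode_match = re.fullmatch(pattern, line)
print(unicode_match.groups())  # ('François Hollande', 'Barack Obama')

# With re.ASCII, \w is limited to [a-zA-Z0-9_] and the match fails at "ç".
ascii_match = re.fullmatch(pattern, line, flags=re.ASCII)
print(ascii_match)  # None
```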
