Parsing Natural Language Music Citations Using Regex

I am struggling with nailing down a fairly complex regular expression to parse song titles with optional artist attribution from loosely-typed English. The user input comes from a single text field and the regex matches will be used to query a song database to get unique track IDs. I need to be able to get these matches:

  • \1 = song title
  • \2 = artist

while being fairly liberal in allowed formats.

Examples

The word "by" should split the string into song title and artist (but only on word boundaries), as should a comma with or without trailing whitespace:

baby one more time by britney spears

baby one more time, britney spears

baby one more time,britney spears

  • \1 = baby one more time
  • \2 = britney spears

False positives like these are acceptable:

down by the bay

  • \1 = down
  • \2 = the bay

whatever people say i am, that's what i'm not

  • \1 = whatever people say i am
  • \2 = that's what i'm not

…assuming quotes can be used to mark a run of text as a song title explicitly:

"down by the bay"

  • \1 = down by the bay
  • \2 not matched

"whatever people say i am, that's what i'm not" by arctic monkeys

  • \1 = whatever people say i am, that's what i'm not
  • \2 = arctic monkeys

Single quotes should work too, but obviously not if they appear within the title:

'whatever people say i am, that's what i'm not'

  • \1 = whatever people say i am, that
  • \2 = s what i'm not'

Additionally, if quotes are in use, the word "by" or a comma is optional:

"down by the bay" raffi

  • \1 = down by the bay
  • \2 = raffi

However, if there are no quotes, and more than one "by", then only the last "by" should be used as a delimiter:

down by the bay by raffi

  • \1 = down by the bay
  • \2 = raffi

Is this even possible with a single regex? Or would the more sane way be to split it up into multiple expressions? Either way, what might this look like?

Here is an example, using C#:

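using System;
using System.Text.RegularExpressions;

// Either a quoted title with an optional "," or "by", or an unquoted title followed by a required "," or "by".
var regex = @"^((""(?<title>[^""]+)""|'(?<title>[^']+)')(\s*,\s*|\s+by\s+)?|(?<title>.*)(\s*,\s*|\s+by\s+))\s*(?<artist>.*)$";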

var items = new []{
    "baby one more time by britney spears",
    "baby one more time, britney spears",
    "baby one more time,britney spears",
    "down by the bay",
    "whatever people say i am, that's what i'm not",
    "\"down by the bay\"",
    "\"whatever people say i am, that's what i'm not\" by arctic monkeys",
    "'whatever people say i am, that's what i'm not'",
    "\"down by the bay\" raffi",
    "down by the bay by raffi",
};

foreach (var item in items)
{
    var match = Regex.Match(item, regex, RegexOptions.ExplicitCapture);
    Console.WriteLine(match.Groups["title"] + " - " + match.Groups["artist"]);
}

Output matches your specification, as far as I can tell:

baby one more time - britney spears
baby one more time - britney spears
baby one more time - britney spears
down - the bay
whatever people say i am - that's what i'm not
down by the bay - 
whatever people say i am, that's what i'm not - arctic monkeys
whatever people say i am, that - s what i'm not'
down by the bay - raffi
down by the bay - raffi

You can actually make it better for the single-quote case by allowing apostrophes inside words:

var regex = @"^((""(?<title>[^""]+)""|'(?<title>([^']|(?<=\w)'(?=\w))+)')(\s*,\s*|\s+by\s+)?|(?<title>.*)(\s*,\s*|\s+by\s+))\s*(?<artist>.*)$";

Which fixes this case:

whatever people say i am, that's what i'm not - 

Here's a commented version of the regex, which explains what each part does (it should be matched with RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace):

var regex = @"
^
  (
    (
      ""(?<title>[^""]+)""               (?# matches a double-quoted string )
    | '(?<title>([^']|(?<=\w)'(?=\w))+)' (?# matches a single-quoted string, allowing apostrophes inside words )
    ) (\s*,\s*|\s+by\s+)?                (?# optionally followed by ',' or 'by' )
  |
    (?<title>.*)(\s*,\s*|\s+by\s+)       (?# otherwise, everything up to the last ',' or 'by' )
  )
  \s*(?<artist>.*)                       (?# everything after this is the artist name )
$";

Edit:

I've played around a bit with the PHP code, but I can't get it to use named capturing groups properly. Here is a version using unnamed capturing groups:

$regex = "/^(?:(?:\"([^\"]+)\"|'((?:[^']|(?<=\\w)'(?=\\w))+)')(?:\\s*,\\s*|\\s+by\\s+)?|(.*)(?:\\s*,\\s*|\\s+by\\s+))\\s*(.*)\$/";

preg_match($regex, '"down by the river"', $matches);

print_r($matches);

The title will be in group 1, 2, or 3, and the artist in group 4.

Based on the examples you've posted, I certainly wouldn't try to write a single regex for all cases, unless there was some compelling reason to do so. Writing such an expression, which I do imagine is possible, would be very brittle, and would likely be a hassle to maintain.

It sounds like you have some simple rule-based processing, and I would treat it as such. Each individual rule could be its own regex, stored in whatever order you like. As you gain experience with real input, you can work out whether a different order does better, perhaps judged by the percentage of inputs that are parsed the way you want.

Then refine the rules iteratively. You may start to notice more complex patterns, and you could expand your rule classes so that one rule takes multiple steps into account; e.g., if a particular rule keeps failing, adding one extra check to it might weed out most of the failures.
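The ordered-rules idea could be sketched roughly as follows. This is only an illustration in Python, not the code from any answer here: the rule names, the RULES table and the parse_citation helper are all made up for the example, and the patterns are deliberately simple so that each one can be reordered or replaced on its own.

```python
import re

# Hypothetical rule table: (rule name, pattern). Order matters; the first match wins.
RULES = [
    # "title" artist / "title", artist / "title" by artist / "title"
    ("double_quoted", re.compile(r'^"(?P<title>[^"]+)"\s*(?:,|by\b)?\s*(?P<artist>.*)$')),
    # Same with single quotes, allowing apostrophes that sit inside a word.
    ("single_quoted", re.compile(r"^'(?P<title>(?:[^']|(?<=\w)'(?=\w))+)'\s*(?:,|by\b)?\s*(?P<artist>.*)$")),
    # Unquoted: the greedy title makes the LAST standalone "by" the delimiter.
    ("last_by", re.compile(r"^(?P<title>.+)\s+by\s+(?P<artist>.+)$")),
    # Unquoted: split on the last comma.
    ("last_comma", re.compile(r"^(?P<title>.+),\s*(?P<artist>.+)$")),
    # Fallback: the whole input is the title.
    ("title_only", re.compile(r"^(?P<title>.+)$")),
]


def parse_citation(text):
    """Return (rule_name, title, artist_or_None), or None if no rule matches."""
    text = text.strip()
    for name, pattern in RULES:
        match = pattern.match(text)
        if match:
            title = match.group("title").strip()
            # The fallback rule has no artist group, hence the .get().
            artist = (match.groupdict().get("artist") or "").strip()
            return name, title, artist or None
    return None


for sample in ['"down by the bay" raffi', "down by the bay by raffi"]:
    print(parse_citation(sample))
```

Returning the rule name along with the result makes it easy to log which rule fired for each input, which is the kind of data you would need in order to judge whether a different rule order parses more inputs the way you want.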

As for each regex, simplest is probably best; none of the individual rules should need to be very complicated, especially at first. Regular expressions are very powerful tools, but I wouldn't focus too much on shoehorning something like natural-language parsing into a tool that is better suited to parsing well-defined formal languages. (Thus the "regular" part.)

One more idea, off the top of my head: in some cases, running some sort of conformance (normalization) pass on the input text first can make the processing easier, for instance by reducing the number of cases you have to handle. To use a (possibly good or bad) example from the question: instead of having one rule for X by Y, one for X, Y and one for "X" Y, you could run a filter that replaces by[space] with a comma, one that replaces ,[space] with a comma, and one that replaces "X"[space] with X followed by a comma. At the end you are left with only X,Y, so you only have to process that one case. This is likely too simplistic an example to be useful as-is, but it's a good pattern to look for; sometimes conformance can greatly simplify this kind of processing.
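A minimal Python sketch of that conformance pass might look like this. It departs from the description in one respect, which is an assumption of the sketch: instead of rewriting everything to a literal comma, it uses a sentinel character (SEP) as the canonical delimiter, because titles can themselves contain commas. FILTERS, normalize and split_citation are hypothetical names chosen for the example.

```python
import re

# Assumed canonical delimiter: a control character that should not appear in typed input.
SEP = "\x1f"

# Each filter rewrites one surface form into "title<SEP>artist".
FILTERS = [
    # "X" Y, "X", Y, "X" by Y, or a bare "X"
    (re.compile(r'^"([^"]+)"\s*(?:,|by\b)?\s*'), "\\1" + SEP),
    # X by Y (greedy, so the last standalone "by" is the one replaced)
    (re.compile(r"^(.+)\s+by\s+"), "\\1" + SEP),
    # X, Y or X,Y (greedy, so the last comma is the one replaced)
    (re.compile(r"^(.+)\s*,\s*"), "\\1" + SEP),
]


def normalize(text):
    """Rewrite the input to the canonical form using the first filter that applies."""
    text = text.strip()
    for pattern, replacement in FILTERS:
        text, count = pattern.subn(replacement, text, count=1)
        if count:
            break
    return text


def split_citation(text):
    """After normalization there is only one case left: split on SEP."""
    title, _, artist = normalize(text).partition(SEP)
    return title.strip(), (artist.strip() or None)


print(split_citation("baby one more time, britney spears"))
print(split_citation('"down by the bay" raffi'))
```

All the format-specific knowledge lives in the filters; the code that finally extracts title and artist never has to know which format the user typed.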

I would go a more statistical/spam-filter way and reduce the natural language to an array of words, then measure the distance among the words that compose the title and the artist's name.

In regexp terms, this might mean transforming every ordinary word ( \w+ ) into a single - and every word that belongs to the title or the artist into a !

But that's just a fancy way to visualize word runs.
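One possible reading of that visualization, sketched in Python. Everything here is an assumption made for illustration: the vocabulary argument stands in for a set of words known to occur in titles or artist names (for example, drawn from the song database), and word_mask and known_runs are made-up helper names. The answer itself does not say how the known words are identified.

```python
import re


def word_mask(text, vocabulary):
    """Map each word to '!' if it is in the (hypothetical) vocabulary, else '-'."""
    # Note: \w+ splits on apostrophes, so "that's" becomes "that" and "s".
    words = re.findall(r"\w+", text.lower())
    mask = "".join("!" if word in vocabulary else "-" for word in words)
    return words, mask


def known_runs(mask):
    """Return (start, end) word-index spans for each run of known words."""
    return [match.span() for match in re.finditer(r"!+", mask)]


words, mask = word_mask("i want to hear yesterday by the beatles", {"yesterday", "beatles"})
print(mask)
print(known_runs(mask))
```

The distance between two runs is then just the gap between one span's end and the next span's start, measured in words.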
