Extracting the 'year' from a list of strings in a pandas data frame

Question

I have a pandas DataFrame with a column named 'title' that holds strings such as "Robert Hall 2015 Viognier" and "Woodinville Wine Cellars 2012 Reserve". I am trying to iterate through each row to extract the year as an integer, but the strings differ from one another and the year does not always appear in the same position.

Answer 1

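You can use the str.extract method with a regex. The pattern must contain a capture group (the parentheses), and expand=False returns a Series rather than a one-column DataFrame:

df['title'].str.extract(r'(\d{4})', expand=False).astype(int)

Note that astype(int) raises an error if any title has no four-digit match.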

Here is a crash course on regular expressions (look on the right for "lesson notes" for a summary).
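
If some titles might not contain a year, a minimal sketch along these lines keeps those rows as missing values instead of failing. The third title and the new 'year' column name are assumptions made up for illustration; they are not from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "title": [
        "Robert Hall 2015 Viognier",
        "Woodinville Wine Cellars 2012 Reserve",
        "House Red Blend",  # hypothetical row with no year
    ]
})

# The parentheses form the capture group that str.extract requires.
# expand=False returns a Series rather than a one-column DataFrame.
years = df["title"].str.extract(r"(\d{4})", expand=False)

# to_numeric turns the matched strings into numbers (NaN where nothing matched);
# the nullable Int64 dtype then stores integers alongside missing values.
df["year"] = pd.to_numeric(years).astype("Int64")

print(df)
```

The nullable Int64 dtype is one option; dropping the rows without a match before calling astype(int) would be another.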

Answer 2

Please post your code. Here is a tip:

import re

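mystring = "Woodinville Wine Cellars 2012 Reserve"

match = re.search(r'\d{4}', mystring)
print(match.group(0))
2012

This works for any string that contains a four-digit year. re.search returns None when nothing matches, so check for that before calling group(). The result is a string, so wrap it in int() if you need a number.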
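
One caveat: \d{4} also matches the first four digits of a longer number. A possible tightening, sketched below, uses word boundaries and restricts the match to 1900-2099. That range, the helper name find_year, and the third title are assumptions for illustration only.

```python
import re

# Assumption: years fall in 1900-2099; adjust the pattern if yours do not.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")

def find_year(text):
    """Return the first standalone four-digit year in text, or None."""
    match = YEAR_PATTERN.search(text)
    return int(match.group(0)) if match else None

titles = [
    "Woodinville Wine Cellars 2012 Reserve",
    "Robert Hall 2015 Viognier",
    "Lot 12345 Red Blend",  # hypothetical: five digits, not a year
]

years = [find_year(t) for t in titles]
print(years)  # [2012, 2015, None]
```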

Answer 3

You can use a regular expression to check whether the string contains four digits in a row, and use match to extract them.

/**
 * Get a year from the given title.
 * @param {string} title The title to extract the year from.
 * @returns {number|undefined} The extracted year, or undefined if no year could be found.
 */
function getYearFromTitle (title)
{
    // Make sure that the title is a string
    if (typeof title !== "string") throw new Error("Typeof title must be a string!");

    // Do a regular expression search for 4 digits
    const results = title.match(/\d{4}/);

    // If results is null, return undefined.
    if (!results) return;

    // Return the first occurrence of 4 digits as a number.
    return Number(results[0]);
}

Note: This is JavaScript code; you would have to write the equivalent in Python.
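
A rough Python port of the function above might look like the following. The name get_year_from_title and the choice to return None where the JavaScript returns undefined are assumptions of this sketch, not part of the original answer.

```python
import re
from typing import Optional

def get_year_from_title(title: str) -> Optional[int]:
    """Return the first run of four digits in title as an int, or None."""
    # Make sure that the title is a string.
    if not isinstance(title, str):
        raise TypeError("title must be a string")

    # Search for four digits in a row.
    match = re.search(r"\d{4}", title)

    # No match: return None (the JavaScript version returns undefined).
    if match is None:
        return None

    return int(match.group(0))
```

For a DataFrame, this could be applied row by row with df['title'].apply(get_year_from_title), although the vectorized str.extract approach from Answer 1 is usually the more idiomatic choice in pandas.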

3 answers

solution1
1 2019-11-25 21:59:23

solution2
0 2019-11-25 21:51:50

solution3
0 2019-11-25 21:59:31
