vectorized string manipulation in pandas dataframe

Question

I have a large DataFrame, something like

import pandas as pd

sqldate = pd.Series(["2014-0-1", "2015-10-10", "1990-23-2"])
pdf = pd.Series(["2014.pdf", "2015.pdf", "1999.pdf"])

df = pd.DataFrame({"sqldate": sqldate, "pdf": pdf})

I want to create a boolean column that indicates whether the year in sqldate is the same as the year in the pdf file name.

This is another situation where a for loop would be easy to write, but I'd like to vectorize it for speed and cleanliness, and I cannot figure out how.

I have tried simpler approaches, such as creating df['newcol'] from the first four characters of the date with df['newcol'] = df['sqldate'][0:4], but that fails. It sets only the first four rows of newcol to the sqldate value and leaves the rest of the rows as NaN, because [0:4] is interpreted as a row selector rather than a character slice.

Any suggestions for a more elegant, vectorized way to work with manipulated string values in a DataFrame?

Answer 1

You can use the Series.str accessor to apply string functions to a column. df['sqldate'].str[0:4] extracts the first four characters of each value (or fewer, if a string is shorter). The following compares the first four characters of the two columns (pdf and sqldate) and stores the result in 'newcol':

df['newcol'] = df['sqldate'].str[0:4]==df['pdf'].str[0:4]
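
To make the difference between the two kinds of slicing concrete, here is a minimal sketch using the sample data from the question. The names row_slice and char_slice are introduced only for illustration.

```python
import pandas as pd

# Sample data from the question
sqldate = pd.Series(["2014-0-1", "2015-10-10", "1990-23-2"])
pdf = pd.Series(["2014.pdf", "2015.pdf", "1999.pdf"])
df = pd.DataFrame({"sqldate": sqldate, "pdf": pdf})

# Plain slicing on a Series selects ROWS (here: up to the first four rows)
row_slice = df["sqldate"][0:4]

# Slicing through .str selects CHARACTERS within each string
char_slice = df["sqldate"].str[0:4]

# Element-wise comparison of the two four-character prefixes
df["newcol"] = df["sqldate"].str[0:4] == df["pdf"].str[0:4]

print(char_slice.tolist())
print(df)
```

With this data, the third row should come out False, since "1990" and "1999" differ.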

See more about the string functions:

http://pandas.pydata.org/pandas-docs/stable/text.html
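
If the year is not guaranteed to sit in the first four characters, other methods on the same .str accessor may be a better fit. The sketch below is one possible variation, not part of the original answer. It assumes the date is "-"-separated and the file name contains exactly one four-digit run. The names year_from_date, year_from_pdf and same_year are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "sqldate": ["2014-0-1", "2015-10-10", "1990-23-2"],
    "pdf": ["2014.pdf", "2015.pdf", "1999.pdf"],
})

# Take the part before the first "-" as the year (assumes a "-"-separated date)
year_from_date = df["sqldate"].str.split("-").str[0]

# Pull the first run of four digits out of the file name;
# expand=False returns a Series rather than a one-column DataFrame
year_from_pdf = df["pdf"].str.extract(r"(\d{4})", expand=False)

df["same_year"] = year_from_date == year_from_pdf
print(df)
```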

1 answer

solution1
8 ACCEPTED 2015-11-10 07:56:25