vectorized string manipulation in pandas dataframe

I have a large DataFrame, something like

import pandas as pd

sqldate = pd.Series(["2014-0-1", "2015-10-10", "1990-23-2"])
pdf = pd.Series(["2014.pdf", "2015.pdf", "1999.pdf"])

df = pd.DataFrame({"sqldate":sqldate, "pdf": pdf})

I want to create a boolean column that indicates whether the year of sqldate is the same as the year in the pdf name.

This is another situation where a for loop would be easy, but I'd like to vectorize it for speed and cleanliness, and I cannot figure out how.

I have tried simpler approaches, such as just creating df['newcol'] and taking the first four characters of the date, like df['newcol'] = df['sqldate'][0:4], but that fails. It just sets the first four rows of newcol to sqldate and the rest of the rows to NaN, because it interprets [0:4] as an index selector.
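The behaviour described above can be reproduced with a small sketch. The six-row frame and the column names by_rows and by_chars are made up for illustration (the original sample has only three rows, so the NaN rows would not show up there):

```python
import pandas as pd

# Hypothetical six-row frame, so that a four-row slice leaves rows over.
df = pd.DataFrame({"sqldate": ["2014-0-1", "2015-10-10", "1990-23-2",
                               "2001-5-5", "2002-6-6", "2003-7-7"]})

# Plain [0:4] on a Series slices ROWS (positions 0 to 3), not characters.
row_slice = df["sqldate"][0:4]

# Assignment aligns on the index, so the two rows missing from the
# slice (index 4 and 5) are filled with NaN.
df["by_rows"] = row_slice

# The .str accessor applies the slice to each string element instead.
df["by_chars"] = df["sqldate"].str[0:4]

print(df)
```

The by_rows column holds full date strings for the first four rows only, while by_chars holds a four-character prefix for every row.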

Any suggestions for a more elegant, vectorized way to use manipulated string values on a dataframe?

You can use the Series.str accessor to apply string functions to each element of a column. Thus df['sqldate'].str[0:4] extracts the first four characters of each value (or fewer, if the string is shorter). The following checks whether the first four characters of both columns (pdf and sqldate) are the same and stores the result in 'newcol':

df['newcol'] = df['sqldate'].str[0:4]==df['pdf'].str[0:4]
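As a runnable sketch, here is the same comparison applied to the sample data from the question. The second column, newcol_regex, is not part of the original answer: it is a hypothetical variant that assumes the year is always a run of four digits at the start of the string and pulls it out with Series.str.extract, which may be preferable if the prefix length is not guaranteed.

```python
import pandas as pd

df = pd.DataFrame({
    "sqldate": ["2014-0-1", "2015-10-10", "1990-23-2"],
    "pdf": ["2014.pdf", "2015.pdf", "1999.pdf"],
})

# Element-wise slice of each string, then element-wise comparison.
df["newcol"] = df["sqldate"].str[0:4] == df["pdf"].str[0:4]

# Hypothetical variant: extract a leading four-digit year with a regex.
# expand=False returns a Series; rows without a match become NaN,
# and NaN == NaN compares as False.
year_pattern = r"^(\d{4})"
sql_year = df["sqldate"].str.extract(year_pattern, expand=False)
pdf_year = df["pdf"].str.extract(year_pattern, expand=False)
df["newcol_regex"] = sql_year == pdf_year

print(df)
```

For this sample both columns come out as True, True, False, since only the last row has differing years (1990 versus 1999).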

See more about the string functions:

http://pandas.pydata.org/pandas-docs/stable/text.html
