I have a large DataFrame, something like
import pandas as pd
sqldate = pd.Series(["2014-0-1", "2015-10-10", "1990-23-2"])
pdf = pd.Series(["2014.pdf", "2015.pdf", "1999.pdf"])
df = pd.DataFrame({"sqldate":sqldate, "pdf": pdf})
I want to create a boolean column that indicates whether the year in sqldate is the same as the year in the pdf name.
This is another situation where a for loop would be easy to write, but I'd like to vectorize it for speed and cleanliness, and I cannot figure out how.
I have tried simpler approaches, such as just creating df['newcol'] from the first four characters of the date, like df['newcol'] = df['sqldate'][0:4], but that fails. It sets only the first four rows of newcol to the sqldate value and leaves the rest of the rows as NaN, because it interprets [0:4] as a row selector rather than a character slice.
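To make the failure concrete, here is a minimal sketch of the row-slice behaviour. The six-row frame is a hypothetical stand-in (the three-row sample above is too short to show the NaN rows), and the names row_slice and char_slice are only illustrative:

```python
import pandas as pd

# Hypothetical six-row frame, so that a four-row slice leaves some rows uncovered.
df = pd.DataFrame({"sqldate": ["2014-0-1", "2015-10-10", "1990-23-2",
                               "2001-1-1", "2002-2-2", "2003-3-3"]})

# [0:4] on a Series slices ROWS 0..3, not characters.
row_slice = df["sqldate"][0:4]

# Assignment aligns on the index, so rows 4 and 5 have no match and become NaN.
df["newcol"] = row_slice

# The .str accessor applies the slice to each string instead.
char_slice = df["sqldate"].str[0:4]
```

Here newcol holds the full date strings for the first four rows and NaN for the last two, while char_slice holds the four-character prefix of every row.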
Any suggestions for a more elegant, vectorized way to use manipulated string values on a dataframe?
You can use Series.str to apply string functions to the column. Thus df['sqldate'].str[0:4] would extract the first 4 characters (if they exist), and the following checks whether the first four characters of both columns (pdf and sqldate) are the same and puts the result in 'newcol':
df['newcol'] = df['sqldate'].str[0:4]==df['pdf'].str[0:4]
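As a runnable sketch, here is that comparison applied to the sample data from the question. The second variant, which splits on the delimiter instead of taking a fixed-width slice, is an assumption added for illustration and not part of the answer above; it may be useful if the year is not always four characters long. The names sql_year, pdf_year and newcol_split are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "sqldate": ["2014-0-1", "2015-10-10", "1990-23-2"],
    "pdf": ["2014.pdf", "2015.pdf", "1999.pdf"],
})

# Compare the first four characters of each column, element by element.
df["newcol"] = df["sqldate"].str[0:4] == df["pdf"].str[0:4]

# Illustrative variant: take everything before the first delimiter
# instead of assuming a four-character year.
sql_year = df["sqldate"].str.split("-").str[0]
pdf_year = df["pdf"].str.split(".").str[0]
df["newcol_split"] = sql_year == pdf_year

print(df)
```

With this sample, both columns come out True, True, False, since only the last row pairs 1990 with 1999.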
For more about the string functions, see the "Working with text data" section of the pandas documentation.