What would be an efficient pandas command to create a new data frame from an existing data frame that has only one column, named val, applying the following transformation?
Input:
1_2_3
1_2_3_4
1_2_3_4_5
Output:
2
2_3
2_3_4
Remove everything up to and including the first underscore, and also everything from the last underscore to the end of the string.
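You can use str.replace with a regex that matches characters up to and including the first _ and from the last _ to the end of string, replacing both those parts with nothing. In pandas 2.0 and later, pass regex=True, because str.replace treats the pattern as a literal string by default:
df['val'] = df['val'].str.replace(r'^[^_]*_(.*)_[^_]*$', r'\1', regex=True)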
Output:
val
0 2
1 2_3
2 2_3_4
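For anyone who wants to run this end to end, here is a minimal self-contained sketch. The sample frame is rebuilt from the question's input, and the names pattern and result are only illustrative.

```python
import pandas as pd

# Sample data taken from the question
df = pd.DataFrame({"val": ["1_2_3", "1_2_3_4", "1_2_3_4_5"]})

# Group 1 captures everything between the first and the last underscore;
# the whole string is replaced by that group.
pattern = r"^[^_]*_(.*)_[^_]*$"
result = df["val"].str.replace(pattern, r"\1", regex=True)

print(result.tolist())  # ['2', '2_3', '2_3_4']
```

Because the pattern is anchored at both ends and needs two underscores, a value with fewer than two underscores does not match and is left unchanged.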
If you want that single column in a new dataframe, you can convert it to one using to_frame:
df2 = df['val'].str.replace(r'^[^_]*_(.*)_[^_]*$', r'\1', regex=True).to_frame()
Another way with str slicing after split:
df['val'].str.split("_").str[1:-1].str.join("_")
0 2
1 2_3
2 2_3_4
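The split approach can be broken into its three steps, which makes it easier to see what each accessor call does. This is only a sketch; the intermediate names parts and middle, and the new frame df2, are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"val": ["1_2_3", "1_2_3_4", "1_2_3_4_5"]})

parts = df["val"].str.split("_")   # each row becomes a list of fields
middle = parts.str[1:-1]           # drop the first and the last field
df2 = middle.str.join("_").to_frame("val")  # glue the rest back together

print(df2["val"].tolist())  # ['2', '2_3', '2_3_4']
```

No regex is involved here, so nothing in the values needs escaping.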
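Split each string on the characters that lie between r1 at the start of the string and r2 at the end of the string, where r1 = digit_ and r2 = _digit. Because that middle part is a capture group, it is kept in the split result as element 1:
df['val'].str.split(r'(?<=^\d_)(.*?)(?=_\d+$)', regex=True).str[1]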
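To see what that split produces, the same pattern can be tried with the standard re module on plain strings. This is a sketch outside pandas; the third sample value, 12_3_4, is an extra one added here to show a limitation.

```python
import re

# Lookbehind: one digit plus "_" at the very start.
# Lookahead: "_" plus digits at the very end.
# The capture group in between is kept by split().
pattern = re.compile(r"(?<=^\d_)(.*?)(?=_\d+$)")

samples = ["1_2_3", "1_2_3_4", "12_3_4"]
pieces = {s: pattern.split(s) for s in samples}

for s, p in pieces.items():
    print(s, p)
# 1_2_3 ['1_', '2', '_3']
# 1_2_3_4 ['1_', '2_3', '_4']
# 12_3_4 ['12_3_4']
```

The lookbehind only accepts a single leading digit, so a value such as 12_3_4 is not split at all and there is no element 1 to pick.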
You can find the first and the last _ using str.find and str.rfind, and then take the substring between them.
df['val'] = [x[x.find('_')+1:x.rfind('_')] for x in df['val']]
Output:
val
0 2
1 2_3
2 2_3_4
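One thing to watch with the slicing one-liner: if a value contains no underscore, find and rfind both return -1, so the slice becomes x[0:-1] and silently drops the last character. A small helper can make that case explicit. This is a sketch; the name middle_part and the choice to return an empty string when there is no middle are assumptions, not part of the answer above.

```python
def middle_part(s: str) -> str:
    """Return the text between the first and the last underscore of s."""
    first = s.find("_")
    last = s.rfind("_")
    if first == -1 or first == last:
        # Zero or one underscore: nothing lies between two underscores.
        # Returning "" here is an assumption; pick whatever suits your data.
        return ""
    return s[first + 1:last]

values = ["1_2_3", "1_2_3_4", "1_2_3_4_5", "1_2", "abc"]
result = [middle_part(v) for v in values]
print(result)  # ['2', '2_3', '2_3_4', '', '']
```

On a data frame the helper can be applied with df['val'].map(middle_part).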
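You can do it by chaining two calls to str.replace:
df['val'] = df['val'].str.replace(r'^1_', '', regex=True).str.replace(r'_\d$', '', regex=True)
I'm passing two regexes. The first one matches the substring 1_ at the start of the string (that's what the '^' means) and replaces it with an empty string. The second one matches an underscore followed by a single digit at the end of the string (that's what the '$' means) and replaces it with an empty string. Note that this relies on every value starting with 1_ and ending in a one-digit field, as in the sample input.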
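If the first and last fields are not always a literal 1 and a single digit, the two replacements can be folded into one pattern with an alternation. This is a variation on the answer above rather than the same code; the sample values are made up for illustration.

```python
import pandas as pd

# Made-up values with multi-character first and last fields
df = pd.DataFrame({"val": ["1_2_3", "10_2_3_44", "a_b_c_d_e"]})

# Left branch: everything up to and including the first underscore.
# Right branch: the last underscore and everything after it.
cleaned = df["val"].str.replace(r"^[^_]*_|_[^_]*$", "", regex=True)

print(cleaned.tolist())  # ['2', '2_3', 'b_c_d']
```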
Regex-related questions are always fun.
I'll throw one more into the mix. Here's str.extract:
df['new_val'] = df['val'].str.extract('_(.+)_')
Output:
val new_val
0 1_2_3 2
1 1_2_3_4 2_3
2 1_2_3_4_5 2_3_4
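A runnable sketch of the same idea follows. Passing expand=False makes str.extract return a Series instead of a one-column data frame, and the extra row 1_2 is added here only to show what happens when there is no match.

```python
import pandas as pd

df = pd.DataFrame({"val": ["1_2_3", "1_2_3_4", "1_2_3_4_5", "1_2"]})

# The search starts at the leftmost underscore, and the greedy .+ runs
# up to the last underscore, so the group holds everything in between.
df["new_val"] = df["val"].str.extract(r"_(.+)_", expand=False)

print(df)
```

Rows that do not contain two underscores with something between them get a missing value in new_val rather than an empty string.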