Create new df from existing df in pandas - python
What is an efficient pandas command to create a new data frame from an existing data frame that has only one column, named val, applying the following transformation?
Input:
1_2_3
1_2_3_4
1_2_3_4_5
Output:
2
2_3
2_3_4
Remove everything up to and including the first underscore, and also remove everything from the last underscore to the end of the string.
You can use str.replace with a regex that matches the characters up to and including the first _ and those from the last _ to the end of the string, replacing both parts with nothing (the capture group keeps the middle):
# regex=True is needed in pandas 2.0+, where str.replace defaults to literal matching
df['val'] = df['val'].str.replace(r'^[^_]*_(.*)_[^_]*$', r'\1', regex=True)
Output:
val
0 2
1 2_3
2 2_3_4
If you want that single column in a new dataframe, you can convert the resulting Series to one using to_frame:
df2 = df['val'].str.replace(r'^[^_]*_(.*)_[^_]*$', r'\1', regex=True).to_frame()
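For reference, here is a minimal end-to-end sketch of the approach above. It assumes the input column holds the three sample strings from the question; the variable name pattern is only for illustration.

```python
import pandas as pd

# Sample input from the question: a single column named "val".
df = pd.DataFrame({"val": ["1_2_3", "1_2_3_4", "1_2_3_4_5"]})

# ^[^_]*_   everything up to and including the first underscore
# (.*)      the middle part to keep (capture group 1)
# _[^_]*$   the last underscore and everything after it
pattern = r"^[^_]*_(.*)_[^_]*$"

# Build a new one-column dataframe; df itself is not modified.
df2 = df["val"].str.replace(pattern, r"\1", regex=True).to_frame()

print(df2)
```

Because the pattern requires two underscores, a value with fewer than two is left as it is rather than emptied.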
Another way, using str slicing after a split:
df['val'].str.split("_").str[1:-1].str.join("_")
0 2
1 2_3
2 2_3_4
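To see what the split / slice / join chain does to each element, here is a plain-Python sketch of the same steps. The helper name middle_by_split is hypothetical, not part of pandas.

```python
def middle_by_split(s: str) -> str:
    """Drop the first and last '_'-separated fields and rejoin the rest."""
    return "_".join(s.split("_")[1:-1])

samples = ["1_2_3", "1_2_3_4", "1_2_3_4_5"]
results = [middle_by_split(s) for s in samples]
print(results)

# With fewer than two underscores the slice [1:-1] is empty,
# so the result is an empty string rather than the original value.
short = middle_by_split("1_2")
print(repr(short))
```

This differs from the regex replacement, which leaves such values untouched, so the choice may depend on how short inputs should be treated.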
Split the string around the characters that lie between r1 at the start of the string and r2 at the end of the string, where r1 = digit_ and r2 = _digit:
df['val'].str.split(r'(?<=^\d_)(.*?)(?=_\d+$)').str[1]
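The lookaround split can be checked on plain strings with Python's re module. This is only a sketch of how the pattern behaves, not the pandas call itself. Because the pattern contains a capture group, the captured middle part appears in the result list at index 1.

```python
import re

# Lookbehind: a single digit and "_" at the start of the string.
# Lookahead: "_" followed by digits at the end of the string.
pattern = re.compile(r"(?<=^\d_)(.*?)(?=_\d+$)")

pieces = pattern.split("1_2_3_4")
print(pieces)  # prefix, captured middle, suffix

# The lookbehind expects exactly one leading digit, so a two-digit
# prefix does not match and the string comes back unsplit.
no_match = pattern.split("12_3_4")
print(no_match)
```

If the first field can be longer than one digit, one of the other approaches in this thread is probably a safer fit.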
You can find the first and the last _ using str.find and str.rfind, and then take the substring between them.
df['val'] = [x[x.find('_')+1:x.rfind('_')] for x in df['val']]
Output:
val
0 2
1 2_3
2 2_3_4
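One thing to watch with find and rfind is that both return -1 when there is no underscore, so the bare slice silently drops the last character. The sketch below shows that, together with a guarded variant. The helper name between_underscores and the choice to return the value unchanged are assumptions for illustration.

```python
def between_underscores(s: str) -> str:
    """Return the text between the first and last underscore."""
    first = s.find("_")
    last = s.rfind("_")
    if first == -1 or first == last:
        # Assumption: with zero or one underscore, keep the value as is.
        return s
    return s[first + 1:last]

samples = ["1_2_3", "1_2_3_4", "1_2_3_4_5"]
results = [between_underscores(s) for s in samples]
print(results)

# The unguarded slice on a string without underscores:
x = "abc"
unguarded = x[x.find("_") + 1:x.rfind("_")]  # x[0:-1]
print(unguarded)
```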
You can do it by chaining two str.replace calls:
df['val'] = df['val'].str.replace(r'^1_', '', regex=True).str.replace(r'_\d$', '', regex=True)
I'm passing two regexes: the first finds the substring 1_ at the start of the string and replaces it with an empty string; the second finds an underscore followed by a single digit at the end of the string (that's what the '$' means) and replaces it with an empty string. In pandas 2.0+ regex=True must be passed, since str.replace matches literally by default.
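The two patterns above are tied to a leading 1_ and a single trailing digit. As a variation (not the answer's original code), the sketch below uses more general patterns and tests them with re.sub on plain strings. The function name strip_first_and_last is hypothetical.

```python
import re

def strip_first_and_last(s: str) -> str:
    """Remove the first field with its underscore, then the last one."""
    s = re.sub(r"^[^_]*_", "", s)    # up to and including the first "_"
    return re.sub(r"_[^_]*$", "", s)  # from the last "_" to the end

samples = ["1_2_3", "1_2_3_4", "10_2_3_45"]
results = [strip_first_and_last(s) for s in samples]
print(results)
```

The same two patterns should also work in a chained str.replace(..., regex=True) on the column.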
Regex-related questions are always fun. I'll throw one more into the mix. Here's str.extract:
df['new_val'] = df['val'].str.extract('_(.+)_')
Output:
val new_val
0 1_2_3 2
1 1_2_3_4 2_3
2 1_2_3_4_5 2_3_4
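The extract pattern works because .+ is greedy: the match starts at the first underscore and stretches to the last one. Here is a small sketch with re.search on plain strings. The helper name extract_middle is hypothetical.

```python
import re

pattern = re.compile(r"_(.+)_")

def extract_middle(s: str):
    """Return capture group 1, or None when the pattern does not match."""
    m = pattern.search(s)
    return m.group(1) if m else None

samples = ["1_2_3", "1_2_3_4", "1_2_3_4_5"]
results = [extract_middle(s) for s in samples]
print(results)

# With fewer than two underscores there is no match at all.
print(extract_middle("1_2"))
```

In pandas, rows where the pattern does not match come back as missing values from str.extract.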