Convert string column to array of fixed length strings in pandas dataframe
I have a pandas dataframe with a few columns. I want to convert one of the string columns into an array of fixed-length strings.
Here is what the current table looks like:
+-----+--------------------+--------------------+
|col1 | col2 | col3 |
+-----+--------------------+--------------------+
| 1 |Marco | LITMATPHY |
| 2 |Lucy | NaN |
| 3 |Andy | CHMHISENGSTA |
| 4 |Nancy | COMFRNPSYGEO |
| 5 |Fred | BIOLIT |
+-----+--------------------+--------------------+
How can I split each string in col3 into an array of strings of length 3, as shown below? PS: col3 can contain blanks or NaN, and these should be replaced with an empty array.
+-----+--------------------+----------------------------+
|col1 | col2 | col3 |
+-----+--------------------+----------------------------+
| 1 |Marco | ['LIT','MAT','PHY'] |
| 2 |Lucy | [] |
| 3 |Andy | ['CHM','HIS','ENG','STA'] |
| 4 |Nancy | ['COM','FRN','PSY','GEO'] |
| 5 |Fred | ['BIO','LIT'] |
+-----+--------------------+----------------------------+
Use textwrap.wrap:
import textwrap
import pandas as pd
# Wrap each string into chunks of width 3; NaN becomes an empty list
df['col3'].apply(lambda x: textwrap.wrap(x, 3) if pd.notna(x) else [])
If there are strings whose lengths aren't a multiple of 3, the remaining letters end up in the last element. If you only want strings of length 3, you can chain one more apply to get rid of those shorter strings:
# The "x and" guard skips empty lists (NaN rows), where x[-1] would raise IndexError
df['col3'].apply(lambda x: textwrap.wrap(x, 3) if pd.notna(x) else []).\
    apply(lambda x: x[:-1] if x and len(x[-1]) % 3 != 0 else x)
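As a rough, self-contained sketch of the same idea, the two apply steps can be folded into one helper. The name wrap_fixed and the sample values "BIOLITX" and "" are made up for illustration and are not part of the original answer. Note that textwrap.wrap also breaks lines on whitespace, so this assumes the codes contain no spaces.

```python
import textwrap

import numpy as np
import pandas as pd


def wrap_fixed(value, width=3):
    """Split value into chunks of exactly `width` characters; NaN gives []."""
    if pd.isna(value):
        return []
    chunks = textwrap.wrap(value, width)
    # Keep only full-width chunks, which drops a shorter trailing remainder.
    return [c for c in chunks if len(c) == width]


# Hypothetical sample data: one NaN, one string with a leftover letter, one empty string.
df = pd.DataFrame({"col3": ["LITMATPHY", np.nan, "BIOLITX", ""]})
df["col3_split"] = df["col3"].apply(wrap_fixed)
print(df)
```

Filtering by length instead of indexing x[-1] avoids any special handling for rows that produce an empty list.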
Another way is this:
import pandas as pd
import numpy as np
df = pd.DataFrame({"col3":["LITMATPHY",np.nan,"CHMHISENGSTA","COMFRNPSYGEO","BIOLIT"]})
def split_str(s):
    # Collect consecutive slices of length 3
    lst=[]
    for i in range(0,len(s),3):
        lst.append(s[i:i+3])
    return lst
df["col3_result"] = df["col3"].apply(lambda x: [] if pd.isna(x) else split_str(s=x))
# Output
col3 col3_result
0 LITMATPHY [LIT, MAT, PHY]
1 NaN []
2 CHMHISENGSTA [CHM, HIS, ENG, STA]
3 COMFRNPSYGEO [COM, FRN, PSY, GEO]
4 BIOLIT [BIO, LIT]
Using only pandas, we can do:
import pandas as pd
import numpy as np
df = pd.DataFrame(['LITMATPHY', np.nan, '', 'CHFDIOSFF', 'CHFIOD', 'FHDIFOSDFJKL'], columns=['col3'])
def to_list(string, n):
    if string != string: # True if string = np.nan
        lst = []
    else:
        lst = [string[i:i+n] for i in range(0, len(string), n)]
    return lst
df['new_col3'] = df['col3'].apply(lambda x: to_list(x, 3))
Output:
col3 new_col3
0 LITMATPHY [LIT, MAT, PHY]
1 NaN []
2 []
3 CHFDIOSFF [CHF, DIO, SFF]
4 CHFIOD [CHF, IOD]
5 FHDIFOSDFJKL [FHD, IFO, SDF, JKL]
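A possible variation, not taken from the answers above, is to let the pandas string accessor do the splitting with Series.str.findall and a regular expression. This is only a sketch: the sample value "BIOLITX" is made up, and the pattern assumes the strings contain no newline characters, since "." does not match them by default.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, including NaN, an empty string and a leftover letter.
df = pd.DataFrame({"col3": ["LITMATPHY", np.nan, "", "BIOLITX"]})

# ".{3}" matches exactly three characters, so a shorter tail is dropped;
# use ".{1,3}" instead to keep the remainder as a final short chunk.
matches = df["col3"].str.findall(r".{3}")

# Missing values stay missing after findall, so map anything that is not a list to [].
df["new_col3"] = matches.apply(lambda v: v if isinstance(v, list) else [])
print(df)
```

Whether to drop or keep the remainder is then a one-character change in the pattern rather than an extra apply step.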