Convert a string column into an array of fixed-length strings in a pandas dataframe

Question

I have a pandas dataframe with a few columns. I want to convert one of the string columns into an array of fixed-length strings.

Here is how the current table looks:

+-----+--------------------+--------------------+
|col1 |         col2       |         col3       |
+-----+--------------------+--------------------+
|   1 |Marco               | LITMATPHY          |
|   2 |Lucy                | NaN                |
|   3 |Andy                | CHMHISENGSTA       |
|   4 |Nancy               | COMFRNPSYGEO       |
|   5 |Fred                | BIOLIT             |
+-----+--------------------+--------------------+

How can I split the string in col3 into an array of strings of length 3, as follows? PS: There can be blanks or NaN in col3, and they should be replaced with an empty array.

+-----+--------------------+----------------------------+
|col1 |         col2       |         col3               |
+-----+--------------------+----------------------------+
|   1 |Marco               | ['LIT','MAT','PHY']        |
|   2 |Lucy                | []                         |
|   3 |Andy                | ['CHM','HIS','ENG','STA']  |
|   4 |Nancy               | ['COM','FRN','PSY','GEO']  |
|   5 |Fred                | ['BIO','LIT']              |
+-----+--------------------+----------------------------+

Answer 1

Use textwrap.wrap:

import textwrap

import pandas as pd

# Split each string into chunks of up to 3 characters; NaN becomes an empty list.
df['col3'].apply(lambda x: textwrap.wrap(x, 3) if pd.notna(x) else [])

If there are strings whose lengths aren't a multiple of 3, the remaining letters end up in the last element. If you only want strings of length 3, you can chain one more apply to drop that shorter trailing element:

df['col3'].apply(lambda x: textwrap.wrap(x, 3) if pd.notna(x) else []).\
           apply(lambda x: x[:-1] if x and len(x[-1]) % 3 != 0 else x)
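
To see both variants side by side, here is a minimal self-contained sketch. The value "BIOLI" is made up for illustration (it is not in the question's data) so that one row has a length that is not a multiple of 3. The `x and` guard is there because indexing `x[-1]` on the empty list produced for NaN rows would raise an IndexError. Note also that textwrap.wrap treats whitespace as a word boundary, so this sketch assumes the values contain no spaces.

```python
import textwrap

import numpy as np
import pandas as pd

# Sample data; "BIOLI" is a made-up value whose length is not a multiple of 3.
df = pd.DataFrame({"col3": ["LITMATPHY", np.nan, "BIOLI"]})

# Split into chunks of at most 3 characters; NaN becomes an empty list.
chunks = df["col3"].apply(lambda x: textwrap.wrap(x, 3) if pd.notna(x) else [])

# Drop a trailing short chunk; the `x and` guard skips empty lists.
full_only = chunks.apply(lambda x: x[:-1] if x and len(x[-1]) != 3 else x)

print(chunks.tolist())     # [['LIT', 'MAT', 'PHY'], [], ['BIO', 'LI']]
print(full_only.tolist())  # [['LIT', 'MAT', 'PHY'], [], ['BIO']]
```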

Answer 2

Another way is this:

import pandas as pd
import numpy as np
df = pd.DataFrame({"col3":["LITMATPHY",np.nan,"CHMHISENGSTA","COMFRNPSYGEO","BIOLIT"]})

def split_str(s):
    lst=[]
    for i in range(0,len(s),3):
        lst.append(s[i:i+3])
    return lst

df["col3_result"] = df["col3"].apply(lambda x: [] if pd.isna(x) else split_str(s=x))

# Output

           col3           col3_result
0     LITMATPHY       [LIT, MAT, PHY]
1           NaN                    []
2  CHMHISENGSTA  [CHM, HIS, ENG, STA]
3  COMFRNPSYGEO  [COM, FRN, PSY, GEO]
4        BIOLIT            [BIO, LIT]

Answer 3

Using only pandas (plus NumPy for the NaN value), we can do:

import numpy as np
import pandas as pd

df = pd.DataFrame(['LITMATPHY', np.nan, '', 'CHFDIOSFF', 'CHFIOD', 'FHDIFOSDFJKL'], columns=['col3'])

def to_list(string, n):
    if string != string: # True if string = np.nan
        lst = []
    else:
        lst = [string[i:i+n] for i in range(0, len(string), n)]
    return lst

df['new_col3'] = df['col3'].apply(lambda x: to_list(x, 3))

Output:

           col3              new_col3
0     LITMATPHY       [LIT, MAT, PHY]
1           NaN                    []
2                                  []
3     CHFDIOSFF       [CHF, DIO, SFF]
4        CHFIOD            [CHF, IOD]
5  FHDIFOSDFJKL  [FHD, IFO, SDF, JKL]
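
The question also mentions blanks. An empty string already yields an empty list above, but a whitespace-only string such as "   " would come back as a single chunk of spaces. Below is a rough sketch of one way to cover that case, assuming "blank" means empty or whitespace-only. The helper name `split_fixed` is hypothetical and not part of any answer here.

```python
import numpy as np
import pandas as pd


def split_fixed(value, n=3):
    """Hypothetical helper: split a string into chunks of n characters."""
    # Treat NaN/None and blank (empty or whitespace-only) strings as "no data".
    if not isinstance(value, str) or not value.strip():
        return []
    return [value[i:i + n] for i in range(0, len(value), n)]


df = pd.DataFrame({"col3": ["LITMATPHY", np.nan, "", "   ", "BIOLIT"]})
df["new_col3"] = df["col3"].apply(split_fixed)

print(df["new_col3"].tolist())
# [['LIT', 'MAT', 'PHY'], [], [], [], ['BIO', 'LIT']]
```

Checking `isinstance(value, str)` rather than `value != value` also covers None and other non-string entries, at the cost of silently mapping them to an empty list.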

3 answers

Answer 1: 4, accepted, 2022-09-26 10:17:30
Answer 2: 2, 2022-09-26 10:19:59
Answer 3: 1, 2022-09-26 10:21:51
