
Convert string column to array of fixed length strings in pandas dataframe

I have a pandas DataFrame with a few columns. I want to convert one of the string columns into an array of fixed-length strings.

Here is how the current table looks:

+-----+--------------------+--------------------+
|col1 |         col2       |         col3       |
+-----+--------------------+--------------------+
|   1 |Marco               | LITMATPHY          |
|   2 |Lucy                | NaN                |
|   3 |Andy                | CHMHISENGSTA       |
|   4 |Nancy               | COMFRNPSYGEO       |
|   5 |Fred                | BIOLIT             |
+-----+--------------------+--------------------+

How can I split each string in col3 into an array of strings of length 3, as follows? PS: col3 can contain blanks or NaN, and these should be replaced with an empty array.

+-----+--------------------+----------------------------+
|col1 |         col2       |         col3               |
+-----+--------------------+----------------------------+
|   1 |Marco               | ['LIT','MAT','PHY']        |
|   2 |Lucy                | []                         |
|   3 |Andy                | ['CHM','HIS','ENG','STA']  |
|   4 |Nancy               | ['COM','FRN','PSY','GEO']  |
|   5 |Fred                | ['BIO','LIT']              |
+-----+--------------------+----------------------------+

Use textwrap.wrap:

import textwrap
import pandas as pd

# Split each string into 3-character pieces; NaN becomes an empty list
df['col3'].apply(lambda x: textwrap.wrap(x, 3) if pd.notna(x) else [])

If some strings have lengths that are not a multiple of 3, the leftover letters end up in the last element. If you only want strings of length 3, you can chain one more apply to drop those shorter pieces:

df['col3'].apply(lambda x: textwrap.wrap(x, 3) if pd.notna(x) else []).\
           apply(lambda x: x[:-1] if x and len(x[-1]) % 3 != 0 else x)  # "x and" skips empty lists, where x[-1] would raise IndexError
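To make the remainder behaviour concrete, here is a small self-contained sketch. The helper name chunks_of_three, its keep_remainder flag, and the sample values are illustrative and not part of the original answer. It also assumes the strings contain no whitespace, because textwrap.wrap breaks lines at whitespace and would then produce pieces that are not aligned to every third character.

```python
import textwrap

import numpy as np
import pandas as pd


def chunks_of_three(value, keep_remainder=True):
    """Split value into 3-character pieces; NaN becomes an empty list."""
    if pd.isna(value):
        return []
    pieces = textwrap.wrap(value, 3)
    # Optionally drop a trailing piece that is shorter than 3 characters.
    if not keep_remainder and pieces and len(pieces[-1]) != 3:
        pieces = pieces[:-1]
    return pieces


# Hypothetical sample data: "BIOLITX" has one leftover letter.
s = pd.Series(["LITMATPHY", np.nan, "BIOLITX", ""])

with_remainder = s.apply(chunks_of_three).tolist()
without_remainder = s.apply(
    lambda v: chunks_of_three(v, keep_remainder=False)
).tolist()

print(with_remainder)     # the leftover "X" is kept as the last piece
print(without_remainder)  # the leftover "X" is dropped
```

The check on pieces being non-empty matters for NaN and empty strings, which both yield an empty list with no last element to inspect.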

Another way is this:

import pandas as pd
import numpy as np
df = pd.DataFrame({"col3":["LITMATPHY",np.nan,"CHMHISENGSTA","COMFRNPSYGEO","BIOLIT"]})

def split_str(s):
    lst=[]
    for i in range(0,len(s),3):
        lst.append(s[i:i+3])
    return lst

df["col3_result"] = df["col3"].apply(lambda x: [] if pd.isna(x) else split_str(s=x))

# Output

           col3           col3_result
0     LITMATPHY       [LIT, MAT, PHY]
1           NaN                    []
2  CHMHISENGSTA  [CHM, HIS, ENG, STA]
3  COMFRNPSYGEO  [COM, FRN, PSY, GEO]
4        BIOLIT            [BIO, LIT]

Using only pandas, we can do:

import pandas as pd
import numpy as np

df = pd.DataFrame(['LITMATPHY', np.nan, '', 'CHFDIOSFF', 'CHFIOD', 'FHDIFOSDFJKL'], columns=['col3'])

def to_list(string, n):
    if string != string: # True only for NaN, which is not equal to itself
        lst = []
    else:
        lst = [string[i:i+n] for i in range(0, len(string), n)]
    return lst

df['new_col3'] = df['col3'].apply(lambda x: to_list(x, 3))

Output:

           col3              new_col3
0     LITMATPHY       [LIT, MAT, PHY]
1           NaN                    []
2                                  []
3     CHFDIOSFF       [CHF, DIO, SFF]
4        CHFIOD            [CHF, IOD]
5  FHDIFOSDFJKL  [FHD, IFO, SDF, JKL]
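The question also mentions blanks. The function above maps an empty string to an empty list, but a whitespace-only value such as '   ' would come back as one piece. Below is a minimal sketch of one way to cover that case too. The name to_chunks and the sample data are hypothetical, and it assumes that any non-string value (such as NaN) should count as missing.

```python
import numpy as np
import pandas as pd


def to_chunks(value, n=3):
    """Return n-character pieces; NaN, '' and whitespace-only give []."""
    # Assumption: anything that is not a string (e.g. NaN) is treated as missing.
    if not isinstance(value, str) or not value.strip():
        return []
    return [value[i:i + n] for i in range(0, len(value), n)]


# Hypothetical sample data, including an empty and a whitespace-only value.
df = pd.DataFrame({"col3": ["LITMATPHY", np.nan, "", "   ", "BIOLIT"]})
df["new_col3"] = df["col3"].apply(to_chunks)

print(df["new_col3"].tolist())
```

Whether whitespace-only cells should really count as blank depends on the data, so treat the strip() check as a design choice rather than a requirement.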
