
Extract Digits from Pandas column (Object dtype)

I'm having trouble removing non-digit characters from a DataFrame column. I have tried a few methods, but quite a few rows still come back as NaN when the function is applied to the column.

I need the output to contain only digits, as integers (no leading zeros).

Cust #
0   10726
2   11699
5   12963
8   z13307
9   13405
12  14831-001
16  16416
17  16917
18  18027
24  19233z
dtype('O')

I have tried:

Unique_Stores['Cust #2']=Unique_Stores['Cust #2'].str.extract('(\d+)',expand=True)

Unique_Stores['Cust #2'].str.replace(r'(\D+)','')

Unique_Stores['Cust #2'].replace(to_replace="([0-9]+)", value=r"\1", regex=True, inplace=True)

Unique_Stores['Cust #2'] = pd.to_numeric(Unique_Stores['Cust #2'].str.replace(r'\D+', ''), errors='coerce')

But no matter what I do, the first 1000 or so rows return NaN, even when the value is already an integer.

Thank you in advance, and let me know if you need more info.

Link to actual dataset

UPDATE:

In [144]: df = pd.read_csv(r'D:\download\Customer_Numbers.csv', index_col=0)

In [145]: df['Cust #2'] = df['Cust #'].str.replace(r'\D+', '', regex=True).astype(int)

In [146]: df
Out[146]:
      State Zip Code      Cust #  Cust #2
0        PA    16505       10726    10726
2        MI    48103       11699    11699
5        NH     3253       12963    12963
8        PA    18951       13307    13307
9        MA     2360       13405    13405
12       NY    11940       14831    14831
16       OH    44278       16416    16416
17       OH    45459       16917    16917
18       MA     1748       18027    18027
24       NY    14226       19233    19233
...     ...      ...         ...      ...
54393    WA    99207  005611-99    561199
54394    WA    99006        7775     7775
54395    WA    99353        8006     8006
54399    WA    99206        8888     8888
54404    CA    92117      444202   444202
54408    CA    90019       30066    30066
54411    CA    90026      443607   443607
54414    CA    90094        9242     9242
54417    CA    90405        9245     9245
54420    CA    90038        9247     9247

[6492 rows x 4 columns]

In [147]: df.dtypes
Out[147]:
State       object
Zip Code    object
Cust #      object
Cust #2      int32
dtype: object
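
One possible explanation for the NaN values in the question (an assumption, since the original in-memory DataFrame is not shown) is that the object column mixes real integers with strings. The `.str` methods return NaN for elements that are not strings, which would match "NaN even when the value is an integer". A minimal sketch with made-up sample values, casting to `str` before stripping non-digits:

```python
import pandas as pd

# Hypothetical sample: an object column mixing real ints with strings
s = pd.Series([10726, 11699, "z13307", "14831-001", "19233z"], dtype=object)

# .str methods yield NaN for elements that are not strings (the two ints here)
naive = s.str.replace(r"\D+", "", regex=True)

# Casting every element to str first lets all rows go through the regex
cleaned = s.astype(str).str.replace(r"\D+", "", regex=True).astype(int)

print(naive.isna().tolist())
print(cleaned.tolist())
```

Reading the data with `pd.read_csv` as above likely avoids the problem because every value in the column arrives as a string. If the column can also contain missing values, `astype(int)` will fail on them, and `pd.to_numeric(..., errors='coerce')` may be the safer final step.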

OLD answer:

In [123]: df
Out[123]:
          val
0       10726
2       11699
5       12963
8      z13307
9       13405
12  14831-001
16      16416
17      16917
18      18027
24     19233z

In [124]: df['val'] = pd.to_numeric(df['val'].str.replace(r'\D+', '', regex=True), errors='coerce')

In [125]: df
Out[125]:
         val
0      10726
2      11699
5      12963
8      13307
9      13405
12  14831001
16     16416
17     16917
18     18027
24     19233
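
Note that removing every non-digit turns `14831-001` into `14831001`. If only the first run of digits is wanted instead (an assumption about the intent, not something the question states), a sketch using `str.extract` could look like this. The column name `first_digits` is made up for illustration:

```python
import pandas as pd

# Small sample mirroring a few of the values above
df = pd.DataFrame(
    {"val": ["10726", "z13307", "14831-001", "19233z"]},
    index=[0, 8, 12, 24],
)

# Capture only the first contiguous run of digits in each string
first_run = df["val"].str.extract(r"(\d+)", expand=False)

# Convert to numbers; anything without digits would become NaN
df["first_digits"] = pd.to_numeric(first_run, errors="coerce")

print(df)
```

With `expand=False` and a single capture group, `str.extract` returns a Series, which can be assigned straight to a column.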
