简体   繁体   中英

Separate dots and text from numbers in python dataframe

I would like to separate dots and text from numbers in a dataframe.

The dataframe looks like this:

Net.Liq.37584957
Haircut48216354
Deficit10631397
             NaN
Haircutperassetclass
Equity31349682
Commodity12461964
FixedIncome663451
Currency3741257

Tried this: df.col.str.extract('([a-zA-Z]+)([^a-zA-Z]+)', expand=True), but the first row has both dots and text, hence it comes out in this way.

             0         1
0          Net         .
1      Haircut  48216354
2      Deficit  10631397
3          NaN       NaN
4          NaN       NaN
5       Equity  31349682
6    Commodity  12461964
7  FixedIncome    663451
8     Currency   3741257

How do i solve this?

Looks like you need pattern ([a-zA-Z.]+)(\\d+)?

Ex:

df = pd.DataFrame({"Col": ['Net.Liq.37584957', 'Haircut48216354', 'Deficit10631397', 'NaN', 'Haircutperassetclass', 'Equity31349682', 'Commodity12461964', 'FixedIncome663451', 'Currency3741257']})
df[['A', "B"]] = df['Col'].str.extract(r"([a-zA-Z.]+)(\d+)?", expand=True)
print(df)

Output:

                    Col                     A         B
0      Net.Liq.37584957              Net.Liq.  37584957
1       Haircut48216354               Haircut  48216354
2       Deficit10631397               Deficit  10631397
3                   NaN                   NaN       NaN
4  Haircutperassetclass  Haircutperassetclass       NaN
5        Equity31349682                Equity  31349682
6     Commodity12461964             Commodity  12461964
7     FixedIncome663451           FixedIncome    663451
8       Currency3741257              Currency   3741257

You may use

^(.*?)(?:\.?(\d+))?$

See the regex demo

Details

  • ^ - start of string
  • (.*?) - Group 1: any 0+ chars, as few as possible
  • (?:\\.?(\\d+))? - an optional sequence of:
    • \\.? - an optional dot
    • (\\d+) - Group 2: one or more digits
  • $ - end of string.

In code,

df[['A', 'B']] = df['Col'].str.extract(r'(.*?)(?:\.?(\d+))?$', expand=True)

Output:

>>> df
                    Col                     A         B
0      Net.Liq.37584957               Net.Liq  37584957
1       Haircut48216354               Haircut  48216354
2       Deficit10631397               Deficit  10631397
3                   NaN                   NaN       NaN
4  Haircutperassetclass  Haircutperassetclass       NaN
5        Equity31349682                Equity  31349682
6     Commodity12461964             Commodity  12461964
7     FixedIncome663451           FixedIncome    663451
8       Currency3741257              Currency   3741257

Assuming that the column of interest in the source DataFrame has name Txt , run:

df.Txt.str.extract(r'(?P<Letters>[a-z.]*)(?P<Digits>\d*)', flags=re.I)

( import re required).

The result for your data sample is:

                Letters    Digits
0              Net.Liq.  37584957
1               Haircut  48216354
2               Deficit  10631397
3                   NaN       NaN
4  Haircutperassetclass          
5                Equity  31349682
6             Commodity  12461964
7           FixedIncome    663451
8              Currency   3741257

Note: The first column has name Letters , but you wrote that you want to separate:

  • dots and text (actually letters ),
  • from digits ,

so this column contains actually letters and dots .

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM