How to extract alphanumeric word from column values in excel with Python?

I need a way to extract all words that start with 'A' followed immediately by a 6-digit numeric string (e.g. A112233, A000023).

Each cell contains sentences and there could potentially be a user error where they forget to put a space, so if you could account for that as well it would be greatly appreciated.

I've done research into using Python regex and Pandas, but I just don't know enough yet and am kind of on a time crunch.

Excel example

Suppose your DataFrame is built with the following code:

import pandas as pd
df1=pd.DataFrame(
        {
            "columnA":["A194533","A4A556633 system01A484666","A4A556633","a987654A948323a882332A484666","A238B004867","pageA000023lol","a089923","something lol a484876A48466 emoji","A906633 A556633a556633"]
        }
)
print(df1)

Output:

                             columnA
0                            A194533
1          A4A556633 system01A484666
2                          A4A556633
3       a987654A948323a882332A484666
4                        A238B004867
5                     pageA000023lol
6                            a089923
7  something lol a484876A48466 emoji
8             A906633 A556633a556633

Now let's extract every match of the regex pattern (an uppercase 'A' followed by six digits):

result = df1['columnA'].str.extractall(r'([A]\d{6})')

Output:

               0
  match         
0 0      A194533
1 0      A556633
  1      A484666
2 0      A556633
3 0      A948323
  1      A484666
5 0      A000023
8 0      A906633
  1      A556633
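
Because the pattern is not anchored to word boundaries, it also finds codes that are glued to neighbouring text, which covers the missing-space case from the question. The sketch below checks this with the standard `re` module on a few strings. The `STRICT` variant with the `(?!\d)` lookahead and the `A1234567` sample are assumptions added for illustration and are not part of the original answer; they show one way to reject a code that is followed by a seventh digit.

```python
import re

# Pattern from the answer: uppercase 'A' followed by six digits.
PATTERN = re.compile(r"A\d{6}")
# Assumed stricter variant (not in the original answer):
# refuse the match when a seventh digit follows.
STRICT = re.compile(r"A\d{6}(?!\d)")

samples = [
    "pageA000023lol",             # code glued to other text
    "A4A556633 system01A484666",  # two codes, one without a leading space
    "a089923",                    # lowercase 'a' is not matched
    "A1234567",                   # hypothetical 7-digit input
]

for text in samples:
    print(text, PATTERN.findall(text), STRICT.findall(text))
```

If lowercase codes such as a089923 should count as well, passing `flags=re.IGNORECASE` to `str.extractall` is one option, though the question only asks for an uppercase 'A'.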

And count them:

result.value_counts()

Output:

A556633    3
A484666    2
A000023    1
A194533    1
A906633    1
A948323    1
dtype: int64

Put the unique values into a list:

unique_list = [i[0] for i in result.value_counts().index.tolist()]

Output:

['A556633', 'A484666', 'A000023', 'A194533', 'A906633', 'A948323']

Put the counts into a list:

unique_count_list = result.value_counts().values.tolist()

Output:

[3, 2, 1, 1, 1, 1]
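
As a possible alternative, both lists can come from a single `value_counts()` call on the extracted column rather than on the whole result frame, so the index holds plain strings and no tuple unpacking is needed. This is a minimal sketch on a shortened version of the sample data; the names `matches` and `counts` are illustrative and do not appear in the original answer.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "columnA": [
            "A194533",
            "A4A556633 system01A484666",
            "pageA000023lol",
            "A906633 A556633a556633",
        ]
    }
)

# extractall returns one row per match; column 0 holds the captured group.
matches = df1["columnA"].str.extractall(r"(A\d{6})")[0]

# Count once, then read values and counts from the same Series.
counts = matches.value_counts()
unique_list = counts.index.tolist()
unique_count_list = counts.tolist()

print(dict(zip(unique_list, unique_count_list)))
```

Computing the counts once keeps the two lists aligned by construction, since both are read from the same Series.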
