
Extract all the characters after a keyword up to the next space in a pandas DataFrame

I have a DataFrame with a string column whose values contain the keyword "order". The order number does not follow any fixed format, as it can contain special characters. I want to extract the order number together with its special characters. The idea is to extract every character after the keyword "order" up to the next space, in a single query.

      message                                  Model
order 769707-134432 has reached EARLY.       LG
Delivered : order 1765412456                 Samsung
No New Updates : order RS1765123404          Sony
order #769707-41213-4355 is EARLY            Dell
No New Updates : order 3FA1765404            Samsung
order #76923407 has reached EARLY            LG
No New Updates : order R-176543123           Sony
Recheduled : order 100251283_415731301       Sony
order #9T_0312330 delivered                  Dell
order #000090223532 has arrived at pickup.   LG

The required output is:

   message                                   order               Model
order 769707-134432 has reached EARLY       769707-134432        LG
Delivered : order 1765412456                1765412456           Samsung
No New Updates : order RS1765123404         RS1765123404         Sony
order #769707-41213-4355 is EARLY           769707-41213-4355    Dell
No New Updates : order 3FA1765404           3FA1765404           Samsung
order #76923407 has reached EARLY           76923407             LG
No New Updates : order R-176543123          R-176543123          Sony
Recheduled : order 100251283_415731301      100251283_415731301  Sony
order #9T_0312330 delivered                 9T_0312330           Dell
order #000090223532 has arrived at pickup   000090223532         LG

When I tried using a regex, I got results such as "#000090223532 has", "769707-" and "3FA".

Using str.replace, we can try:

# raw strings keep \b as a word boundary and \1 as a backreference
data["order"] = data["message"].str.replace(r"^.*\border #?(\S+)\b.*$", r"\1", regex=True)
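To make the str.replace approach above easier to try out, here is a minimal, self-contained sketch. The three sample rows are copied from the question; the DataFrame name `data` follows the snippet above, while the `orders` variable is only introduced here for illustration. It assumes a pandas version in which `str.replace` accepts the `regex` argument.

```python
import pandas as pd

# Three sample rows taken from the question
data = pd.DataFrame({
    "message": [
        "order 769707-134432 has reached EARLY.",
        "Delivered : order 1765412456",
        "order #9T_0312330 delivered",
    ]
})

# The pattern matches the whole message and replaces it with group 1,
# i.e. the non-space run after "order " (an optional leading '#' is skipped).
# regex=True is needed because pandas 2.0+ treats the pattern literally by default.
data["order"] = data["message"].str.replace(
    r"^.*\border #?(\S+)\b.*$", r"\1", regex=True
)

orders = data["order"].tolist()
print(orders)
```

One thing to keep in mind with this approach: a message that does not match the pattern is left unchanged by str.replace, so the new column would hold the full message for such rows rather than a missing value.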

In my opinion, the cleanest way is to use str.extract:

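import pandas as pd

# dct is assumed to hold the sample data as {'message': [...], 'model': [...]}
df = pd.DataFrame(dct)
# capture the run of non-space characters after "order", skipping an optional '#'
df['order'] = df['message'].str.extract(r'order\s+\#?(\S+)', expand=False)
print(df)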

This yields

                                      message    model                order
0      order 769707-134432 has reached EARLY.       LG        769707-134432
1                Delivered : order 1765412456  Samsung           1765412456
2         No New Updates : order RS1765123404     Sony         RS1765123404
3           order #769707-41213-4355 is EARLY     Dell    769707-41213-4355
4           No New Updates : order 3FA1765404  Samsung           3FA1765404
5           order #76923407 has reached EARLY       LG             76923407
6          No New Updates : order R-176543123     Sony          R-176543123
7      Recheduled : order 100251283_415731301     Sony  100251283_415731301
8                 order #9T_0312330 delivered     Dell           9T_0312330
9  order #000090223532 has arrived at pickup.       LG         000090223532

See a demo of the expression on regex101.com.
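Since the answer does not show how `dct` is built, the following is a sketch of a complete, runnable version under the assumption that `dct` is a plain dict with `message` and `model` keys holding the rows from the question. The names `pattern`, `extracted` and `missing` are introduced here for illustration only.

```python
import pandas as pd

# Assumed layout of dct: the ten sample rows from the question
dct = {
    "message": [
        "order 769707-134432 has reached EARLY.",
        "Delivered : order 1765412456",
        "No New Updates : order RS1765123404",
        "order #769707-41213-4355 is EARLY",
        "No New Updates : order 3FA1765404",
        "order #76923407 has reached EARLY",
        "No New Updates : order R-176543123",
        "Recheduled : order 100251283_415731301",
        "order #9T_0312330 delivered",
        "order #000090223532 has arrived at pickup.",
    ],
    "model": ["LG", "Samsung", "Sony", "Dell", "Samsung",
              "LG", "Sony", "Sony", "Dell", "LG"],
}
df = pd.DataFrame(dct)

# "order", whitespace, an optional '#', then capture everything up to the next space
pattern = r"order\s+\#?(\S+)"

# expand=False returns a Series for a single capture group
df["order"] = df["message"].str.extract(pattern, expand=False)
extracted = df["order"].tolist()

# A message without the keyword yields a missing value instead of raising
missing = pd.Series(["status unknown"]).str.extract(pattern, expand=False)

print(df)
```

Because `\S+` takes every non-space character, an order number that is immediately followed by punctuation (for example a message ending in "order 123.") would be captured with that punctuation attached; none of the sample rows has this shape.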
