How to extract information from a pandas DataFrame column

I have the dataframe below and I want to extract some information from column A, then create new columns that hold the extracted values according to their type. The example below illustrates this.

In [0]: df
Out[0]: 
          A                  
0 1258GA 25/01/20 TABLE 090626  038272
1 GOODIES 762088 A714816
2 TABLE AA88547 734963 GOODIES
3 WATER 02/450 FROM TOMORROW 48246
4 02H12 ALSCA 00548246B GOODIES
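
For anyone who wants to reproduce this, the frame above can be rebuilt roughly as follows. This is only a sketch: the strings are copied from the printed output, and the exact whitespace inside each string is an assumption.

```python
import pandas as pd

# Rows copied from the printed frame in the question.
# The spacing inside each string is assumed from the display above.
df = pd.DataFrame({
    "A": [
        "1258GA 25/01/20 TABLE 090626  038272",
        "GOODIES 762088 A714816",
        "TABLE AA88547 734963 GOODIES",
        "WATER 02/450 FROM TOMORROW 48246",
        "02H12 ALSCA 00548246B GOODIES",
    ]
})
print(df)
```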

And I want to get the result below.

In [1]: df
Out[1]: 
          A                               Category             Date      Hour
0 1258GA 25/01/20 TABLE 090626  038272    TABLE           25/01/20
1 GOODIES 762088 A714816                  GOODIES 
2 TABLE AA88547 734963 GOODIES            TABLE GOODIES
3 WATER 02/450 FROM TOMORROW 48246        WATER 
4 02H12 ALSCA 00548246B GOODIES           GOODIES                        02H12

I've tried many things but haven't managed to get that result.

Maybe this helps:

df['A'].str.findall(r'\b[A-Z]+\b').str.join(' ')

0                  TABLE
1                GOODIES
2          TABLE GOODIES
3    WATER FROM TOMORROW
4          ALSCA GOODIES
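
Note that this pattern keeps every standalone upper-case word, so rows 3 and 4 also pick up FROM, TOMORROW and ALSCA, which the desired output does not list. If the set of valid categories is known in advance, one option is to match only those words. A minimal sketch, assuming the categories are TABLE, GOODIES and WATER (the `categories` list is hypothetical and inferred from the expected result in the question):

```python
import re
import pandas as pd

df = pd.DataFrame({
    "A": [
        "1258GA 25/01/20 TABLE 090626  038272",
        "GOODIES 762088 A714816",
        "TABLE AA88547 734963 GOODIES",
        "WATER 02/450 FROM TOMORROW 48246",
        "02H12 ALSCA 00548246B GOODIES",
    ]
})

# Assumption: the valid categories are known up front (hypothetical list).
categories = ["TABLE", "GOODIES", "WATER"]

# Build an alternation such as \b(?:TABLE|GOODIES|WATER)\b.
# The non-capturing group makes findall return the whole matched word.
pattern = r"\b(?:" + "|".join(re.escape(c) for c in categories) + r")\b"

df["Category"] = df["A"].str.findall(pattern).str.join(" ")
print(df["Category"])
```

With a fixed vocabulary like this, words such as ALSCA are simply ignored. The trade-off is that any new category has to be added to the list by hand.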

You can certainly do that using the Series.str methods.

Series.str.extract() is documented as:

Extract capture groups in the regex pat as columns in a DataFrame.

For each subject string in the Series, extract groups from the first match of regular expression pat.
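
To illustrate the "first match" behaviour, here is a small sketch on made-up strings (the sample values are hypothetical and not taken from the question). Passing `expand=False` with a single capture group returns a Series instead of a one-column DataFrame.

```python
import pandas as pd

# Hypothetical sample values, only for illustration.
s = pd.Series(["25/01/20 then 26/01/20", "no date here"])

# Only the first match per row is extracted; rows without a match give NaN.
first_date = s.str.extract(r"(\d+/\d+/\d+)", expand=False)
print(first_date)
```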


Series.str.findall() is documented as:

Find all occurrences of pattern or regular expression in the Series/Index.
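
In contrast to extract(), findall() collects every match in a row. A quick sketch, using one row from the question plus one made-up row that contains no words:

```python
import pandas as pd

# First value is row 2 from the question; the second is a hypothetical row without letters.
s = pd.Series(["TABLE AA88547 734963 GOODIES", "762088"])

# Each element of the result holds all matches for that row (empty when nothing matches).
# "AA88547" is skipped because there is no word boundary between "AA" and "8".
words = s.str.findall(r"\b[A-Z]+\b")

# Joining the matches turns them back into a single string per row.
joined = words.str.join(" ")
print(words)
print(joined)
```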

Here is the code snippet:

EDIT:

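# Category: every standalone alphabetic word in the row, joined with spaces
df["Category"] = df['A'].str.findall(r"(\b[A-Za-z]+\b)").str.join(' ')
# Date: first digits/digits/digits token (NaN if none); expand=False returns a Series
df["Date"] = df['A'].str.extract(r"(\b[0-9]+/[0-9]+/[0-9]+\b)", expand=False)
# Hour: first digits-H-digits token such as 02H12 (NaN if none)
df["Hour"] = df['A'].str.extract(r"(\b[0-9]+H[0-9]+\b)", expand=False)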

And the output will be:

                                      A             Category      Date   Hour
0  1258GA 25/01/20 TABLE 090626  038272                TABLE  25/01/20    NaN
1                GOODIES 762088 A714816              GOODIES       NaN    NaN
2          TABLE AA88547 734963 GOODIES        TABLE GOODIES       NaN    NaN
3      WATER 02/450 FROM TOMORROW 48246  WATER FROM TOMORROW       NaN    NaN
4         02H12 ALSCA 00548246B GOODIES        ALSCA GOODIES       NaN  02H12
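
The expected result in the question shows blank cells rather than NaN. If that matters, the missing values can be replaced with empty strings. The sketch below also tightens the patterns, assuming dates are always written as two digits per part and hours as two digits on each side of the H; those formats are an assumption based on the five sample rows only.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [
        "1258GA 25/01/20 TABLE 090626  038272",
        "GOODIES 762088 A714816",
        "TABLE AA88547 734963 GOODIES",
        "WATER 02/450 FROM TOMORROW 48246",
        "02H12 ALSCA 00548246B GOODIES",
    ]
})

# Assumption: dates look like NN/NN/NN and hours like NNHNN.
# expand=False keeps the result a Series; fillna("") gives blank cells instead of NaN.
df["Date"] = df["A"].str.extract(r"\b(\d{2}/\d{2}/\d{2})\b", expand=False).fillna("")
df["Hour"] = df["A"].str.extract(r"\b(\d{2}H\d{2})\b", expand=False).fillna("")
print(df[["Date", "Hour"]])
```

A token like 02/450 is left alone because it has only two parts, so it does not fit the date pattern.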
