
How to extract a single pattern present in a string more than once in pandas

I have the following data:

Description
4 GB+ 2 GB Night 3G/2G Data
Unlimited Local & STD Calls + 500 MB 3G/2G Data (T&C apply)
9GB + 8GB night data+ 6GB next night data
8 GB data 4G

What I want is to extract the amounts of data (4GB etc.) and merge them into a single column.

df2=df['Description'].str.extract('([0-9]+(\.[0-9][0-9]?)?\s?GB|[0-9]+(\.[0-9][0-9]?)?\s?MB)')

I have also tried the pandas function extractall(), but both extract() and extractall() give me a result like this:

0     1    2
4GB   Nan  Nan     #2 gb is missing
500MB Nan  Nan   
9GB   Nan  Nan     # 8gb 6 gb is missing
8Gb   Nan  Nan

Where am I going wrong? Also, when combining the values after df.fillna(' '), I get output like this:

     0 
    4GB,2GB, 
    500MB, , 
    9GB,8GB,6GB
    8GB, , 

though what I want is:

0
4GB,2GB
500MB
9GB,8GB,6GB
8GB

I don't want the spaces. Is there any way in pandas to get the data in the above format? I am a beginner in Python and don't know how to achieve this. If there is another way, please mention it.

EDIT:

This is the full code:

df2 = df['Description'].str.extractall(r'([0-9]+(\.[0-9][0-9]?)?\s?GB|[0-9]+(\.[0-9][0-9]?)?\s?MB)')
# print(df2)
df2[1].fillna("", inplace=True)
df2[2].fillna("", inplace=True)
print(df2)
df3 = df2[0] + ',' + df2[1] + ',' + df2[2]
print(df3)

Using extractall should work, as shown here:

df.Description.str\
  .extractall('(\d*\s?[GM]B)').groupby(level=0)\
  .apply(lambda x: ','.join(x[0])\
  .replace(' ',''))
Out[75]: 
0        4GB,2GB
1          500MB
2    9GB,8GB,6GB
3            8GB
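dtype: object

# Group the number alternatives so the unit applies to both; the original
# alternation could match a bare decimal with no GB/MB after it.
df2 = df['Description'].str.extractall(r'((?:\d*\.\d+|\d+)\s?[GM]B)').reset_index()
# One row per original index, one column per match number.
df2 = pd.pivot_table(df2, index='level_0', columns='match', values=0, aggfunc='last').reset_index(drop=True)
# Join the non-null matches of each row and drop the inner spaces.
df2 = df2.apply(lambda row: ','.join(row.dropna()).replace(' ', ''), axis=1)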

Try this code for your expected output.
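
For anyone wondering why the question's attempt produced the extra NaN columns: each pair of capturing parentheses in the pattern becomes its own output column, so the optional decimal groups show up as columns 1 and 2, and extract() only ever returns the first match per row. Below is a minimal, self-contained sketch of the extractall approach that uses non-capturing groups `(?:...)` so only one column comes back. The DataFrame is rebuilt from the sample rows in the question; the names `PATTERN`, `matches` and `result` are just illustrative, and the pattern assumes upper-case GB/MB units as in the sample data.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "Description": [
        "4 GB+ 2 GB Night 3G/2G Data",
        "Unlimited Local & STD Calls + 500 MB 3G/2G Data (T&C apply)",
        "9GB + 8GB night data+ 6GB next night data",
        "8 GB data 4G",
    ]
})

# Exactly one capturing group; the inner (?:...) groups add no columns.
PATTERN = r"(\d+(?:\.\d+)?\s?[GM]B)"

# extractall returns one row per match, indexed by (original row, match number).
matches = df["Description"].str.extractall(PATTERN)[0]

result = (
    matches.str.replace(" ", "", regex=False)  # "4 GB" -> "4GB"
    .groupby(level=0)                          # regroup by original row
    .agg(",".join)                             # "4GB,2GB"
    .reindex(df.index, fill_value="")          # keep rows that had no match
)
print(result.tolist())
```

The reindex step is optional; without it, rows that contain no GB/MB amount simply disappear from the result.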

You can get the data as columns like this:

df2=df['Description'].str.extractall('([0-9]+(\.[0-9][0-9]?)?\s?GB|[0-9]+(\.[0-9][0-9]?)?\s?MB)')
df2.reset_index().groupby('match')[0].apply(lambda x: "{%s}" % ', '.join(x)).apply(lambda x:x.replace(" ",""))

Output:

match
0    {4GB,500MB,9GB,8GB}
1              {2GB,8GB}
2                  {6GB}
Name: 0, dtype: object
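
If what you need is one joined string per row rather than one group per match position, another option is to skip extractall and apply a compiled regex to each row. This is only a sketch under the same assumptions as the question's sample data (upper-case GB/MB units); `SIZE_RE`, `join_sizes` and the `Data` column are hypothetical names chosen for illustration.

```python
import re

import pandas as pd

# A number with an optional decimal part, an optional space, then GB or MB.
SIZE_RE = re.compile(r"\d+(?:\.\d+)?\s?[GM]B")


def join_sizes(text):
    """Return all data amounts found in text, comma-joined, without spaces."""
    return ",".join(m.replace(" ", "") for m in SIZE_RE.findall(text))


# Sample rows copied from the question.
df = pd.DataFrame({
    "Description": [
        "4 GB+ 2 GB Night 3G/2G Data",
        "Unlimited Local & STD Calls + 500 MB 3G/2G Data (T&C apply)",
        "9GB + 8GB night data+ 6GB next night data",
        "8 GB data 4G",
    ]
})

# Apply the helper to every row; rows without a match get an empty string.
df["Data"] = df["Description"].map(join_sizes)
print(df["Data"].tolist())
```

Because the pattern has no capturing groups, findall returns the whole match each time, so there are no extra columns to clean up afterwards.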

