简体   繁体   English

从熊猫数据框中提取字符串

[英]Extracting strings from a pandas dataframe

I'm here again hoping to find a solution to my coding nightmare. 我再次在这里希望找到解决我的编码梦night的方法。 I have a dictionary term_dict with list of terms as keys and term category as values. 我有一本字典term_dict ,其中术语列表作为键,术语类别作为值。 And a dataframe data with ID and Notes columns. 并带有ID和Notes列的数据框data The task is to find matching terms in the data.Notes using the term_dict for every data.ID record. 任务是找到匹配的条件data.Notes使用term_dict为每data.ID记录。

  term_dict{                   
  Ibuprofen 800mg     :     Drug
  Hip Replacement Surgery : Treatment
  Tylenol   AM          :   Drug
  Mild Dislocation      :   Treatment
  Advil                  :  Drug
  Fractured Tibia       :   Treatment
  Quinone               :   Drug
  Fever                 :   Treatment
  Penicillin 250mg      :   Drug
  Histerectomy          :   Treatment
  Surgical removal of bunion :  Treatment
  Therapy               :   Treatment
  Bunion                :   Treatment
  Hospita X             :   Location
  mg                    :   Dosage
  stop                  :   Exclusion
}

data:
ID      Notes                        
604     Take 2 tablets of advil & 3 caps of pen   
        250mg twice daily                          
602     Stop pen but cont. with advil
        as needed for the fracture
210     2 tabs of Tyl 3x daily for 5 days         
607     nan
700     surgery scheduled for 01/01/2017
515     nan                                       
019     Call my office if bunion pain persist     
        after 3 days
604     f/up appt. @Hospital X 

So far, this is my code: 到目前为止,这是我的代码:

lists = []
for s in data['Notes']:
    cleanNotes = " " + " ".join(re.split(r'[^a-z 0-9]|[w/]',s.lower())) + " "
    for k, v in term_dict.items():
        k = " %s "%k
        if k in cleanNotes and v != exclusion:
            if k in cleanNotes and v == 'drug':
                lists.append(k)
                data['Drug'] = ':'.join(str(lists))
            elif k in cleanNotes and v == 'location':
                lists.append(k)
                data['Location'] = ' '.join(str(lists))
            elif k in cleanNotes and v == 'treatment':
                lists.append(k)
                data['Treatment'] = ':'.join(str(lists))
            elif k in cleanNotes and v == 'dosage':
                lists.append(k)
                data['Dosage'] = ':'.join(str(lists))
        else:
            for s in data.Notes:
            matches = list(datefinder.find_dates(s.lower()))
            data['Date'] = ', '.join([str(dates) for dates in matches])

....and my output is not what's expected because the code just populates the new columns of the dataframe with matches from he last record of the dataframe: ....我的输出与预期的不一样,因为代码只是使用数据框最后一条记录中的匹配项填充数据框的新列:

data:
ID      Notes                                       Drug                Dosage      Location        Treatment   Date                
604     Take 2 tablets of advil & 3 caps of pen     advil                           Hospital X
        250mg twice daily                          
602     Stop pen but cont. with advil               advil                           Hospital X  
        as needed for the fracture
210     2 tabs of Tyl 3x daily for 5 days           advil    
607     nan                                         advil
700     surgery scheduled for 01/01/2017            advil                                               
515     nan                                         advil
019     Call my office if bunion pain persist       advil                                               
        after 3 days
604     f/up appt. @Hospital X. cont w/advil        advil                           Hospital X

***But expected Output: ***但预期输出:

data:
ID      Notes                                       Drug                Dosage      Location        Treatment   Date                
604     Take 2 tablets of advil & 3 caps of pen     advil:penicilin     0:250mg
        250mg twice daily                          
602     Stop pen but cont. with advil               advil                                           fracture
        as needed for the fracture
210     2 tabs of Tyl 3x daily for 5 days           Tylenol
607     nan
700     surgery scheduled for 01/01/2017                                                            surgery     01/01/2017
515     nan                                       
019     Call my office if bunion pain persist                                                       bunion
        after 3 days
604     f/up appt. @Hospital X. cont w/advil        advil                           Hospital X

I'd be more than grateful if I can get this duplication fixed. 如果能解决此重复问题,我将不胜感激。 Thanks! 谢谢!

The essence of your error is this. 您的错误的本质是这样。 You assign every element of that column to the same value: 您可以将该列的每个元素分配给相同的值:

In [114]: import pandas as pd

In [115]: df = pd.DataFrame(np.random.randn(50, 4), columns=list('ABCD'))

In [116]: df.head()
Out[116]: 
          A         B         C         D
0 -0.896291 -0.277551  0.926559  0.522212
1 -0.265559 -1.300435 -0.079514 -1.083569
2 -0.534509  0.298264 -1.361829  0.750666
3  0.318937 -0.407164  0.080020  0.499435
4 -0.161574 -1.012471  0.631092  1.368540

In [117]: df['NewCol'] = 'something here'

In [119]: df.head()
Out[119]: 
          A         B         C         D          NewCol
0 -0.896291 -0.277551  0.926559  0.522212  something here
1 -0.265559 -1.300435 -0.079514 -1.083569  something here
2 -0.534509  0.298264 -1.361829  0.750666  something here
3  0.318937 -0.407164  0.080020  0.499435  something here
4 -0.161574 -1.012471  0.631092  1.368540  something here

To fix this, what you can do is create empty columns up front, like this: 要解决此问题,您可以做的是在前面创建空列,如下所示:

In [120]: df = pd.DataFrame(np.random.randn(50, 1), columns=['Notes'])

In [121]: df['Drug'] = ""
     ...: df['Location'] = ""
     ...: df['Treatment'] = ""
     ...: df['Dosage'] = ""
     ...: 

In [122]: df.head()
Out[122]: 
      Notes Drug Location Treatment Dosage
0  0.325993                               
1 -0.561066                               
2  0.555040                               
3  0.001332                               
4  0.400009  

When looping over notes, use an enumerated loop: 循环注释时,请使用枚举循环:

for i, s in enumerate(data['Notes']):

Then, when the need arises, just set that cell appropriately: 然后,在需要时,只需适当地设置该单元格即可:

df.set_value(i, 'Drug', 'advil')

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM