
Pandas merging columns that contain the same information, but slightly different column names

I have a collection of Excel spreadsheets from CMS (Medicare) that I want to analyze, and I have successfully imported them into a dataframe using pandas. Unfortunately, the column names are not uniform: many are similar but vary due to random spaces, newlines, or extra information. Example:

  • 'Vascular or Circulatory Disease'
  • 'Vascular or Circulatory Disease (CC 104-106)'
  • 'Vascular or Circulatory Disease '

OR

  • 'ID\nNumber'
  • 'ID \nNumber'
  • 'ID Number'

I would simply rename the columns individually (as in "pandas: Merge two columns with different names?"), but I have over 350 columns, and there is a high probability that the column names will change in the future.

One idea is to use regex to build cases that match the names, but I am finding it difficult to capture every case, and new cases are likely to appear in the future. Another idea is to use NLP to soft-match columns.

Any suggestions or libraries? Thank you!

You can compare the similarity between strings using the built-in difflib library:

from difflib import SequenceMatcher

def get_sim_ratio(x, y):
    return SequenceMatcher(None, x, y).ratio()

print(get_sim_ratio('Vascular or Circulatory Disease', 'Vascular or Circulatory Disease (CC 104-106)'))
print(get_sim_ratio('Endocrine Disease', 'Vascular or Circulatory Disease (CC 104-106)'))

This outputs:

0.8266666666666667
0.36065573770491804

Using that ratio, you can set a sensitivity threshold for merging the columns (i.e., merge if the ratio is greater than 0.5).
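The thresholding idea can be sketched roughly as follows. This is only an illustration, not a tested solution for the CMS files: the helper names `normalize` and `canonical_name`, the hand-written `canonical` list, and the 0.8 threshold are all assumptions made for the example. Names are normalized first (whitespace collapsed, a trailing parenthetical dropped) so that trivial differences do not depend on the threshold at all.

```python
import re
from difflib import SequenceMatcher

def normalize(name):
    """Collapse whitespace/newlines and drop a trailing parenthetical such as '(CC 104-106)'."""
    name = re.sub(r"\s+", " ", str(name)).strip()
    name = re.sub(r"\s*\([^)]*\)$", "", name)
    return name.lower()

def canonical_name(name, canonical, threshold=0.8):
    """Return the most similar canonical name, or the original name if none is similar enough."""
    best, best_ratio = name, 0.0
    for candidate in canonical:
        ratio = SequenceMatcher(None, normalize(name), normalize(candidate)).ratio()
        if ratio > best_ratio:
            best, best_ratio = candidate, ratio
    return best if best_ratio >= threshold else name

# Hypothetical canonical list and raw headers, for illustration only
canonical = ["ID Number", "Vascular or Circulatory Disease", "Endocrine Disease"]
raw = [
    "ID\nNumber",
    "ID \nNumber",
    "Vascular or Circulatory Disease (CC 104-106)",
    "Vascular or Circulatory Disease ",
    "Some Other Column",
]

# Map every raw header to its canonical name (unmatched headers map to themselves)
mapping = {name: canonical_name(name, canonical) for name in raw}
```

The resulting dictionary can be passed to `df.rename(columns=mapping)`. Headers that fall below the threshold are left unchanged, so they can be reviewed by hand rather than merged silently.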

If the columns are the same but just labelled a bit differently, you can manually create a standard list of column names and set all the data frames to use it. That is, column 1 is always some variation of 'ID Number' and column 2 is always some variation of 'Vascular or Circulatory Disease', but the labels are written differently from file to file.

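import pandas as pd

# `files` is assumed to be a list of paths to the Excel spreadsheets
data_frames = []
for file in files:
    df = pd.read_excel(file)
    # Overwrite the headers by position with the standard names
    df.columns = ['ID Number', 'Vascular or Circulatory Disease'] # and so forth
    data_frames.append(df)

# Stack all files into a single data frame
combined = pd.concat(data_frames)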
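Because this assigns names purely by position, it may be worth checking that each file's original headers at least resemble the standard names before overwriting them. A minimal sketch, assuming a hypothetical helper `check_headers` and reusing the 0.5 similarity cutoff suggested in the other answer:

```python
from difflib import SequenceMatcher

def check_headers(original, standard, threshold=0.5):
    """Return (position, original, standard) for headers that look unrelated
    to the standard name at the same position. Extra headers are ignored."""
    problems = []
    for i, (orig, std) in enumerate(zip(original, standard)):
        # Collapse whitespace/newlines and ignore case before comparing
        a = " ".join(str(orig).split()).lower()
        b = " ".join(str(std).split()).lower()
        if SequenceMatcher(None, a, b).ratio() < threshold:
            problems.append((i, orig, std))
    return problems

# Hypothetical headers: the second column is not what the standard list expects
standard = ["ID Number", "Vascular or Circulatory Disease"]
original = ["ID\nNumber", "ZIP"]
problems = check_headers(original, standard)
```

In the loop above, this could be called as `check_headers(list(df.columns), standard)` before assigning `df.columns`, and a non-empty result would flag the file for manual inspection.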

If the files share a consistent set of columns, except that some have extra columns at the end (e.g., a column was added or removed at some point):

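import numpy as np

def set_columns(data, columns):
    # `columns` is the full standard list; `data` may be missing trailing columns
    if len(data.columns) < len(columns):
        # Negative count of missing columns, used for slicing from the end
        diff = len(data.columns) - len(columns)
        # Name the existing columns with the leading standard names
        data.columns = columns[:diff]
        # Add missing columns
        for i in range(diff, 0):
            data[columns[i]] = np.nan
    else:
        # Raises ValueError if `data` has more columns than `columns`
        data.columns = columns
    return data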
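A similar result can probably be reached with pandas' own `DataFrame.reindex`, which adds any missing columns filled with NaN. The following is a sketch of that alternative rather than the code above; the function name `align_columns` and the sample data are made up for illustration, and it works on a copy instead of modifying the input in place.

```python
import pandas as pd

def align_columns(data, columns):
    """Name the existing columns by position, then add missing trailing columns as NaN."""
    data = data.copy()
    # Assumes `data` has no more columns than the standard list
    data.columns = columns[:len(data.columns)]
    # reindex keeps existing columns and creates the missing ones filled with NaN
    return data.reindex(columns=columns)

# Hypothetical standard list and an older file that lacks the last column
columns = ["ID Number", "Vascular or Circulatory Disease", "Endocrine Disease"]
old = pd.DataFrame([[1, 0.2]], columns=["ID\nNumber", "Vascular or Circulatory Disease "])

aligned = align_columns(old, columns)
```

Frames aligned this way all share the same column order, so they can be passed to `pd.concat` as in the earlier loop.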


