Pandas: merging columns that contain the same information but have slightly different column names
I have a collection of Excel spreadsheets from CMS (Medicare) that I want to analyze, and I have successfully imported them into a dataframe using pandas. Unfortunately, the column names are not uniform: many are similar but differ because of stray spaces, newlines, or extra information. Example:
OR
I would simply rename the columns individually (as in "pandas: Merge two columns with different names?"), but I have over 350 columns and there is a high probability that the column names will change in the future.
One idea is to use regex to build cases that match the names, but I am finding it difficult to capture every case, and new cases are likely to appear in the future. Another idea is to use NLP to soft-match the columns.
Any suggestions or libraries? Thank you!
You can compare the similarity between strings using the built-in difflib library:
from difflib import SequenceMatcher

def get_sim_ratio(x, y):
    return SequenceMatcher(None, x, y).ratio()

print(get_sim_ratio('Vascular or Circulatory Disease', 'Vascular or Circulatory Disease (CC 104-106)'))
print(get_sim_ratio('Endocrine Disease', 'Vascular or Circulatory Disease (CC 104-106)'))
This outputs:
0.8266666666666667
0.36065573770491804
Using that output, you can set a sensitivity threshold for merging columns (i.e., if the ratio is greater than 0.5, merge).
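As a rough sketch of how that threshold could be applied, the snippet below maps each raw header to the most similar name in a hand-written list of standard names and renames only when the ratio exceeds the threshold. The helper names `normalize` and `build_rename_map`, the `standard_names` list, and the sample frame are all assumptions made for illustration; the 0.5 cutoff is the one suggested above and would likely need tuning on real headers.

```python
import re
from difflib import SequenceMatcher

import pandas as pd


def normalize(name):
    """Collapse runs of whitespace (including newlines) and trim the ends."""
    return re.sub(r"\s+", " ", str(name)).strip()


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_rename_map(columns, standard_names, threshold=0.5):
    """Map each raw column name to its most similar standard name.

    Columns whose best ratio does not exceed the threshold are left out,
    so they keep their original name and can be reviewed by hand.
    """
    rename_map = {}
    for col in columns:
        cleaned = normalize(col)
        best = max(standard_names, key=lambda s: similarity(cleaned, s))
        if similarity(cleaned, best) > threshold:
            rename_map[col] = best
    return rename_map


# Hypothetical standard names and a frame with messy headers
standard_names = ["Vascular or Circulatory Disease", "Endocrine Disease"]
df = pd.DataFrame(
    [[1, 0]],
    columns=["Vascular or  Circulatory\nDisease (CC 104-106)", " Endocrine Disease "],
)
rename_map = build_rename_map(df.columns, standard_names)
df = df.rename(columns=rename_map)
print(list(df.columns))
```

Two raw headers can end up mapped to the same standard name, so it is worth checking the renamed frame for duplicate column names before concatenating.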
If the columns are the same but just labelled a bit differently, you can manually create a standard list of column names and set all the data frames to use it. That is, column 1 is always some variation on 'ID Number' and column 2 is always some variation on 'Vascular or Circulatory Disease', but the labels are written differently from file to file.
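import pandas as pd

data_frames = []
for file in files:  # files: the list of spreadsheet paths
    df = pd.read_excel(file)
    # Overwrite the headers by position with the standard names
    df.columns = ['ID Number', 'Vascular or Circulatory Disease']  # and so forth
    data_frames.append(df)
combined = pd.concat(data_frames)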
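Renaming by position silently mislabels data if a file has its columns in a different order. One possible safeguard, sketched here as an assumption rather than as part of the answer, is to check that each existing header at least resembles the standard name it is about to receive. The function `assign_standard_columns`, the `min_ratio` value, and the in-memory frames (standing in for spreadsheets read with `pd.read_excel`) are hypothetical.

```python
from difflib import SequenceMatcher

import pandas as pd


def assign_standard_columns(df, standard_names, min_ratio=0.5):
    """Rename columns by position, refusing if a header looks unrelated."""
    if len(df.columns) != len(standard_names):
        raise ValueError(
            f"expected {len(standard_names)} columns, found {len(df.columns)}"
        )
    for old, new in zip(df.columns, standard_names):
        ratio = SequenceMatcher(None, str(old).lower(), new.lower()).ratio()
        if ratio < min_ratio:
            raise ValueError(f"column {old!r} does not resemble {new!r}")
    out = df.copy()  # leave the caller's frame untouched
    out.columns = standard_names
    return out


# In-memory frames stand in for the imported spreadsheets
frames = [
    pd.DataFrame({"ID Number ": [1], "Vascular or Circulatory Disease (CC 104-106)": [0]}),
    pd.DataFrame({"ID number": [2], "Vascular or Circulatory Disease": [1]}),
]
standard = ["ID Number", "Vascular or Circulatory Disease"]
combined = pd.concat(
    [assign_standard_columns(f, standard) for f in frames], ignore_index=True
)
print(combined)
```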
And if you have a consistent set of columns, except that some files have more at the end (e.g., a column was added or removed at some point):
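import numpy as np

def set_columns(data, columns):
    # columns: the full standard list of names, in order
    if len(data.columns) < len(columns):
        # diff is negative: the number of trailing columns this file lacks
        diff = len(data.columns) - len(columns)
        data.columns = columns[:diff]
        # Add missing columns
        for i in range(diff, 0):
            data[columns[i]] = np.nan
    else:
        # Same number of columns: rename directly
        data.columns = columns
    return data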
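For comparison, the same idea can be written with `DataFrame.reindex`, which adds any missing columns filled with NaN. This is an alternative sketch, not the code from the answer: the name `set_columns_reindex`, the three standard names, and the sample frames are assumptions, and it raises an error when a file has more columns than the standard list instead of guessing.

```python
import pandas as pd


def set_columns_reindex(data, columns):
    """Name existing columns by position, then add missing trailing ones as NaN."""
    if len(data.columns) > len(columns):
        raise ValueError("frame has more columns than the standard list")
    out = data.copy()  # avoid mutating the caller's frame
    out.columns = columns[: len(out.columns)]
    # reindex keeps the named columns and creates the absent ones filled with NaN
    return out.reindex(columns=columns)


# Hypothetical standard list; the older file lacks the last column
columns = ["ID Number", "Vascular or Circulatory Disease", "Endocrine Disease"]
older = pd.DataFrame([[1, 0]], columns=["ID", "Vascular (CC 104-106)"])
newer = pd.DataFrame([[2, 1, 1]], columns=["ID Number", "Vascular", "Endocrine"])

combined = pd.concat(
    [set_columns_reindex(df, columns) for df in (older, newer)], ignore_index=True
)
print(combined)
```

Like the original function, this only makes sense when the shared columns appear in the same order in every file.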