How to merge multiple .xls files with hyperlinks in Python?

I am trying to merge multiple .xls files that have many columns, one of which contains hyperlinks. I am trying to do this with Python but keep running into errors I cannot resolve.

To be clear, the hyperlinks are hidden behind display text. The following Ctrl-click hyperlink is an example of what I encounter in the .xls files: ES2866911 (T3).

To improve reproducibility, I have added samples of two files, xls1 and xls2, below.

xls1:

Title  Publication_Number
P_A    ES2866911 (T3)
P_B    EP3887362 (A1)

xls2:

Title  Publication_Number
P_C    AR118706 (A2)
P_D    ES2867600 (T3)

Desired outcome:

Title  Publication_Number
P_A    ES2866911 (T3)
P_B    EP3887362 (A1)
P_C    AR118706 (A2)
P_D    ES2867600 (T3)

I am unable to load an .xls file into Python without losing formatting or hyperlinks. In addition, I am unable to convert the .xls files to .xlsx, and I have no way of obtaining the files in .xlsx format in the first place. Below I briefly summarize what I have tried:

1.) Reading with pandas was my first attempt. This is easy to do, but all hyperlinks are lost in the DataFrame, and all formatting from the original file is lost as well.

2.) Reading .xls files with openpyxl.load_workbook

InvalidFileException: openpyxl does not support the old .xls file format, please use xlrd to read this file, or convert it to the more recent .xlsx file format.

3.) Converting .xls files to .xlsx

# Attempt with xls2xlsx
from xls2xlsx import XLS2XLSX
x2x = XLS2XLSX("input_file.xls")
wb = x2x.to_xlsx()
x2x.to_xlsx("output_file.xlsx")

TypeError: got invalid input value of type <class 'xml.etree.ElementTree.Element'>, expected string or Element

# Attempt with pyexcel
import pyexcel as p
p.save_book_as(file_name="input_file.xls", dest_file_name="export_file.xlsx")

TypeError: got invalid input value of type <class 'xml.etree.ElementTree.Element'>, expected string or Element
During handling of the above exception, another exception occurred:
StopIteration

4.) Even if we are able to read the .xls file, with xlrd for example (which means we will never be able to save the file as .xlsx), I can't even see the hyperlink:

import xlrd
wb = xlrd.open_workbook(file)  # file is the path to one of the .xls files
ws = wb.sheet_by_name('Sheet1')
ws.cell(5, 1).value
'AR118706 (A2)'  # this is the display text, not the hyperlink

5.) I tried installing an older version of openpyxl (openpyxl==3.0.1) to overcome the TypeError, without success. I tried to open the .xls file with openpyxl using the xlrd engine, and a similar TypeError mentioning 'xml.etree.ElementTree.Element' occurred. I tried many ways to batch convert the .xls files to .xlsx, all with similar errors.
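Since the attempts above hinge on which file format the libraries think they are reading, one quick check is to look at the first bytes of a file rather than its extension. The sketch below is only an illustration: the function name `sniff_excel_format` is hypothetical, and it relies on two well-known signatures (legacy .xls files are OLE2 compound files, while .xlsx files are ZIP archives). It does not by itself explain the TypeError, but an "unknown" result would suggest the file is something else (for example HTML or text saved with an .xls extension).

```python
import os
import tempfile

# Signature of an OLE2 compound file, the container used by legacy .xls files
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Signature of a ZIP local file header; .xlsx files are ZIP archives
ZIP_MAGIC = b"PK\x03\x04"


def sniff_excel_format(path):
    """Guess the container format from the first bytes of the file (hypothetical helper)."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    return "unknown"


# Demo with dummy files that only carry the signature bytes (not real workbooks)
with tempfile.TemporaryDirectory() as tmp:
    fake_xls = os.path.join(tmp, "fake.xls")
    with open(fake_xls, "wb") as fh:
        fh.write(OLE2_MAGIC + b"\x00" * 8)
    html_as_xls = os.path.join(tmp, "export.xls")
    with open(html_as_xls, "wb") as fh:
        fh.write(b"<html><table></table></html>")
    demo_results = (sniff_excel_format(fake_xls), sniff_excel_format(html_as_xls))

print(demo_results)
```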

Obviously I can just open each file with Excel and save it as .xlsx, but this defeats the entire purpose, and I can't do that for hundreds of files.

You need to use the xlrd library to read the hyperlinks properly, pandas to merge all the data together, and xlsxwriter to write the data properly. Assuming all input files have the same format, you can use the code below.

# imports
import os
import xlrd
import xlsxwriter
import pandas as pd

# required functions
def load_excel_to_df(filepath, hyperlink_col):
    book = xlrd.open_workbook(filepath)
    sheet = book.sheet_by_index(0)
    hyperlink_map = sheet.hyperlink_map
    
    data = pd.read_excel(filepath)
    hyperlink_col_index = list(data.columns).index(hyperlink_col)
    
    required_links = [v.url_or_path for k, v in hyperlink_map.items() if k[1] == hyperlink_col_index]
    data['hyperlinks'] = required_links
    return data

# main code
# set required variables
input_data_dir = 'path/to/input/data/'
hyperlink_col = 'Publication_Number'
output_data_dir = 'path/to/output/data/'
output_filename = 'combined_data.xlsx'

# read and combine data
required_files = os.listdir(input_data_dir)
frames = []
for file in required_files:
    frames.append(load_excel_to_df(os.path.join(input_data_dir, file), hyperlink_col))
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
combined_data = pd.concat(frames, sort=False, ignore_index=True)
cols = list(combined_data.columns)
m, n = combined_data.shape
hyperlink_col_index = cols.index(hyperlink_col)

# writing data
writer = pd.ExcelWriter(output_data_dir + os.sep + output_filename, engine='xlsxwriter')
combined_data[cols[:-1]].to_excel(writer, index=False, startrow=1, header=False) # last column contains hyperlinks
workbook  = writer.book
worksheet = writer.sheets[list(workbook.sheetnames.keys())[0]]
for i, col in enumerate(cols[:-1]):
    worksheet.write(0, i, col)
for i in range(m):
    worksheet.write_url(i+1, hyperlink_col_index, combined_data.loc[i, cols[-1]], string=combined_data.loc[i, hyperlink_col])
writer.close()  # ExcelWriter.save() was removed in pandas 2.0; close() writes the file
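One caveat with the loader above: the list comprehension yields one link per hyperlink found, so it only lines up with the DataFrame if every data row has exactly one hyperlink in that column. A minimal sketch of a row-aligned alternative follows, assuming (as the xlrd documentation describes) that `hyperlink_map` is keyed by zero-based `(row, col)` tuples and that row 0 is the header. The helper name `links_by_row` and the stand-in map are hypothetical.

```python
from types import SimpleNamespace


def links_by_row(hyperlink_map, col_index, n_rows, header_rows=1):
    """Return one URL (or None) per data row for the given column (hypothetical helper)."""
    links = []
    for i in range(n_rows):
        # Data row i of the DataFrame sits at sheet row i + header_rows
        link = hyperlink_map.get((i + header_rows, col_index))
        links.append(link.url_or_path if link is not None else None)
    return links


# Stand-in for sheet.hyperlink_map: only rows 1 and 3 of column 1 carry a link
fake_map = {
    (1, 1): SimpleNamespace(url_or_path="https://example.com/a"),
    (3, 1): SimpleNamespace(url_or_path="https://example.com/c"),
}
aligned = links_by_row(fake_map, col_index=1, n_rows=3)
print(aligned)
```

With this approach, rows without a link get None, so the writing loop would need to fall back to a plain `worksheet.write` for those rows instead of `write_url`.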

References:

  1. Reading hyperlinks - https://stackoverflow.com/a/7057076/17256762
  2. pandas to_excel header formatting - Remove default formatting in header when converting pandas DataFrame to excel sheet
  3. Writing hyperlinks with xlsxwriter - https://xlsxwriter.readthedocs.io/example_hyperlink.html

Without a clear reproducible example, the problem is not clear. Assume I have two files called tmp.xls and tmp2.xls containing dummy data, as in the two screenshots below.

[Screenshots of tmp.xls and tmp2.xls]

Then pandas can easily load, concatenate, and convert to .xlsx format without loss of hyperlinks. Here is some demo code and the resulting file:

import pandas as pd

f1 = pd.read_excel('tmp.xls')
f2 = pd.read_excel('tmp2.xls')

f3 = pd.concat([f1, f2], ignore_index=True)

f3.to_excel('./f3.xlsx')

[Screenshot of the resulting f3.xlsx]
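The demo hard-codes two file names, whereas the question mentions hundreds of files. A minimal sketch of gathering the inputs first is shown below; the helper name `collect_xls` is hypothetical, and the temporary directory with empty files only stands in for a real folder of workbooks.

```python
import tempfile
from pathlib import Path


def collect_xls(directory):
    """Return the .xls files in a directory, sorted by path (hypothetical helper)."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".xls"
    )


# Demo: empty placeholder files, just to show which names are picked up
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.xls", "a.XLS", "notes.txt", "combined.xlsx"):
        (Path(tmp) / name).touch()
    found = [p.name for p in collect_xls(tmp)]

print(found)
```

The resulting paths could then be fed to the same calls used in the demo, for example by building a list with `pd.read_excel(p)` for each path and passing it to `pd.concat(..., ignore_index=True)`.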

I assume the same as daedalus regarding the Excel files. Instead of pandas, I use openpyxl to read the data and create a new Excel file.

import openpyxl

wb1 = openpyxl.load_workbook('tmp.xlsx')
ws1 = wb1['Sheet1']

wb2 = openpyxl.load_workbook('tmp2.xlsx')
ws2 = wb2['Sheet1']

# Maps each key (column 1) to [hyperlink target, displayed text] from column 2
hyperlink_dict = {}

# Go through both sheets to find the hyperlinks and keys.
# Assumption: row 1 is a header row, so data starts at row 2.
for ws in (ws1, ws2):
    for row in range(2, ws.max_row + 1):
        cell = ws.cell(row=row, column=2)
        target = cell.hyperlink.target if cell.hyperlink else None
        hyperlink_dict[ws.cell(row=row, column=1).value] = [target, cell.value]

Now you have all the data, so you can create a new workbook and save the values from the dict into it via openpyxl.

wb = openpyxl.Workbook(write_only=True)
ws = wb.create_sheet()

for key, (target, text) in hyperlink_dict.items():
    # Use ws.append() to add one row per entry; this column layout is only an example
    ws.append([key, text, target])

wb.save('new_big_file.xlsx')

https://openpyxl.readthedocs.io/en/stable/optimized.html#write-only-mode

Inspired by @Kunal, I managed to write code that avoids using pandas. The .xls files are read by xlrd and written to a new Excel file by xlwt. Hyperlinks are maintained, and the output file was saved with an .xlsx file name (xlwt itself writes the legacy .xls format, so the extension may need to be changed to .xls for Excel to open the file):

import os
import xlwt
from xlrd import open_workbook

# Read and combine data
directory = "random_directory"
# Only pick up .xls inputs, so the output file is not read back in on a re-run
required_files = sorted(f for f in os.listdir(directory) if f.lower().endswith(".xls"))

# Define the new file and sheet to collect the data in
new_file = xlwt.Workbook(encoding='utf-8', style_compression=0)
new_sheet = new_file.add_sheet('Sheet1', cell_overwrite_ok=True)

# Initialize the header row; this can be done with any of the files
old_file = open_workbook(os.path.join(directory, required_files[0]), formatting_info=True)
old_sheet = old_file.sheet_by_index(0)
for column in range(old_sheet.ncols):
    new_sheet.write(0, column, old_sheet.cell(0, column).value)

# Add the rows from all files present in the folder
next_row = 1  # next free row in the output sheet (row 0 holds the header)
for file in required_files:
    old_file = open_workbook(os.path.join(directory, file), formatting_info=True)
    old_sheet = old_file.sheet_by_index(0)
    hyperlink_map = old_sheet.hyperlink_map  # maps (row, col) to hyperlink objects
    for row in range(1, old_sheet.nrows):  # all rows except the header row
        for col in range(old_sheet.ncols):
            value = old_sheet.cell(row, col).value
            link = hyperlink_map.get((row, col))
            if col == 1 and link is not None:  # column 2 is the hyperlinked column
                formula = 'HYPERLINK("{}", "{}")'.format(link.url_or_path, value)
                new_sheet.write(next_row, col, xlwt.Formula(formula))
            else:  # plain cell, or a row without a hyperlink
                new_sheet.write(next_row, col, value)
        next_row += 1

# Note: xlwt writes the legacy .xls format regardless of the extension used here
new_file.save("random_directory/output_file.xlsx")
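One detail worth guarding against in the HYPERLINK formula is a double quote inside the URL or the display text, which would end the string literal early. A minimal sketch follows, assuming the usual Excel convention that a quote inside a formula string is escaped by doubling it. The helper name `hyperlink_formula` is hypothetical, and whether xlwt.Formula parses the escaped form as expected should be checked against your own files.

```python
def hyperlink_formula(url, text):
    """Build a HYPERLINK formula string with embedded quotes doubled (hypothetical helper)."""
    def esc(value):
        # Excel formula strings escape a double quote by doubling it
        return str(value).replace('"', '""')
    return 'HYPERLINK("{}", "{}")'.format(esc(url), esc(text))


plain = hyperlink_formula("https://example.com/doc", "ES2866911 (T3)")
quoted = hyperlink_formula("https://example.com/doc", 'The "best" patent')
print(plain)
print(quoted)
```

The returned string would be used in place of the inline `.format(...)` call, that is, passed to `xlwt.Formula(...)` when writing the hyperlinked column.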
