Excel to Python Dictionary for Filtering with Openpyxl

I'm trying to read an Excel file and store selected columns in a Python dictionary for further filtering and manipulation. I consulted many existing Stack Overflow questions for pointers and managed to figure out a few things, but having no prior experience with Python makes this a real challenge. Could you please help? Below is the code I have managed to get working to some extent.

from collections import defaultdict
import openpyxl

SalesFunnel = defaultdict(list)

theFile = openpyxl.load_workbook('Report.xlsx')
allSheetNames = theFile.sheetnames

print("All sheet names {} " .format(theFile.sheetnames))

for sheet in allSheetNames:
    print("Current sheet name is {}" .format(sheet))
    currentSheet = theFile[sheet]


sfunnel = []


for row in range(1, currentSheet.max_row + 1):
    for column in "ADEF":  
        cell_name = "{}{}".format(column, row)
        SalesFunnel[row] = [cell_name, currentSheet[cell_name].value]
        SalesFunnel[row].append(SalesFunnel[row])
    print(SalesFunnel)

My Excel dataset contains duplicate emails and duplicate lead statuses. Each row contains a created date. I need to find the max and min date for each email within a lead status, so that I can compute the number of days in between for each email address. For now, though, I'm unable to read the data in the correct format. I also added a unique index column, which is simply a number sequence that uniquely identifies each row.

It would be great if I could get something like this, in a JSON-like format:

{Index: 1, LeadStatus: Contacted, Email: joe@doe.com, CreatedDate: 7/9/2020}
{Index: 2, LeadStatus: Contacted, Email: joe@doe.com, CreatedDate: 8/10/2020}
{Index: 3, LeadStatus: Contacted, Email: joe@doe.com, CreatedDate: 9/11/2020}
{Index: 4, LeadStatus: Contacted, Email: ron@email.com, CreatedDate: 4/5/2020}
{Index: 5, LeadStatus: Contacted, Email: ron@email.com, CreatedDate: 7/6/2020}

I'm also adding a screenshot of my Excel sheet. You can only see one email in it because I have thousands of records. Each email can have many records, and the lead status can be something other than Contacted.

[Image: screenshot of the Excel sheet]

You can read the data using pandas. I am assuming your data starts at cell A1 and that you are reading everything in the sheet.

import pandas as pd
data_df = pd.read_excel(sheet_path, sheet_name=sheet_name)
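If only some columns are needed (the question loops over columns A, D, E and F), `read_excel` accepts a `usecols` string of Excel column letters. The following is a minimal sketch, not the asker's real file: the workbook is built in memory, and the `Owner` and `Region` columns as well as the sheet name `Sheet1` are hypothetical placeholders for whatever sits in columns B and C.

```python
import io

import openpyxl
import pandas as pd

# Build a small stand-in workbook in memory (hypothetical layout:
# the columns of interest sit in A, D, E and F).
wb = openpyxl.Workbook()
ws = wb.active
ws.title = "Sheet1"
ws.append(["Index", "Owner", "Region", "LeadStatus", "Email", "CreatedDate"])
ws.append([1, "x", "y", "Contacted", "joe@doe.com", "7/9/2020"])
ws.append([2, "x", "y", "Contacted", "ron@email.com", "4/5/2020"])

# Save to a bytes buffer so the sketch needs no file on disk.
buffer = io.BytesIO()
wb.save(buffer)
buffer.seek(0)

# usecols takes Excel-style letters and ranges: A plus D through F.
data_df = pd.read_excel(
    buffer, sheet_name="Sheet1", usecols="A,D:F", engine="openpyxl"
)
print(data_df.columns.tolist())
```

With a real file, the buffer would simply be replaced by the path to the workbook.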

Now you can calculate the max and min dates for each group and find the difference between them:

data_df['MaxDate'] = data_df.groupby(['LeadStatus','Email'])['CreatedDate'].transform('max')
data_df['MinDate'] = data_df.groupby(['LeadStatus','Email'])['CreatedDate'].transform('min')
data_df['Difference'] = pd.to_datetime(data_df['MaxDate']) - pd.to_datetime(data_df['MinDate'])
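To see the effect on the sample rows from the question, here is a small sketch on an in-memory DataFrame. It assumes the dates arrive as month/day/year text (the question does not say which order is used), so it converts `CreatedDate` before grouping; comparing date strings directly would sort them alphabetically rather than chronologically. The `DaysBetween` column name is my own choice for illustration.

```python
import pandas as pd

# Sample rows from the question; dates are ASSUMED to be month/day/year text.
data_df = pd.DataFrame({
    "Index": [1, 2, 3, 4, 5],
    "LeadStatus": ["Contacted"] * 5,
    "Email": ["joe@doe.com"] * 3 + ["ron@email.com"] * 2,
    "CreatedDate": ["7/9/2020", "8/10/2020", "9/11/2020", "4/5/2020", "7/6/2020"],
})

# Convert first so that max/min compare real dates, not strings.
data_df["CreatedDate"] = pd.to_datetime(data_df["CreatedDate"], format="%m/%d/%Y")

# transform() keeps one value per original row, repeated within each group.
grouped = data_df.groupby(["LeadStatus", "Email"])["CreatedDate"]
data_df["MaxDate"] = grouped.transform("max")
data_df["MinDate"] = grouped.transform("min")
data_df["DaysBetween"] = (data_df["MaxDate"] - data_df["MinDate"]).dt.days

print(data_df[["Email", "MinDate", "MaxDate", "DaysBetween"]])
```

Every row keeps its own index and gains the group-level min, max and day count, which is why the values repeat for each email.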

If you don't want to repeat records, use agg:

agg_df = data_df.groupby(['LeadId', 'LeadStatus', 'Email']).agg(
    MaxDate=('CreatedDate', 'max'),
    MinDate=('CreatedDate', 'min'),
).reset_index()
agg_df['Difference'] = pd.to_datetime(agg_df['MaxDate']) - pd.to_datetime(agg_df['MinDate'])
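As a rough illustration of the aggregated form, the sketch below uses the question's sample rows and groups by `LeadStatus` and `Email` only. That is an assumption on my part: the sample data shows no `LeadId` column, and grouping by a column that is unique per row would leave one group per row. Dates are again assumed to be month/day/year.

```python
import pandas as pd

# Sample rows from the question; date order (month/day/year) is an assumption.
data_df = pd.DataFrame({
    "Index": [1, 2, 3, 4, 5],
    "LeadStatus": ["Contacted"] * 5,
    "Email": ["joe@doe.com"] * 3 + ["ron@email.com"] * 2,
    "CreatedDate": pd.to_datetime(
        ["7/9/2020", "8/10/2020", "9/11/2020", "4/5/2020", "7/6/2020"],
        format="%m/%d/%Y",
    ),
})

# One output row per (LeadStatus, Email) pair, using named aggregation.
agg_df = (
    data_df.groupby(["LeadStatus", "Email"])
    .agg(MinDate=("CreatedDate", "min"), MaxDate=("CreatedDate", "max"))
    .reset_index()
)
agg_df["DaysBetween"] = (agg_df["MaxDate"] - agg_df["MinDate"]).dt.days

print(agg_df)
```

The five input rows collapse to two, one per email within the lead status.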

You can now convert to JSON if you like:

data_df.to_json(orient='records')
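One thing to watch for: depending on the pandas version and settings, `to_json` may write datetime columns as epoch timestamps rather than readable dates. A minimal sketch of one way around that, assuming a plain `YYYY-MM-DD` string is acceptable in the output, is to format the column before exporting. The two rows here are taken from the question's sample with the dates already parsed.

```python
import json

import pandas as pd

data_df = pd.DataFrame({
    "Index": [1, 2],
    "LeadStatus": ["Contacted", "Contacted"],
    "Email": ["joe@doe.com", "ron@email.com"],
    "CreatedDate": pd.to_datetime(["2020-07-09", "2020-04-05"]),
})

# Work on a copy so the original keeps real datetime values.
out_df = data_df.copy()
out_df["CreatedDate"] = out_df["CreatedDate"].dt.strftime("%Y-%m-%d")

# orient="records" yields a JSON array with one object per row.
json_text = out_df.to_json(orient="records")

# Parse it back to check the shape: a list of dictionaries.
records = json.loads(json_text)
print(records[0])
```

The parsed result is a list of dictionaries, one per row, which matches the shape asked for in the question.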

You can also write to Excel:

with pd.ExcelWriter('..../new_doc.xlsx', engine='xlsxwriter') as writer:

    data_df.to_excel(writer, sheet_name='New Data', index=False)
    agg_df.to_excel(writer, sheet_name='Agg Data', index=False)
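If you would rather stay with openpyxl and plain dictionaries, as in the question, something along these lines may work. It is a sketch under assumptions: the workbook is built in memory instead of loading Report.xlsx, the header row is row 1, the four columns are contiguous, and the date cells hold real dates rather than text (month/day/year is assumed for the sample values).

```python
from collections import defaultdict
from datetime import datetime

import openpyxl

# Stand-in for openpyxl.load_workbook('Report.xlsx'); layout is hypothetical.
wb = openpyxl.Workbook()
ws = wb.active
ws.append(["Index", "LeadStatus", "Email", "CreatedDate"])
ws.append([1, "Contacted", "joe@doe.com", datetime(2020, 7, 9)])
ws.append([2, "Contacted", "joe@doe.com", datetime(2020, 8, 10)])
ws.append([3, "Contacted", "joe@doe.com", datetime(2020, 9, 11)])
ws.append([4, "Contacted", "ron@email.com", datetime(2020, 4, 5)])
ws.append([5, "Contacted", "ron@email.com", datetime(2020, 7, 6)])

# Use the first row as keys and turn every later row into a dictionary.
header = [cell.value for cell in ws[1]]
records = [
    dict(zip(header, row))
    for row in ws.iter_rows(min_row=2, values_only=True)
]

# Collect the dates for each (LeadStatus, Email) pair.
dates = defaultdict(list)
for rec in records:
    dates[(rec["LeadStatus"], rec["Email"])].append(rec["CreatedDate"])

# Days between the earliest and latest date in each group.
days_between = {key: (max(vals) - min(vals)).days for key, vals in dates.items()}

print(records[0])
print(days_between)
```

Here `records` is the list of row dictionaries the question asks for, and `days_between` maps each status and email pair to the number of days between its first and last record.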
