繁体   English   中英

什么是基于Pandas中的其他变量结果填充新变量的更有效方法

[英]What is a more efficent way to populate new variables based on other variable results in Pandas

我正在使用Pandas使用以其他数据变量为条件的值填充6个新变量。 整个数据集包含大约700,000行和14个变量(列),包括我新添加的变量。

我的第一种方法是使用itertuples() ,主要是在这里体验最小化。 这大约花了9600秒。

我已经设法通过使用apply()来提高效率(~3500秒)。 以下是其中一个新变量的示例。


housing_df = utils.make_data_frame("data/source_data/housing_with_child.dta", "stata")

""" Populate new variables """
# setup new variables
housing_df["hh_n"] = ""
housing_df["bio_d"] = ""
housing_df["step_d"] = ""
housing_df["child_d"] = ""
housing_df["both_bio"] = ""
housing_df["hhtype"] = ""
housing_df["step"] = ""


# hh_n
def hh_n(row):
    df = housing_df.loc[
            (housing_df["pidp"] == row["pidp"]) & (housing_df["wave"] == row["wave"])
        ]
    return str(len(df.index) + 1)

housing_df["hh_n"] = housing_df.apply(hh_n, axis=1)

每个新变量都遵循相同的模式,需要执行以下操作:

For each row
Get some of the rows existing data (eg pipd and wave)
Find rows in the whole data frame which have the same values (pidp and wave) and put these in a new dataframe
Count the rows we found in the new dataframe
Return the count value for the new variable (hh_n)

以下是输出数据的一个小例子:

例如输出

如果它是相关的,这是我在第一行中从stata文件创建数据帧的方法:

# Method that returns data frame from stata file or csv
def make_data_frame(data, type):
    if type is "stata":
        reader = pd.read_stata(data, chunksize=100000)
    if type is "csv":
        reader = pd.read_csv(data, chunksize=100000)
    df = pd.DataFrame()

    for itm in reader:
        df = df.append(itm)
    return df

这是前50行数据的示例

{'index': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 21, 22: 22, 23: 23, 24: 24, 25: 25, 26: 26, 27: 27, 28: 28, 29: 29, 30: 30, 31: 31, 32: 32, 33: 33, 34: 34, 35: 35, 36: 36, 37: 37, 38: 38, 39: 39, 40: 40, 41: 41, 42: 42, 43: 43, 44: 44, 45: 45, 46: 46, 47: 47, 48: 48, 49: 49, 50: 50, 51: 51, 52: 52, 53: 53, 54: 54, 55: 55, 56: 56, 57: 57, 58: 58, 59: 59, 60: 60, 61: 61, 62: 62, 63: 63, 64: 64, 65: 65, 66: 66, 67: 67, 68: 68, 69: 69, 70: 70, 71: 71, 72: 72, 73: 73, 74: 74, 75: 75, 76: 76, 77: 77, 78: 78, 79: 79, 80: 80, 81: 81, 82: 82, 83: 83, 84: 84, 85: 85, 86: 86, 87: 87, 88: 88, 89: 89, 90: 90, 91: 91, 92: 92, 93: 93, 94: 94, 95: 95, 96: 96, 97: 97, 98: 98, 99: 99}, 'pidp': {0: 1367, 1: 2051, 2: 2727, 3: 2727, 4: 2727, 5: 2727, 6: 2727, 7: 3407, 8: 3407, 9: 3407, 10: 3407, 11: 3407, 12: 3407, 13: 3407, 14: 3407, 15: 3407, 16: 3407, 17: 3407, 18: 3407, 19: 3407, 20: 3407, 21: 3407, 22: 3407, 23: 3407, 24: 3407, 25: 3407, 26: 3407, 27: 3407, 28: 3407, 29: 4091, 30: 4091, 31: 4091, 32: 4091, 33: 4091, 34: 4091, 35: 4091, 36: 4091, 37: 4767, 38: 4767, 39: 4767, 40: 4767, 41: 4767, 42: 4767, 43: 5451, 44: 5451, 45: 5451, 46: 5451, 47: 5451, 48: 5451, 49: 6135, 50: 6135, 51: 6135, 52: 6135, 53: 6135, 54: 6807, 55: 6807, 56: 6844, 57: 6844, 58: 6844, 59: 6844, 60: 6844, 61: 6844, 62: 6844, 63: 6844, 64: 6844, 65: 6844, 66: 6844, 67: 6844, 68: 6844, 69: 6844, 70: 6844, 71: 6844, 72: 6844, 73: 6844, 74: 6844, 75: 6844, 76: 6844, 77: 6844, 78: 6844, 79: 6844, 80: 6844, 81: 6844, 82: 6844, 83: 6844, 84: 6844, 85: 6844, 86: 6844, 87: 6844, 88: 6844, 89: 6844, 90: 6844, 91: 6844, 92: 6844, 93: 6844, 94: 6844, 95: 6844, 96: 6844, 97: 6844, 98: 6844, 99: 6844}, 'hidp': {0: 20402, 1: 20402, 2: 5799043, 3: 60159608, 4: 45594010, 5: 53475212, 6: 15449616, 7: 102002, 8: 102002, 9: 30559202, 10: 30559202, 11: 6351204, 12: 6351204, 13: 2590808, 14: 13610, 15: 6812, 16: 13614, 17: 20416, 18: 21902818, 19: 36407220, 20: 36407220, 21: 20424, 22: 20424, 23: 20426, 24: 20426, 25: 9010028, 26: 9010028, 27: 18856430, 28: 18856430, 29: 136002, 30: 136002, 31: 30566002, 32: 30566002, 33: 6358004, 34: 6358004, 35: 20406, 36: 20406, 37: 210802, 38: 210802, 39: 30579602, 40: 30579602, 41: 30579602, 42: 6371604, 43: 210802, 44: 210802, 45: 30579602, 46: 30579602, 47: 30579602, 48: 6371604, 49: 210802, 50: 210802, 51: 30579602, 52: 30579602, 53: 30579602, 54: 285602, 55: 30593202, 56: 37073606, 57: 37073606, 58: 37073606, 59: 37073606, 60: 37073606, 61: 12580008, 62: 12580008, 63: 12580008, 64: 12580008, 65: 12580008, 66: 56052408, 67: 56052408, 68: 56052408, 69: 56052408, 70: 56052408, 71: 38848410, 72: 38848410, 73: 38848410, 74: 38848410, 75: 38848410, 76: 50014012, 77: 50014012, 78: 50014012, 79: 5467216, 80: 5467216, 81: 5467216, 82: 25547618, 83: 25547618, 84: 25547618, 85: 38610420, 86: 38610420, 87: 38610420, 88: 5025224, 89: 5025224, 90: 5025224, 91: 4637626, 92: 4637626, 93: 4637626, 94: 12784028, 95: 12784028, 96: 12784028, 97: 21719230, 98: 21719230, 99: 21719230}, 'apidp': {0: 2051, 1: 1367, 2: 752052125, 3: 752052125, 4: 752052125, 5: 752052125, 6: 752052125, 7: 545740805, 8: 612001365, 9: 545740805, 10: 612001365, 11: 545740805, 12: 612001365, 13: 612001365, 14: 612001365, 15: 612001365, 16: 612001365, 17: 612001365, 18: 612001365, 19: 612001365, 20: 612001369, 21: 612001365, 22: 612001369, 23: 612001365, 24: 612001369, 25: 612001365, 26: 612001369, 27: 612001365, 28: 612001369, 29: 68002045, 30: 68002049, 31: 68002045, 32: 68002049, 33: 68002045, 34: 68002049, 35: 68002045, 36: 68002049, 37: 5451, 38: 6135, 39: 5451, 40: 6135, 41: 5298579, 42: 5451, 43: 4767, 44: 6135, 45: 4767, 46: 6135, 47: 5298579, 48: 4767, 49: 4767, 50: 5451, 51: 4767, 52: 5451, 53: 5298579, 54: 4885131, 55: 4885131, 56: 13644, 57: 748469885, 58: 748469889, 59: 749992405, 60: 749992409, 61: 13644, 62: 748469885, 63: 748469889, 64: 749992405, 65: 749992409, 66: 13644, 67: 748469885, 68: 748469889, 69: 749992405, 70: 749992409, 71: 13644, 72: 748469885, 73: 748469889, 74: 749992405, 75: 749992409, 76: 13644, 77: 748469885, 78: 748469889, 79: 13644, 80: 748469885, 81: 748469889, 82: 13644, 83: 748469885, 84: 748469889, 85: 13644, 86: 748469885, 87: 748469889, 88: 13644, 89: 748469885, 90: 748469889, 91: 13644, 92: 748469885, 93: 748469889, 94: 13644, 95: 748469885, 96: 748469889, 97: 13644, 98: 748469885, 99: 748469889}, 'wave': {0: 2.0, 1: 2.0, 2: 2.0, 3: 8.0, 4: 9.0, 5: 10.0, 6: 11.0, 7: 2.0, 8: 2.0, 9: 3.0, 10: 3.0, 11: 4.0, 12: 4.0, 13: 7.0, 14: 8.0, 15: 9.0, 16: 10.0, 17: 11.0, 18: 12.0, 19: 13.0, 20: 13.0, 21: 14.0, 22: 14.0, 23: 15.0, 24: 15.0, 25: 16.0, 26: 16.0, 27: 17.0, 28: 17.0, 29: 2.0, 30: 2.0, 31: 3.0, 32: 3.0, 33: 4.0, 34: 4.0, 35: 5.0, 36: 5.0, 37: 2.0, 38: 2.0, 39: 3.0, 40: 3.0, 41: 3.0, 42: 4.0, 43: 2.0, 44: 2.0, 45: 3.0, 46: 3.0, 47: 3.0, 48: 4.0, 49: 2.0, 50: 2.0, 51: 3.0, 52: 3.0, 53: 3.0, 54: 2.0, 55: 3.0, 56: 6.0, 57: 6.0, 58: 6.0, 59: 6.0, 60: 6.0, 61: 7.0, 62: 7.0, 63: 7.0, 64: 7.0, 65: 7.0, 66: 8.0, 67: 8.0, 68: 8.0, 69: 8.0, 70: 8.0, 71: 9.0, 72: 9.0, 73: 9.0, 74: 9.0, 75: 9.0, 76: 10.0, 77: 10.0, 78: 10.0, 79: 11.0, 80: 11.0, 81: 11.0, 82: 12.0, 83: 12.0, 84: 12.0, 85: 13.0, 86: 13.0, 87: 13.0, 88: 14.0, 89: 14.0, 90: 14.0, 91: 15.0, 92: 15.0, 93: 15.0, 94: 16.0, 95: 16.0, 96: 16.0, 97: 17.0, 98: 17.0, 99: 17.0}, 'rel': {0: 'unrelated sharer', 1: 'unrelated sharer', 2: 'natural child', 3: 'natural child', 4: 'natural child', 5: 'natural child', 6: 'natural child', 7: 'lawful spouse', 8: 'natural child', 9: 'lawful spouse', 10: 'natural child', 11: 'lawful spouse', 12: 'natural child', 13: 'natural child', 14: 'natural child', 15: 'natural child', 16: 'natural child', 17: 'natural child', 18: 'natural child', 19: 'natural child', 20: 'daughter/son-in-law', 21: 'natural child', 22: 'daughter/son-in-law', 23: 'natural child', 24: 'daughter/son-in-law', 25: 'natural child', 26: 'daughter/son-in-law', 27: 'natural child', 28: 'daughter/son-in-law', 29: 'lawful spouse', 30: 'natural child', 31: 'lawful spouse', 32: 'natural child', 33: 'lawful spouse', 34: 'natural child', 35: 'lawful spouse', 36: 'natural child', 37: 'unrelated sharer', 38: 'natural child', 39: 'other', 40: 'natural child', 41: 'daughter/son-in-law', 42: 'other', 43: 'unrelated sharer', 44: 'natural child', 45: 'other', 46: 'natural child', 47: 'daughter/son-in-law', 48: 'other', 49: 'natural parent', 50: 'natural parent', 51: 'natural parent', 52: 'natural parent', 53: 'lawful spouse', 54: 'natural brother/sister', 55: 'natural brother/sister', 56: 'natural child', 57: 'lawful spouse', 58: 'natural child', 59: 'mother/father-in-law', 60: 'mother/father-in-law', 61: 'natural child', 62: 'lawful spouse', 63: 'natural child', 64: 'mother/father-in-law', 65: 'mother/father-in-law', 66: 'natural child', 67: 'lawful spouse', 68: 'natural child', 69: 'mother/father-in-law', 70: 'mother/father-in-law', 71: 'natural child', 72: 'lawful spouse', 73: 'natural child', 74: 'mother/father-in-law', 75: 'mother/father-in-law', 76: 'natural child', 77: 'lawful spouse', 78: 'natural child', 79: 'natural child', 80: 'lawful spouse', 81: 'natural child', 82: 'natural child', 83: 'lawful spouse', 84: 'natural child', 85: 'natural child', 86: 'lawful spouse', 87: 'natural child', 88: 'natural child', 89: 'lawful spouse', 90: 'natural child', 91: 'natural child', 92: 'lawful spouse', 93: 'natural child', 94: 'natural child', 95: 'lawful spouse', 96: 'natural child', 97: 'natural child', 98: 'lawful spouse', 99: 'natural child'}, 'is_child': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '1', 9: '', 10: '1', 11: '', 12: '1', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '1', 31: '', 32: '1', 33: '', 34: '1', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '1', 57: '', 58: '1', 59: '', 60: '', 61: '1', 62: '', 63: '1', 64: '', 65: '', 66: '1', 67: '', 68: '1', 69: '', 70: '', 71: '1', 72: '', 73: '1', 74: '', 75: '', 76: '1', 77: '', 78: '1', 79: '1', 80: '', 81: '1', 82: '1', 83: '', 84: '1', 85: '1', 86: '', 87: '1', 88: '1', 89: '', 90: '1', 91: '1', 92: '', 93: '1', 94: '1', 95: '', 96: '1', 97: '1', 98: '', 99: '1'}, 'hh_n': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'bio_d': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'step_d': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'child_d': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'both_bio': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'hhtype': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'step': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}}

前50行

如上所述,考虑groupby().transform()用于任何分组的内联聚合,以将新列分配给当前数据帧。 由于OP的需求基本上是计数(即,从逻辑条件返回索引的长度或行数),因此请明确指定count聚合。

因此,下面的.apply() (每次迭代构建一个辅助数据框的隐藏循环)

def hh_n(row):
    # BOOLEAN INDEXING + LEN()
    df = housing_df.loc[
                (housing_df["pidp"] == row["pidp"]) & (housing_df["wave"] == row["wave"])
            ]
    return str(len(df.index) + 1)

housing_df["hh_n"] = housing_df.apply(hh_n, axis=1)

可以调整为:

housing_df["hh_n"] = housing_df.groupby(['pidp', 'wave']).transform('count') + 1

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM