簡體   English   中英

什么是基於Pandas中的其他變量結果填充新變量的更有效方法

[英]What is a more efficent way to populate new variables based on other variable results in Pandas

我正在使用Pandas使用以其他數據變量為條件的值填充6個新變量。 整個數據集包含大約700,000行和14個變量(列),包括我新添加的變量。

我的第一種方法是使用itertuples() ,主要是在這里體驗最小化。 這大約花了9600秒。

我已經設法通過使用apply()來提高效率(~3500秒)。 以下是其中一個新變量的示例。


housing_df = utils.make_data_frame("data/source_data/housing_with_child.dta", "stata")

""" Populate new variables """
# setup new variables
housing_df["hh_n"] = ""
housing_df["bio_d"] = ""
housing_df["step_d"] = ""
housing_df["child_d"] = ""
housing_df["both_bio"] = ""
housing_df["hhtype"] = ""
housing_df["step"] = ""


# hh_n
def hh_n(row):
    df = housing_df.loc[
            (housing_df["pidp"] == row["pidp"]) & (housing_df["wave"] == row["wave"])
        ]
    return str(len(df.index) + 1)

housing_df["hh_n"] = housing_df.apply(hh_n, axis=1)

每個新變量都遵循相同的模式,需要執行以下操作:

For each row
Get some of the rows existing data (eg pipd and wave)
Find rows in the whole data frame which have the same values (pidp and wave) and put these in a new dataframe
Count the rows we found in the new dataframe
Return the count value for the new variable (hh_n)

以下是輸出數據的一個小例子:

例如輸出

如果它是相關的,這是我在第一行中從stata文件創建數據幀的方法:

# Method that returns data frame from stata file or csv
def make_data_frame(data, type):
    if type is "stata":
        reader = pd.read_stata(data, chunksize=100000)
    if type is "csv":
        reader = pd.read_csv(data, chunksize=100000)
    df = pd.DataFrame()

    for itm in reader:
        df = df.append(itm)
    return df

這是前50行數據的示例

{'index': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 21, 22: 22, 23: 23, 24: 24, 25: 25, 26: 26, 27: 27, 28: 28, 29: 29, 30: 30, 31: 31, 32: 32, 33: 33, 34: 34, 35: 35, 36: 36, 37: 37, 38: 38, 39: 39, 40: 40, 41: 41, 42: 42, 43: 43, 44: 44, 45: 45, 46: 46, 47: 47, 48: 48, 49: 49, 50: 50, 51: 51, 52: 52, 53: 53, 54: 54, 55: 55, 56: 56, 57: 57, 58: 58, 59: 59, 60: 60, 61: 61, 62: 62, 63: 63, 64: 64, 65: 65, 66: 66, 67: 67, 68: 68, 69: 69, 70: 70, 71: 71, 72: 72, 73: 73, 74: 74, 75: 75, 76: 76, 77: 77, 78: 78, 79: 79, 80: 80, 81: 81, 82: 82, 83: 83, 84: 84, 85: 85, 86: 86, 87: 87, 88: 88, 89: 89, 90: 90, 91: 91, 92: 92, 93: 93, 94: 94, 95: 95, 96: 96, 97: 97, 98: 98, 99: 99}, 'pidp': {0: 1367, 1: 2051, 2: 2727, 3: 2727, 4: 2727, 5: 2727, 6: 2727, 7: 3407, 8: 3407, 9: 3407, 10: 3407, 11: 3407, 12: 3407, 13: 3407, 14: 3407, 15: 3407, 16: 3407, 17: 3407, 18: 3407, 19: 3407, 20: 3407, 21: 3407, 22: 3407, 23: 3407, 24: 3407, 25: 3407, 26: 3407, 27: 3407, 28: 3407, 29: 4091, 30: 4091, 31: 4091, 32: 4091, 33: 4091, 34: 4091, 35: 4091, 36: 4091, 37: 4767, 38: 4767, 39: 4767, 40: 4767, 41: 4767, 42: 4767, 43: 5451, 44: 5451, 45: 5451, 46: 5451, 47: 5451, 48: 5451, 49: 6135, 50: 6135, 51: 6135, 52: 6135, 53: 6135, 54: 6807, 55: 6807, 56: 6844, 57: 6844, 58: 6844, 59: 6844, 60: 6844, 61: 6844, 62: 6844, 63: 6844, 64: 6844, 65: 6844, 66: 6844, 67: 6844, 68: 6844, 69: 6844, 70: 6844, 71: 6844, 72: 6844, 73: 6844, 74: 6844, 75: 6844, 76: 6844, 77: 6844, 78: 6844, 79: 6844, 80: 6844, 81: 6844, 82: 6844, 83: 6844, 84: 6844, 85: 6844, 86: 6844, 87: 6844, 88: 6844, 89: 6844, 90: 6844, 91: 6844, 92: 6844, 93: 6844, 94: 6844, 95: 6844, 96: 6844, 97: 6844, 98: 6844, 99: 6844}, 'hidp': {0: 20402, 1: 20402, 2: 5799043, 3: 60159608, 4: 45594010, 5: 53475212, 6: 15449616, 7: 102002, 8: 102002, 9: 30559202, 10: 30559202, 11: 6351204, 12: 6351204, 13: 2590808, 14: 13610, 15: 6812, 16: 13614, 17: 20416, 18: 21902818, 19: 36407220, 20: 36407220, 21: 20424, 22: 20424, 23: 20426, 24: 20426, 25: 9010028, 26: 9010028, 27: 18856430, 28: 18856430, 29: 136002, 30: 136002, 31: 30566002, 32: 30566002, 33: 6358004, 34: 6358004, 35: 20406, 36: 20406, 37: 210802, 38: 210802, 39: 30579602, 40: 30579602, 41: 30579602, 42: 6371604, 43: 210802, 44: 210802, 45: 30579602, 46: 30579602, 47: 30579602, 48: 6371604, 49: 210802, 50: 210802, 51: 30579602, 52: 30579602, 53: 30579602, 54: 285602, 55: 30593202, 56: 37073606, 57: 37073606, 58: 37073606, 59: 37073606, 60: 37073606, 61: 12580008, 62: 12580008, 63: 12580008, 64: 12580008, 65: 12580008, 66: 56052408, 67: 56052408, 68: 56052408, 69: 56052408, 70: 56052408, 71: 38848410, 72: 38848410, 73: 38848410, 74: 38848410, 75: 38848410, 76: 50014012, 77: 50014012, 78: 50014012, 79: 5467216, 80: 5467216, 81: 5467216, 82: 25547618, 83: 25547618, 84: 25547618, 85: 38610420, 86: 38610420, 87: 38610420, 88: 5025224, 89: 5025224, 90: 5025224, 91: 4637626, 92: 4637626, 93: 4637626, 94: 12784028, 95: 12784028, 96: 12784028, 97: 21719230, 98: 21719230, 99: 21719230}, 'apidp': {0: 2051, 1: 1367, 2: 752052125, 3: 752052125, 4: 752052125, 5: 752052125, 6: 752052125, 7: 545740805, 8: 612001365, 9: 545740805, 10: 612001365, 11: 545740805, 12: 612001365, 13: 612001365, 14: 612001365, 15: 612001365, 16: 612001365, 17: 612001365, 18: 612001365, 19: 612001365, 20: 612001369, 21: 612001365, 22: 612001369, 23: 612001365, 24: 612001369, 25: 612001365, 26: 612001369, 27: 612001365, 28: 612001369, 29: 68002045, 30: 68002049, 31: 68002045, 32: 68002049, 33: 68002045, 34: 68002049, 35: 68002045, 36: 68002049, 37: 5451, 38: 6135, 39: 5451, 40: 6135, 41: 5298579, 42: 5451, 43: 4767, 44: 6135, 45: 4767, 46: 6135, 47: 5298579, 48: 4767, 49: 4767, 50: 5451, 51: 4767, 52: 5451, 53: 5298579, 54: 4885131, 55: 4885131, 56: 13644, 57: 748469885, 58: 748469889, 59: 749992405, 60: 749992409, 61: 13644, 62: 748469885, 63: 748469889, 64: 749992405, 65: 749992409, 66: 13644, 67: 748469885, 68: 748469889, 69: 749992405, 70: 749992409, 71: 13644, 72: 748469885, 73: 748469889, 74: 749992405, 75: 749992409, 76: 13644, 77: 748469885, 78: 748469889, 79: 13644, 80: 748469885, 81: 748469889, 82: 13644, 83: 748469885, 84: 748469889, 85: 13644, 86: 748469885, 87: 748469889, 88: 13644, 89: 748469885, 90: 748469889, 91: 13644, 92: 748469885, 93: 748469889, 94: 13644, 95: 748469885, 96: 748469889, 97: 13644, 98: 748469885, 99: 748469889}, 'wave': {0: 2.0, 1: 2.0, 2: 2.0, 3: 8.0, 4: 9.0, 5: 10.0, 6: 11.0, 7: 2.0, 8: 2.0, 9: 3.0, 10: 3.0, 11: 4.0, 12: 4.0, 13: 7.0, 14: 8.0, 15: 9.0, 16: 10.0, 17: 11.0, 18: 12.0, 19: 13.0, 20: 13.0, 21: 14.0, 22: 14.0, 23: 15.0, 24: 15.0, 25: 16.0, 26: 16.0, 27: 17.0, 28: 17.0, 29: 2.0, 30: 2.0, 31: 3.0, 32: 3.0, 33: 4.0, 34: 4.0, 35: 5.0, 36: 5.0, 37: 2.0, 38: 2.0, 39: 3.0, 40: 3.0, 41: 3.0, 42: 4.0, 43: 2.0, 44: 2.0, 45: 3.0, 46: 3.0, 47: 3.0, 48: 4.0, 49: 2.0, 50: 2.0, 51: 3.0, 52: 3.0, 53: 3.0, 54: 2.0, 55: 3.0, 56: 6.0, 57: 6.0, 58: 6.0, 59: 6.0, 60: 6.0, 61: 7.0, 62: 7.0, 63: 7.0, 64: 7.0, 65: 7.0, 66: 8.0, 67: 8.0, 68: 8.0, 69: 8.0, 70: 8.0, 71: 9.0, 72: 9.0, 73: 9.0, 74: 9.0, 75: 9.0, 76: 10.0, 77: 10.0, 78: 10.0, 79: 11.0, 80: 11.0, 81: 11.0, 82: 12.0, 83: 12.0, 84: 12.0, 85: 13.0, 86: 13.0, 87: 13.0, 88: 14.0, 89: 14.0, 90: 14.0, 91: 15.0, 92: 15.0, 93: 15.0, 94: 16.0, 95: 16.0, 96: 16.0, 97: 17.0, 98: 17.0, 99: 17.0}, 'rel': {0: 'unrelated sharer', 1: 'unrelated sharer', 2: 'natural child', 3: 'natural child', 4: 'natural child', 5: 'natural child', 6: 'natural child', 7: 'lawful spouse', 8: 'natural child', 9: 'lawful spouse', 10: 'natural child', 11: 'lawful spouse', 12: 'natural child', 13: 'natural child', 14: 'natural child', 15: 'natural child', 16: 'natural child', 17: 'natural child', 18: 'natural child', 19: 'natural child', 20: 'daughter/son-in-law', 21: 'natural child', 22: 'daughter/son-in-law', 23: 'natural child', 24: 'daughter/son-in-law', 25: 'natural child', 26: 'daughter/son-in-law', 27: 'natural child', 28: 'daughter/son-in-law', 29: 'lawful spouse', 30: 'natural child', 31: 'lawful spouse', 32: 'natural child', 33: 'lawful spouse', 34: 'natural child', 35: 'lawful spouse', 36: 'natural child', 37: 'unrelated sharer', 38: 'natural child', 39: 'other', 40: 'natural child', 41: 'daughter/son-in-law', 42: 'other', 43: 'unrelated sharer', 44: 'natural child', 45: 'other', 46: 'natural child', 47: 'daughter/son-in-law', 48: 'other', 49: 'natural parent', 50: 'natural parent', 51: 'natural parent', 52: 'natural parent', 53: 'lawful spouse', 54: 'natural brother/sister', 55: 'natural brother/sister', 56: 'natural child', 57: 'lawful spouse', 58: 'natural child', 59: 'mother/father-in-law', 60: 'mother/father-in-law', 61: 'natural child', 62: 'lawful spouse', 63: 'natural child', 64: 'mother/father-in-law', 65: 'mother/father-in-law', 66: 'natural child', 67: 'lawful spouse', 68: 'natural child', 69: 'mother/father-in-law', 70: 'mother/father-in-law', 71: 'natural child', 72: 'lawful spouse', 73: 'natural child', 74: 'mother/father-in-law', 75: 'mother/father-in-law', 76: 'natural child', 77: 'lawful spouse', 78: 'natural child', 79: 'natural child', 80: 'lawful spouse', 81: 'natural child', 82: 'natural child', 83: 'lawful spouse', 84: 'natural child', 85: 'natural child', 86: 'lawful spouse', 87: 'natural child', 88: 'natural child', 89: 'lawful spouse', 90: 'natural child', 91: 'natural child', 92: 'lawful spouse', 93: 'natural child', 94: 'natural child', 95: 'lawful spouse', 96: 'natural child', 97: 'natural child', 98: 'lawful spouse', 99: 'natural child'}, 'is_child': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '1', 9: '', 10: '1', 11: '', 12: '1', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '1', 31: '', 32: '1', 33: '', 34: '1', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '1', 57: '', 58: '1', 59: '', 60: '', 61: '1', 62: '', 63: '1', 64: '', 65: '', 66: '1', 67: '', 68: '1', 69: '', 70: '', 71: '1', 72: '', 73: '1', 74: '', 75: '', 76: '1', 77: '', 78: '1', 79: '1', 80: '', 81: '1', 82: '1', 83: '', 84: '1', 85: '1', 86: '', 87: '1', 88: '1', 89: '', 90: '1', 91: '1', 92: '', 93: '1', 94: '1', 95: '', 96: '1', 97: '1', 98: '', 99: '1'}, 'hh_n': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'bio_d': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'step_d': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'child_d': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'both_bio': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'hhtype': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'step': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}}

前50行

如上所述,考慮groupby().transform()用於任何分組的內聯聚合,以將新列分配給當前數據幀。 由於OP的需求基本上是計數(即,從邏輯條件返回索引的長度或行數),因此請明確指定count聚合。

因此,下面的.apply() (每次迭代構建一個輔助數據框的隱藏循環)

def hh_n(row):
    # BOOLEAN INDEXING + LEN()
    df = housing_df.loc[
                (housing_df["pidp"] == row["pidp"]) & (housing_df["wave"] == row["wave"])
            ]
    return str(len(df.index) + 1)

housing_df["hh_n"] = housing_df.apply(hh_n, axis=1)

可以調整為:

housing_df["hh_n"] = housing_df.groupby(['pidp', 'wave']).transform('count') + 1

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM