Python Dataframe Chunk Column Indexing Incorrectly

I am learning DataFrame chunking. My pseudocode is simple:

  1. Break down the SOURCE_FILE into a number of chunks
  2. Load a chunk (with a loop)
  3. Add a column with a predicted label and another with the confidence
  4. Write the chunk to the drive
  5. Continue the loop

The first chunk was saved as expected, but the new columns in the rest of the chunks have incorrect row indexes. I cannot figure out why this is happening and would appreciate any help.

Also, is my chunking pseudocode correct? I am not sure this is the right way to do it.

# create chunks
for chunk in pd.read_csv(SOURCE_FILE, chunksize = CHUNK_SIZE):
    print('BATCH:', BATCH_NUMBER)
    
    # machine translate
    for row_index, text in enumerate(chunk.title):
        print('Text:', text)
        print('Row Index:', row_index)
        (label, confidence) = MODEL.predict(text)
        label = label[0]
        confidence = confidence[0]
        chunk.loc[row_index, 'Language'] = label[9:]
        chunk.loc[row_index, 'Confidence'] = confidence

    chunk.to_csv('Chunks/chunk' + str(BATCH_NUMBER) + '.csv' , index = False)
    BATCH_NUMBER += 1

You can see an image of the incorrect row indexing here.

This is a little late, but I wanted to answer this in case anyone has the same problem in the future. Your loop writes the two new columns with chunk.loc[row_index, ...], where row_index comes from enumerate and so always runs from 0 to len(chunk) - 1. However, .loc selects rows by index label, not by position. The index of your first chunk runs from 0 to CHUNK_SIZE - 1, so it matches the enumerate positions, but the rest of the chunks keep the labels from their place in the source file (the second chunk starts at CHUNK_SIZE, and so on), and therefore the new values are not lined up with the existing rows. Reset the index of each chunk before the inner loop using:

chunk.index = range(len(chunk))

And it should work.
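As a rough illustration of why the labels drift and how resetting them helps, here is a minimal sketch. It makes several assumptions that are not in the question: the CSV is a tiny in-memory sample rather than SOURCE_FILE, `fake_predict` is a hypothetical stand-in for MODEL.predict, and the `__label__` prefix is only a guess at why the question slices with `[9:]`. The sketch also builds each new column as a list and assigns it in one step, which is a substitution for the per-row .loc writes rather than part of the fix above; whole-column assignment from a list is positional, so it does not depend on the index labels at all.

```python
import io
import pandas as pd

# Hypothetical stand-in for MODEL.predict from the question. It returns
# (labels, confidences) in the shape the question's code unpacks.
def fake_predict(text):
    return ["__label__en"], [0.9]

# Tiny in-memory CSV used in place of SOURCE_FILE (illustrative data only).
SOURCE = io.StringIO("id,title\n1,hello\n2,world\n3,bonjour\n4,hola\n5,ciao\n")
CHUNK_SIZE = 2

original_labels = []  # index labels of each chunk as read_csv yields them
processed = []        # chunks after the fix

for chunk in pd.read_csv(SOURCE, chunksize=CHUNK_SIZE):
    # The labels continue across chunks instead of restarting at 0.
    original_labels.append(list(chunk.index))

    # Fix from the answer: relabel the rows 0..len(chunk)-1.
    # Equivalent to chunk.index = range(len(chunk)).
    chunk = chunk.reset_index(drop=True)

    # Collect predictions, then assign each column in one step.
    languages, confidences = [], []
    for text in chunk["title"]:
        labels, scores = fake_predict(text)
        languages.append(labels[0][9:])  # drop the assumed "__label__" prefix
        confidences.append(scores[0])
    chunk["Language"] = languages
    chunk["Confidence"] = confidences

    processed.append(chunk)

print(original_labels)  # [[0, 1], [2, 3], [4]]
```

In the real loop, each processed chunk would be written out with to_csv, as in the question, instead of being appended to a list.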
