
Python - Chunk list into dataframe after processing

I'm using a chunk function to pre-process my data for ML because my data is fairly large.

After processing, I'm trying to add the processed data back into the original data frame as a new column, 'chunk'. This gives me a memory error, so I'm trying to load one chunk at a time into the dataframe instead, but I still get a memory error:

MemoryError: Unable to allocate array with shape (414, 100, 32765) and data type float64
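(For scale, an array of that shape and dtype needs roughly 10.9 GB, which explains why the allocation fails on a typical desktop; a quick sanity check:)

```python
import numpy as np

# Total bytes for a float64 array of the reported shape (414, 100, 32765).
nbytes = 414 * 100 * 32765 * np.dtype(np.float64).itemsize
print(f'{nbytes / 1e9:.1f} GB')  # prints "10.9 GB"
```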

Here's my data:

 Antibiotic  ...                                             Genome
0       isoniazid  ...  ccctgacacatcacggcgcctgaccgacgagcagaagatccagctc...
1       isoniazid  ...  gggggtgctggcggggccggcgccgataaccccaccggcatcggcg...
2       isoniazid  ...  aatcacaccccgcgcgattgctagcatcctcggacacactgcacgc...
3       isoniazid  ...  gttgttgttgccgagattcgcaatgcccaggttgttgttgccgaga...
4       isoniazid  ...  ttgaccgatgaccccggttcaggcttcaccacagtgtggaacgcgg...

Here's my current code:

import pandas as pd
import numpy as np

lookup = {
  'a': 0.25,
  'g': 0.50,
  'c': 0.75,
  't': 1.00,
  'A': 0.25,
  'G': 0.50,
  'C': 0.75,
  'T': 1.00
  # any other character (e.g. 'z') maps to 0.0 in preprocess()
}


dfpath = 'C:\\Users\\CAAVR\\Desktop\\Ison.csv'
# chunksize=100 makes read_csv return an iterator of 100-row DataFrames
dataframe = pd.read_csv(dfpath, chunksize=100)

chunk_list = []
def preprocess(chunk):
  # Map each base in the Genome string to its numeric value (0.0 if unknown)
  processed_chunk = chunk['Genome'].apply(
      lambda bps: pd.Series([lookup.get(bp, 0.0) for bp in bps.lower()])).values
  return processed_chunk


for chunk in dataframe:
  chunk_filter = preprocess(chunk)
  chunk_list.append(chunk_filter)

# Converting the accumulated list into one big float64 array is the
# allocation that raises the MemoryError
chunk_array = np.asarray(chunk_list)

for chunk in chunk_array:
  # note: dataframe here is the exhausted TextFileReader iterator,
  # not a DataFrame
  dataframe1 = dataframe.copy()
  dataframe1["Chunk"] = chunk_array


dataframe1.to_csv(r'C:\Users\CAAVR\Desktop\chunk.csv')

If you need any more info, let me know. Thanks!

Instead of combining all the chunks in memory, which just takes you back to the problem of running out of memory, I would suggest writing each chunk out separately.

If you open a file in append mode ( f = open('out.csv', 'a') ), you can call dataframe.to_csv(f) multiple times. The first call should write the column headers; later calls should use dataframe.to_csv(f, header=False) since the headers have already been written.
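A minimal sketch of that pattern (using a tiny synthetic frame in place of the real Ison.csv, with hypothetical values):

```python
import pandas as pd

# Tiny synthetic stand-in for the real Ison.csv (hypothetical values).
pd.DataFrame({
    'Antibiotic': ['isoniazid'] * 6,
    'Genome': ['ccct', 'gggg', 'aatc', 'gttg', 'ttga', 'acgt'],
}).to_csv('Ison.csv', index=False)

# Open the output once; each processed chunk is appended to it,
# so only one 100-row (here 2-row) chunk is ever held in memory.
with open('chunk.csv', 'w', newline='') as f:
    for i, chunk in enumerate(pd.read_csv('Ison.csv', chunksize=2)):
        # ... preprocess the chunk here ...
        # Write the column headers only once, for the first chunk.
        chunk.to_csv(f, header=(i == 0), index=False)
```

Passing the open file handle to to_csv (rather than a path) is what lets every chunk land in the same file; newline='' avoids blank lines between rows on Windows.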
