
Efficient and scalable way to iterate over a DataFrame and write to text files row by row, with a huge amount of text data

I have a large DataFrame in which each row contains a large amount of text data. I am trying to partition this DataFrame on one of its columns (column 11) and then write the rows out to multiple files:

import pandas as pd

partitioncount = 5
trainingDataFile = 'sometrainingDatFileWithHugeTextDataInEachColumn.tsv'
df = pd.read_table(trainingDataFile, sep='\t', header=None, encoding='utf-8')

# prepare output files and keep them open to append the dataframe rows
outputfiles = {}
filename = r"C:\Input_Partition"
for i in range(partitioncount):
    outputfiles[i] = open(filename + "_%s.tsv" % (i,), "a")

# Loop through the dataframe and write to buckets/files
for index, row in df.iterrows():
    # partition on a hash function
    partition = hash(row[11]) % partitioncount
    outputfiles[partition].write("\t".join([str(num) for num in df.iloc[index].values]) + "\n")

This code results in the following error:

IndexError                                Traceback (most recent call last)
<ipython-input> in <module>()
---> 73 outputfiles[partition].write("\t".join([str(num) for num in df.iloc[index].values]) + "\n")

c:\python27\lib\site-packages\pandas\core\indexing.pyc in __getitem__(self, key)
   1326         else:
   1327             key = com._apply_if_callable(key, self.obj)
-> 1328         return self._getitem_axis(key, axis=0)

c:\python27\lib\site-packages\pandas\core\indexing.pyc in _getitem_axis(self, key, axis)
   1747
   1748         # validate the location
-> 1749         self._is_valid_integer(key, axis)
   1750
   1751         return self._get_loc(key, axis=axis)

c:\python27\lib\site-packages\pandas\core\indexing.pyc in _is_valid_integer(self, key, axis)
   1636         l = len(ax)
   1637         if key >= l or key < -l:
-> 1638             raise IndexError("single positional indexer is out-of-bounds")
   1639         return True

IndexError: single positional indexer is out-of-bounds

What is the most efficient and scalable way to do this, i.e., iterate over the DataFrame's rows, perform some operations on each row (not shown in the code above and irrelevant to the problem at hand), and finally write each row (with its large amount of text data) to a text file?

Appreciate any help!

IIUC you can do it this way:

filename = r'/path/to/output_{}.csv'

df.groupby(df.iloc[:, 11].map(hash) % partitioncount) \
  .apply(lambda g: g.to_csv(filename.format(g.name), sep='\t', index=False))
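As for the IndexError itself: it likely comes from re-indexing with `df.iloc[index]`. `iterrows()` yields index *labels*, while `.iloc` expects *positions*, so the lookup goes out of bounds whenever the DataFrame's index is not a clean 0..n-1 range (for example, after rows have been filtered out). Since `row` already holds the row's data, the positional lookup is unnecessary. A minimal sketch with a hypothetical toy frame standing in for the real TSV data:

```python
import pandas as pd

# Hypothetical stand-in for the real TSV data.
df = pd.DataFrame({0: ["a", "b", "c", "d"], 1: ["w", "x", "y", "z"]})

# Simulate a prior filtering step: the surviving index labels are 1 and 3,
# but len(df) == 2, so df.iloc[3] would raise
# "single positional indexer is out-of-bounds".
df = df[df[0].isin(["b", "d"])]

lines = []
for index, row in df.iterrows():
    # `row` already contains this row's values; no need for df.iloc[index].
    lines.append("\t".join(str(v) for v in row.values))

print(lines)  # ['b\tx', 'd\tz']
```

That said, the `groupby`/`to_csv` approach above avoids the Python-level loop entirely and writes each partition in a single vectorized call, which scales much better for large frames.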
