
Efficient and scalable way to iterate over a DataFrame and write to text files row by row, with a huge amount of text data

I have a large DataFrame in which each row contains a large amount of text data. I am trying to partition this DataFrame on one of its columns (column 11) and then write the rows out to multiple files:

import pandas as pd

partitioncount = 5
trainingDataFile = 'sometrainingDatFileWithHugeTextDataInEachColumn.tsv'
df = pd.read_table(trainingDataFile, sep='\t', header=None, encoding='utf-8')

# prepare output files and keep them open to append the dataframe rows
outputfiles = {}
filename = r"C:\Input_Partition"
for i in range(partitioncount):
    outputfiles[i] = open(filename + "_%s.tsv" % (i,), "a")

# Loop through the dataframe and write to buckets/files
for index, row in df.iterrows():
    # partition on a hash function
    partition = hash(row[11]) % partitioncount
    outputfiles[partition].write("\t".join([str(num) for num in df.iloc[index].values]) + "\n")

This code results in the following error:

IndexError                                Traceback (most recent call last)
<ipython-input> in <module>()
---> 73 outputfiles[partition].write("\t".join([str(num) for num in df.iloc[index].values]) + "\n")

c:\python27\lib\site-packages\pandas\core\indexing.pyc in __getitem__(self, key)
   1326         else:
   1327             key = com._apply_if_callable(key, self.obj)
-> 1328         return self._getitem_axis(key, axis=0)

c:\python27\lib\site-packages\pandas\core\indexing.pyc in _getitem_axis(self, key, axis)
   1747
   1748         # validate the location
-> 1749         self._is_valid_integer(key, axis)
   1750
   1751         return self._get_loc(key, axis=axis)

c:\python27\lib\site-packages\pandas\core\indexing.pyc in _is_valid_integer(self, key, axis)
   1636         l = len(ax)
   1637         if key >= l or key < -l:
-> 1638             raise IndexError("single positional indexer is out-of-bounds")
   1639         return True

IndexError: single positional indexer is out-of-bounds

What is the most efficient and scalable way to do this, i.e., iterate over the DataFrame's rows, perform some operations on each row (not shown in the code above and irrelevant to the problem at hand), and finally write each row (with its large amount of text data) to a text file?

Appreciate any help!

IIUC you can do it this way:

filename = r'/path/to/output_{}.csv'

df.groupby(df.iloc[:, 11].map(hash) % partitioncount) \
  .apply(lambda g: g.to_csv(filename.format(g.name), sep='\t', index=False))
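As for the IndexError itself: it likely comes from re-indexing with `df.iloc[index]`. `iterrows()` yields index *labels*, while `.iloc` expects *positions*, so the lookup goes out of bounds whenever the DataFrame's index is not a clean 0..n-1 range (for example, after rows have been filtered out). Since `row` already holds the row's data, the positional lookup is unnecessary. A minimal sketch with a hypothetical toy frame standing in for the real TSV data:

```python
import pandas as pd

# Hypothetical stand-in for the real TSV data.
df = pd.DataFrame({0: ["a", "b", "c", "d"], 1: ["w", "x", "y", "z"]})

# Simulate a prior filtering step: the surviving index labels are 1 and 3,
# but len(df) == 2, so df.iloc[3] would raise
# "single positional indexer is out-of-bounds".
df = df[df[0].isin(["b", "d"])]

lines = []
for index, row in df.iterrows():
    # `row` already contains this row's values; no need for df.iloc[index].
    lines.append("\t".join(str(v) for v in row.values))

print(lines)  # ['b\tx', 'd\tz']
```

That said, the `groupby`/`to_csv` approach above avoids the Python-level loop entirely and writes each partition in a single vectorized call, which scales much better for large frames.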
