
Concatenate all rows of a large csv

So I have a large csv file (5 million rows) with multiple columns. Of particular interest to me is a column which contains text.

The input csv is of the following format:

system_id, member_name, message, is_post
0157e407, member1011, "I have had problems with my lungs for years now. It all started with an infection...", false
1915d457, member1055, "Looks like a lot of people take Paracetamol for managing pain and....", false

The column 'message' contains text and is of interest.

Now the task is to concatenate all the rows of this column into one single large text, and then compute n-grams (n=1,2,3,4,5) on it. The output should be 5 different files corresponding to the n-grams, in the following format, for example:

bigram.csv

n-gram, count
"word1 word2", 7
"word1 word3", 11

trigram.csv

n-gram, count
"word1 word2 word3", 22
"word1 word2 word4", 24

Here is what I have tried so far:

from collections import OrderedDict
import csv
import re
import sys

import nltk


if __name__ == '__main__':
    if len(sys.argv) < 2:
        print "%d Arguments Given : Exiting..." % (len(sys.argv)-1)
        print "Usage: python %s <inp_file_path>" % sys.argv[0]
        exit(1)
    ifpath = sys.argv[1]
    with open(ifpath, 'r') as ifp:
        reader = csv.DictReader(ifp)
        all_msgs = []
        fieldnames = reader.fieldnames
        processed_rows = []
        for row in reader:
            msg = row['message']
            res = {'message': msg}
            txt = msg.decode('ascii', 'ignore')
            # some preprocessing
            txt = re.sub(r'[\.]{2,}', r". ", txt)
            txt = re.sub(r'([\.,;!?])([A-Z])', r'\1 \2', txt)
            sentences = nltk.tokenize.sent_tokenize(txt.strip())
            all_msgs.append(' '.join(sentences))
    text = ' '.join(all_msgs)

    tokens = nltk.word_tokenize(text)
    tokens = [token.lower() for token in tokens if len(token) > 1]
    bi_tokens = list(nltk.bigrams(tokens))
    tri_tokens = list(nltk.trigrams(tokens))
    bigrms = []
    for item in sorted(set(bi_tokens)):
        bb = OrderedDict()
        bb['bigrams'] = ' '.join(item)
        bb['count'] = bi_tokens.count(item)
        bigrms.append(bb)

    trigrms = []
    for item in sorted(set(tri_tokens)):
        tt = OrderedDict()
        tt['trigrams'] = ' '.join(item)
        tt['count'] = tri_tokens.count(item)
        trigrms.append(tt)

    with open('bigrams.csv', 'w') as ofp2:
        header = ['bigrams', 'count']
        dict_writer = csv.DictWriter(ofp2, header)
        dict_writer.writeheader()
        dict_writer.writerows(bigrms)

    with open('trigrams.csv', 'w') as ofp3:
        header = ['trigrams', 'count']
        dict_writer = csv.DictWriter(ofp3, header)
        dict_writer.writeheader()
        dict_writer.writerows(trigrms)

    tokens = nltk.word_tokenize(text)
    fourgrams = nltk.collocations.QuadgramCollocationFinder.from_words(tokens)
    quadgrams = []
    for fourgram, freq in fourgrams.ngram_fd.items():
        dd = OrderedDict()
        dd['quadgram'] = " ".join(fourgram)
        dd['count'] = freq
        quadgrams.append(dd)
    with open('quadgram.csv', 'w') as ofp4:
        header = ['quadgram', 'count']
        dict_writer = csv.DictWriter(ofp4, header)
        dict_writer.writeheader()
        dict_writer.writerows(quadgrams)

This has been running for the past 2 days on a 4-core machine. How can I make this more efficient (using pandas and/or multiprocessing, perhaps) and speed it up as reasonably as possible?

I would make a few changes:

Find the bottleneck

What portion is taking so long?

  • Reading the CSV
  • Tokenizing
  • Making the n-grams
  • Counting the n-grams
  • Writing to disk

So the first thing I would do is make a cleaner separation between the different steps, and ideally make it possible to restart halfway.
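A minimal sketch of that idea (the run_step helper and its arguments are just illustrative, not part of the original code): each step writes its result to an intermediate file, a rerun skips steps whose output already exists, and the printed timings tell you which stage is the actual bottleneck.

import os
import time

def run_step(name, func, output_file, *args):
    # skip a step whose output already exists, so a crashed run can resume here
    if os.path.exists(output_file):
        print('skipping', name, '-', output_file, 'already exists')
        return
    start = time.time()
    func(output_file, *args)
    print(name, 'took', round(time.time() - start, 1), 'seconds')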

Reading the text 阅读文字

I would extract this to a different method. And from what I read (here, for example), pandas reads csv files a lot quicker than the csv module. If reading the csv only takes 1 minute of the 2 days, this might not be an issue, but I would do something like this:

import pandas as pd
import nltk

def read_text(filename):  # you could add **kwargs to pass on to read_csv
    df = pd.read_csv(filename)  # add info on file encoding etc.
    message = df['message'].str.replace(r'[\.]{2,}', r". ")  # http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html
    message = message.str.replace(r'([\.,;!?])([A-Z])', r'\1 \2')

    message = message.str.strip()
    sentences = message.apply(nltk.tokenize.sent_tokenize)
    return ' '.join(sentences.apply(' '.join))

You can even do this in chunks, and yield the sentences instead of returning them to make it a generator, potentially saving on memory.
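A possible shape for that, as a sketch (the read_text_chunked name and the chunk size are arbitrary; regex=True is passed explicitly because newer pandas versions no longer treat the pattern as a regex by default):

import pandas as pd

def read_text_chunked(filename, chunksize=100000):
    # yield the cleaned-up text of one chunk at a time instead of
    # holding all 5 million messages in memory at once
    for chunk in pd.read_csv(filename, chunksize=chunksize):
        message = chunk['message'].str.replace(r'[\.]{2,}', r'. ', regex=True)
        message = message.str.replace(r'([\.,;!?])([A-Z])', r'\1 \2', regex=True)
        yield ' '.join(message.str.strip())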

Is there a specific reason you join the sentences after the sent_tokenize? Because I find this in the documentation:

The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank. This is the method that is invoked by word_tokenize(). It assumes that the text has already been segmented into sentences, eg using sent_tokenize().

So you would call it like this:

text = read_text(csv_file)
with open(text_file, 'w') as file:
    file.write(text)
print('finished reading text from file') # or use logging

Tokenizing

stays roughly the same

tokens = nltk.word_tokenize(text)
print('finished tokenizing the text')

def save_tokens(filename, tokens):
    # save the list somewhere, either json or pickle, so you can pick up later if something goes wrong
    pass
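One way to fill that stub in (json is just one option; pickle would work as well, and load_tokens is an added helper for the restart case):

import json

def save_tokens(filename, tokens):
    # write the token list to disk so a later run can resume from here
    with open(filename, 'w') as fp:
        json.dump(tokens, fp)

def load_tokens(filename):
    with open(filename) as fp:
        return json.load(fp)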

Making the n-grams, counting and writing them to disk

Your code contains a lot of boilerplate that does the same thing with just a different function or filename, so instead I abstract this away into a list of tuples containing the name, the function to generate the n-grams, the function to count them, and the filename to save to:

import collections

ngrams = [
    ('bigrams', nltk.bigrams, collections.Counter, 'bigrams.csv'),
    ('trigrams', nltk.trigrams, collections.Counter, 'trigrams.csv'),
    ('quadgrams', nltk.collocations.QuadgramCollocationFinder.from_words, parse_quadgrams, 'quadgrams.csv'),
]

If you want to count how many times each item occurs in a list, just use collections.Counter instead of calling list.count() for every item and building an (expensive) collections.OrderedDict for each one. If you want to do the counting yourself, it is better to use tuples than OrderedDict. You could also use pd.Series.value_counts().
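As a concrete sketch, the bigram block from the question collapses to something like this (reusing the tokens list from above):

from collections import Counter
import csv

import nltk

bigram_counts = Counter(nltk.bigrams(tokens))  # one pass over the data instead of list.count() per bigram
with open('bigrams.csv', 'w', newline='') as ofp:
    writer = csv.writer(ofp)
    writer.writerow(['bigrams', 'count'])
    for gram, count in sorted(bigram_counts.items()):
        writer.writerow([' '.join(gram), count])

The generic loop below does the same for all three n-gram sizes; parse_quadgrams just pulls the counts out of the collocation finder: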

def parse_quadgrams(quadgrams):
    return quadgrams.ngram_fd #from what I see in the code this dict already contains the counts

for name, ngram_method, parse_method, output_file in ngrams:
    grams = ngram_method(tokens)
    print('finished generating ', name)
    # You could write this intermediate result to a temporary file in case something goes wrong
    counts = parse_method(grams)
    # join each n-gram tuple into a single string, otherwise pandas builds a MultiIndex from the tuple keys
    counts = {' '.join(gram): count for gram, count in counts.items()}
    count_df = pd.Series(counts).reset_index().rename(columns={'index': name, 0: 'count'})
    # if you need it sorted you can do this on the DataFrame
    print('finished counting ', name)
    count_df.to_csv(output_file, index=False)
    print('finished writing ', name, ' to file: ', output_file)
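For instance, to write the most frequent n-grams first (the sorting hinted at in the comment inside the loop), you could add this before the to_csv call:

count_df = count_df.sort_values('count', ascending=False)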
