
How to effectively read large (30GB+) TAR file with BZ2 JSON twitter files into PostgreSQL

I'm trying to obtain twitter data from the archive.org archive and load it into a database. I am attempting to first load all tweets for a specific month, and then select and stage only the tweets I'm interested in (e.g. by locale or hashtag).

I am able to run the script described below to do what I'm looking for, but it is incredibly slow. It has run for approximately half an hour and has only read ~6 of the 50,000 inner .bz2 files in one TAR file.

Some stats of an example TAR file:

  • Total size: ~30-40 GB
  • Number of inner .bz2 files (arranged in folders): 50,000
  • Size of one .bz2 file: ~600 KB
  • Size of one extracted JSON file: ~5 MB, ~3,600 tweets

What should I be looking for when optimizing this process for speed?

  • Should I extract the files to disk instead of buffering them in Python?
  • Should I look at multithreading a part of the process? Which part of the process would be optimal for this?
  • Alternatively, is the speed I'm currently obtaining relatively normal for such a script?

The script is currently using ~3% of my CPU and ~6% of my RAM.

Any help is greatly appreciated.

import tarfile
import bz2
import dataset  # Using dataset as I'm still iteratively developing the table structure(s)
import json
import datetime


def scrape_tar_contents(filename):
    """Iterates over an input TAR filename, retrieving each .bz2 container:
       extracts & retrieves JSON contents; stores JSON contents in a postgreSQL database"""
    tar = tarfile.open(filename, 'r')
    inner_files = [name for name in tar.getnames() if name.endswith('.bz2')]

    num_bz2_files = len(inner_files)
    bz2_count = 1
    print('Starting work on file... ' + filename[-20:])
    for bz2_filename in inner_files: # Loop over all files in the TAR archive
        print('Starting work on inner file... ' + bz2_filename[-20:] + ': ' + str(bz2_count) + '/' + str(num_bz2_files))
        t_extract = tar.extractfile(bz2_filename)
        data = t_extract.read()
        txt = bz2.decompress(data).decode('utf-8')  # Decompress the .bz2 member into text

        tweet_errors = 0
        lines = txt.split('\n')
        num_lines = len(lines)
        for current_line, line in enumerate(lines, 1):  # Loop over the lines in the resulting text file.
            if current_line % 100 == 0:
                print('Working on line ' + str(current_line) + '/' + str(num_lines))
            if not line.strip():  # Splitting on '\n' leaves a trailing empty string; skip blank lines
                continue
            try:
                tweet = json.loads(line)
            except ValueError as e:
                error_log = {'Date_time': datetime.datetime.now(),
                             'File_TAR': filename,
                             'File_BZ2': bz2_filename,
                             'Line_number': current_line,
                             'Line': line,
                             'Error': str(e)}
                tweet_errors += 1
                db['error_log'].upsert(error_log, ['File_TAR', 'File_BZ2', 'Line_number'])
                print('Error occurred, now at ' + str(tweet_errors))
                continue  # No parsed tweet to store; move on to the next line
            try:
                tweet_id = tweet['id']
                tweet_text = tweet['text']
                tweet_locale = tweet['lang']
                created_at = tweet['created_at']
                data = {'tweet_id': tweet_id,
                        'tweet_text': tweet_text,
                        'tweet_locale': tweet_locale,
                        'created_at_str': created_at,
                        'date_loaded': datetime.datetime.now(),
                        'tweet_json': tweet}
                db['tweets'].upsert(data, ['tweet_id'])
            except KeyError as e:
                error_log = {'Date_time': datetime.datetime.now(),
                             'File_TAR': filename,
                             'File_BZ2': bz2_filename,
                             'Line_number': current_line,
                             'Line': line,
                             'Error': str(e)}
                tweet_errors += 1
                db['error_log'].upsert(error_log, ['File_TAR', 'File_BZ2', 'Line_number'])
                print('Error occurred, now at ' + str(tweet_errors))
                continue
        bz2_count += 1

if __name__ == "__main__":
    with open("postgresConnecString.txt", 'r') as f:
        db_connectionstring = f.readline().strip()  # Strip the trailing newline from the connection string
    db = dataset.connect(db_connectionstring)

    filename = r'H:/Twitter datastream/Sourcefiles/archiveteam-twitter-stream-2013-01.tar'
    scrape_tar_contents(filename)

A tar file does not contain an index of where files are located. Moreover, a tar file can contain more than one copy of the same file. Therefore, when you extract one file, the entire tar file must be read. Even after it finds the file, the rest of the tar file must still be read to check if a later copy exists.

That makes extraction of one file as expensive as extracting all the files.

Therefore, never use tar.extractfile(...) on a large tar file (unless you only need one file or don't have the space to extract everything).
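
If disk space is tight, one common alternative (not part of the answer above, just a sketch) is to open the archive in streaming mode ('r|'), which reads each member exactly once in a single sequential pass instead of looking members up by name:

import bz2
import tarfile

with tarfile.open('archiveteam-twitter-stream-2013-01.tar', 'r|') as tar:
    for member in tar:  # Sequential pass over the archive; no name lookups
        if not member.name.endswith('.bz2'):
            continue
        data = tar.extractfile(member).read()  # Valid for the current member in stream mode
        txt = bz2.decompress(data).decode('utf-8')
        # ... parse the JSON lines here, as in the question's script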

If you have the space (and given the size of modern hard drives, you almost certainly do), extract everything either with tar.extractall or with a system call to tar xf ..., and then process the extracted files.
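
A minimal sketch of that approach, assuming the extracted files fit on disk; the destination directory and the handle_tweet_line() helper are hypothetical placeholders standing in for the JSON parsing and upsert logic from the question:

import bz2
import os
import tarfile

def extract_and_process(tar_path, dest_dir):
    with tarfile.open(tar_path, 'r') as tar:
        tar.extractall(dest_dir)  # One pass over the archive to extract everything
    for root, _dirs, files in os.walk(dest_dir):  # Then walk the extracted .bz2 files
        for name in files:
            if not name.endswith('.bz2'):
                continue
            with bz2.BZ2File(os.path.join(root, name)) as fh:
                for raw in fh:  # BZ2File yields decompressed lines directly
                    line = raw.decode('utf-8').strip()
                    if line:
                        handle_tweet_line(line)  # hypothetical: json.loads + upsert, as in the question

extract_and_process(r'H:/Twitter datastream/Sourcefiles/archiveteam-twitter-stream-2013-01.tar',
                    r'H:/Twitter datastream/Extracted')  # hypothetical destination directory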
