
How to effectively read large (30GB+) TAR file with BZ2 JSON twitter files into PostgreSQL

I'm trying to obtain twitter data from the archive.org archive and load it into a database. I am attempting to first load all tweets for a specific month, and then select and stage only the tweets I'm interested in (e.g. by locale or hashtag).

I am able to run the script described below to do what I'm looking for, but it is incredibly slow. It has run for approximately half an hour and has only read ~6 of the 50,000 inner .bz2 files in one TAR file.

Some stats of an example TAR file:

  • Total size: ~30-40 GB
  • Number of inner .bz2 files (arranged in folders): 50,000
  • Size of one .bz2 file: ~600 KB
  • Size of one extracted JSON file: ~5 MB, ~3,600 tweets

What should I be looking for when optimizing this process for speed?

  • Should I extract the files to disk instead of buffering them in Python?
  • Should I look at multithreading a part of the process? Which part of the process would be optimal for this?
  • Alternatively, is the speed I'm currently obtaining relatively normal for such a script?

The script is currently using ~3% of my CPU and ~6% of my RAM.

Any help is greatly appreciated.

import tarfile
import bz2
import dataset  # Using dataset as I'm still iteratively developing the table structure(s)
import json
import datetime


def scrape_tar_contents(filename):
    """Iterates over an input TAR filename, retrieving each .bz2 container:
       extracts & retrieves JSON contents; stores JSON contents in a postgreSQL database"""
    tar = tarfile.open(filename, 'r')
    inner_files = [name for name in tar.getnames() if name.endswith('.bz2')]

    num_bz2_files = len(inner_files)
    bz2_count = 1
    print('Starting work on file... ' + filename[-20:])
    for bz2_filename in inner_files: # Loop over all files in the TAR archive
        print('Starting work on inner file... ' + bz2_filename[-20:] + ': ' + str(bz2_count) + '/' + str(num_bz2_files))
        t_extract = tar.extractfile(bz2_filename)
        data = t_extract.read()
        txt = bz2.decompress(data).decode('utf-8')  # Decompress the .bz2 member into text

        tweet_errors = 0
        lines = txt.split('\n')
        num_lines = len(lines)
        for current_line, line in enumerate(lines, 1):  # Loop over the lines in the resulting text file.
            if current_line % 100 == 0:
                print('Working on line ' + str(current_line) + '/' + str(num_lines))
            if not line.strip():  # Splitting on '\n' leaves a trailing empty string; skip blank lines
                continue
            try:
                tweet = json.loads(line)
            except ValueError as e:
                error_log = {'Date_time': datetime.datetime.now(),
                             'File_TAR': filename,
                             'File_BZ2': bz2_filename,
                             'Line_number': current_line,
                             'Line': line,
                             'Error': str(e)}
                tweet_errors += 1
                db['error_log'].upsert(error_log, ['File_TAR', 'File_BZ2', 'Line_number'])
                print('Error occurred, now at ' + str(tweet_errors))
                continue  # No parsed tweet to store; move on to the next line
            try:
                tweet_id = tweet['id']
                tweet_text = tweet['text']
                tweet_locale = tweet['lang']
                created_at = tweet['created_at']
                data = {'tweet_id': tweet_id,
                        'tweet_text': tweet_text,
                        'tweet_locale': tweet_locale,
                        'created_at_str': created_at,
                        'date_loaded': datetime.datetime.now(),
                        'tweet_json': tweet}
                db['tweets'].upsert(data, ['tweet_id'])
            except KeyError as e:
                error_log = {'Date_time': datetime.datetime.now(),
                             'File_TAR': filename,
                             'File_BZ2': bz2_filename,
                             'Line_number': current_line,
                             'Line': line,
                             'Error': str(e)}
                tweet_errors += 1
                db['error_log'].upsert(error_log, ['File_TAR', 'File_BZ2', 'Line_number'])
                print('Error occurred, now at ' + str(tweet_errors))
                continue
        bz2_count += 1

if __name__ == "__main__":
    with open("postgresConnecString.txt", 'r') as f:
        db_connectionstring = f.readline().strip()  # Strip the trailing newline from the connection string
    db = dataset.connect(db_connectionstring)

    filename = r'H:/Twitter datastream/Sourcefiles/archiveteam-twitter-stream-2013-01.tar'
    scrape_tar_contents(filename)

A tar file does not contain an index of where files are located. Moreover, a tar file can contain more than one copy of the same file. Therefore, when you extract one file, the entire tar file must be read. Even after it finds the file, the rest of the tar file must still be read to check if a later copy exists.

That makes extraction of one file as expensive as extracting all the files.

Therefore, never use tar.extractfile(...) on a large tar file (unless you only need one file or don't have the space to extract everything).
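
If disk space is tight, one common alternative (not part of the answer above, just a sketch) is to open the archive in streaming mode ('r|'), which reads each member exactly once in a single sequential pass instead of looking members up by name:

import bz2
import tarfile

with tarfile.open('archiveteam-twitter-stream-2013-01.tar', 'r|') as tar:
    for member in tar:  # Sequential pass over the archive; no name lookups
        if not member.name.endswith('.bz2'):
            continue
        data = tar.extractfile(member).read()  # Valid for the current member in stream mode
        txt = bz2.decompress(data).decode('utf-8')
        # ... parse the JSON lines here, as in the question's script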

If you have the space (and given the size of modern hard drives, you almost certainly do), extract everything either with tar.extractall or with a system call to tar xf ..., and then process the extracted files.
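
A minimal sketch of that approach, assuming the extracted files fit on disk; the destination directory and the handle_tweet_line() helper are hypothetical placeholders standing in for the JSON parsing and upsert logic from the question:

import bz2
import os
import tarfile

def extract_and_process(tar_path, dest_dir):
    with tarfile.open(tar_path, 'r') as tar:
        tar.extractall(dest_dir)  # One pass over the archive to extract everything
    for root, _dirs, files in os.walk(dest_dir):  # Then walk the extracted .bz2 files
        for name in files:
            if not name.endswith('.bz2'):
                continue
            with bz2.BZ2File(os.path.join(root, name)) as fh:
                for raw in fh:  # BZ2File yields decompressed lines directly
                    line = raw.decode('utf-8').strip()
                    if line:
                        handle_tweet_line(line)  # hypothetical: json.loads + upsert, as in the question

extract_and_process(r'H:/Twitter datastream/Sourcefiles/archiveteam-twitter-stream-2013-01.tar',
                    r'H:/Twitter datastream/Extracted')  # hypothetical destination directory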
