How to efficiently read a large (30GB+) TAR file of BZ2 JSON Twitter files into PostgreSQL
I'm trying to obtain Twitter data from the archive.org archive and load it into a database. I am attempting to first load all tweets for a specific month, and then make a selection of tweets, staging only those I'm interested in (e.g. by locale or hashtag).
I am able to run the script described below to do what I'm looking for, but it is incredibly slow. It has run for approximately half an hour and has read only about 6 of the 50,000 inner .bz2 files in one TAR file.
Some stats of an example TAR file:
What should I be looking for when optimizing this process for speed?
The script is currently using ~3% of my CPU and ~6% of my RAM.
Any help is greatly appreciated.
import bz2
import datetime
import json
import tarfile

import dataset  # Using dataset as I'm still iteratively developing the table structure(s)


def scrape_tar_contents(filename):
    """Iterate over an input TAR file, retrieving each inner .bz2 container:
    extract its JSON contents and store them in a PostgreSQL database."""
    tar = tarfile.open(filename, 'r')
    inner_files = [name for name in tar.getnames() if name.endswith('.bz2')]
    num_bz2_files = len(inner_files)
    bz2_count = 1
    print('Starting work on file... ' + filename[-20:])
    for bz2_filename in inner_files:  # Loop over all .bz2 files in the TAR archive
        print('Starting work on inner file... ' + bz2_filename[-20:] +
              ': ' + str(bz2_count) + '/' + str(num_bz2_files))
        t_extract = tar.extractfile(bz2_filename)
        data = t_extract.read()
        txt = bz2.decompress(data).decode('utf-8')  # decompress to text before splitting
        tweet_errors = 0
        current_line = 1
        lines = txt.split('\n')
        num_lines = len(lines)
        for line in lines:  # Loop over the lines in the resulting text file.
            if current_line % 100 == 0:
                print('Working on line ' + str(current_line) + '/' + str(num_lines))
            try:
                tweet = json.loads(line)
            except ValueError as e:
                error_log = {'Date_time': datetime.datetime.now(),
                             'File_TAR': filename,
                             'File_BZ2': bz2_filename,
                             'Line_number': current_line,
                             'Line': line,
                             'Error': str(e)}
                tweet_errors += 1
                db['error_log'].upsert(error_log, ['File_TAR', 'File_BZ2', 'Line_number'])
                print('Error occurred, now at ' + str(tweet_errors))
                current_line += 1
                continue  # no parsed tweet to store; move on to the next line
            try:
                data = {'tweet_id': tweet['id'],
                        'tweet_text': tweet['text'],
                        'tweet_locale': tweet['lang'],
                        'created_at_str': tweet['created_at'],
                        'date_loaded': datetime.datetime.now(),
                        'tweet_json': tweet}
                db['tweets'].upsert(data, ['tweet_id'])
            except KeyError as e:
                error_log = {'Date_time': datetime.datetime.now(),
                             'File_TAR': filename,
                             'File_BZ2': bz2_filename,
                             'Line_number': current_line,
                             'Line': line,
                             'Error': str(e)}
                tweet_errors += 1
                db['error_log'].upsert(error_log, ['File_TAR', 'File_BZ2', 'Line_number'])
                print('Error occurred, now at ' + str(tweet_errors))
            current_line += 1
        bz2_count += 1


if __name__ == "__main__":
    with open("postgresConnecString.txt", 'r') as f:
        db_connectionstring = f.readline()
    db = dataset.connect(db_connectionstring)  # module-level global used by scrape_tar_contents
    filename = r'H:/Twitter datastream/Sourcefiles/archiveteam-twitter-stream-2013-01.tar'
    scrape_tar_contents(filename)
A tar file does not contain an index of where files are located. Moreover, a tar file can contain more than one copy of the same file. Therefore, when you extract one file, the entire tar file must be read. Even after it finds the file, the rest of the tar file must still be read to check whether a later copy exists.

That makes extracting one file as expensive as extracting all the files.
Therefore, never use tar.extractfile(...) by name on a large tar file (unless you only need one file, or don't have the space to extract everything).
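If you genuinely don't have the space, one workaround worth knowing about (a sketch of mine, not part of this answer) is tarfile's streaming mode, 'r|', which reads the archive front to back exactly once; calling extractfile on the member you are currently positioned at in the stream doesn't rescan anything:

import bz2
import tarfile

# Minimal sketch, assuming a tar of .bz2 members; 'archive.tar' is a placeholder.
# 'r|' opens the archive as a forward-only stream, so iterating visits each
# member in on-disk order in a single sequential pass.
with tarfile.open('archive.tar', 'r|') as tar:
    for member in tar:  # members arrive in the order they appear in the archive
        if not member.isfile() or not member.name.endswith('.bz2'):
            continue
        handle = tar.extractfile(member)  # valid in stream mode only for the current member
        text = bz2.decompress(handle.read()).decode('utf-8')
        # ... parse JSON lines from `text` as in the question's script ...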
If you have the space (and given the size of modern hard drives, you almost certainly do), extract everything either with tar.extractall or with a system call to tar xf ..., and then process the extracted files.
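As a rough sketch of that approach (the 'extracted' output directory and the process helper are placeholders of mine):

import glob
import tarfile

# One sequential pass: unpack the whole archive to disk, then work on the
# extracted .bz2 files individually.
with tarfile.open(r'H:/Twitter datastream/Sourcefiles/archiveteam-twitter-stream-2013-01.tar') as tar:
    tar.extractall(path='extracted')  # assumed output directory

for bz2_path in glob.glob('extracted/**/*.bz2', recursive=True):
    process(bz2_path)  # hypothetical per-file handler, e.g. the decompress/insert logic above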