Extracting specific lines from a large file
I have a large file (5,000,000 lines) with records in the format:
'User ID,Mov ID,Rating,Timestamp'
I also have another, much smaller file (200,000 lines) with records in the format:
'User ID, Mov ID'
I have to generate a new file such that if a (User ID, Mov ID) pair from the second file matches any of the 5,000,000 records of the first file, that record should not be included in the new file. In other words, the new file consists of the (User ID, Mov ID) pairs of the first file that have nothing in common with file2 (200,000 lines).
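In other words, the task is an anti-join: keep the pairs of the big file that do not appear in the small file. On made-up toy data, the desired result looks like this:

```python
# Hypothetical toy data illustrating the required anti-join.
big = [("sam", "apple", "0.6"), ("sam", "banana", "0.7"),
       ("tom", "apple", "0.3"), ("tom", "pear", "0.9")]
small = {("sam", "apple"), ("sam", "pear"), ("tom", "apple")}

# Keep only the (user, movie) pairs of the big file absent from the small file.
kept = [(u, m) for (u, m, _) in big if (u, m) not in small]
print(kept)  # [('sam', 'banana'), ('tom', 'pear')]
```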
I am trying this naive approach, but it takes too much time. Is there a faster algorithm I could implement?:
from sys import argv
import re

script, filename1, filename2 = argv

# open files
testing_small = open(filename1)
ratings = open(filename2)

# open file to write the data
ratings_training = open("ratings_training.csv", 'w')

for line_rating in ratings:
    flag = 0
    testing_small.seek(0)
    for line_test in testing_small:
        matched_line = re.match(line_test.rstrip(), line_rating)
        if matched_line:
            flag = 1
            break
    if flag == 0:
        ratings_training.write(line_rating)

testing_small.close()
ratings.close()
ratings_training.close()
I can also use any Spark-based approach.
例如:
# df1:
User_ID,Mov_ID,Rating,Timestamp
sam,apple,0.6,2017-03-17 09:04:39
sam,banana,0.7,2017-03-17 09:04:39
tom,apple,0.3,2017-03-17 09:04:39
tom,pear,0.9,2017-03-17 09:04:39
# df2:
User_ID,Mov_ID
sam,apple
sam,pear
tom,apple
In pandas:
import pandas as pd
df1 = pd.read_csv('./disk_file')
df2 = pd.read_csv('./tmp_file')
res = pd.merge(df1, df2, on=['User_ID', 'Mov_ID'], how='left', indicator=True)
res = res[res['_merge'] == 'left_only']
print(res)
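The same merge can be verified on the sample frames above with the data inlined, so the snippet runs without any files on disk (assuming pandas is installed):

```python
import io
import pandas as pd

# Sample data from above, inlined as CSV text.
df1 = pd.read_csv(io.StringIO(
    "User_ID,Mov_ID,Rating,Timestamp\n"
    "sam,apple,0.6,2017-03-17 09:04:39\n"
    "sam,banana,0.7,2017-03-17 09:04:39\n"
    "tom,apple,0.3,2017-03-17 09:04:39\n"
    "tom,pear,0.9,2017-03-17 09:04:39\n"))
df2 = pd.read_csv(io.StringIO(
    "User_ID,Mov_ID\nsam,apple\nsam,pear\ntom,apple\n"))

# indicator=True adds a '_merge' column recording where each row came from;
# 'left_only' rows exist in df1 but not in df2 (anti-join semantics).
res = pd.merge(df1, df2, on=['User_ID', 'Mov_ID'], how='left', indicator=True)
res = res[res['_merge'] == 'left_only']
print(res[['User_ID', 'Mov_ID']].values.tolist())  # [['sam', 'banana'], ['tom', 'pear']]
```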
Or in Spark:
from pyspark import SparkConf
from pyspark.sql import SparkSession

cfg = SparkConf().setAppName('MyApp')
spark = SparkSession.builder.config(conf=cfg).getOrCreate()

df1 = spark.read.load(path='file:///home/zht/PycharmProjects/test/disk_file', format='csv', sep=',', header=True)
df2 = spark.read.load(path='file:///home/zht/PycharmProjects/test/tmp_file', format='csv', sep=',', header=True)

res = df1.join(df2, on=[df1['User_ID'] == df2['User_ID'], df1['Mov_ID'] == df2['Mov_ID']], how='left_outer')
# keep only the rows that found no match in df2 (anti-join, matching the pandas 'left_only' filter)
res = res.filter(df2['User_ID'].isNull())
res.show()
You should keep the smaller file in memory the whole time. Then you can process the large file line by line without holding all of it in memory.
Code is untested:
# read the smaller filter file into a set for O(1) membership tests
filter_pairs = set()
with open(reffile, "rt") as f:
    for line in f:
        user, movie = line.strip().split(",")
        filter_pairs.add((user, movie))

# stream the large file and write the filtered data
with open(outfile, "wt") as f_out:
    with open(bigfile, "rt") as f_in:
        for line in f_in:
            user, movie, _, _ = line.strip().split(",")
            if (user, movie) not in filter_pairs:
                print(",".join((user, movie)), file=f_out)
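The set-based filter above can be exercised end to end on the question's sample data; the file names here are temporary and invented for the demo:

```python
import csv
import os
import tempfile

# Write the sample files from the question into a temp directory.
tmp = tempfile.mkdtemp()
bigfile = os.path.join(tmp, "ratings.csv")
reffile = os.path.join(tmp, "pairs.csv")
with open(bigfile, "w") as f:
    f.write("sam,apple,0.6,2017-03-17 09:04:39\n"
            "sam,banana,0.7,2017-03-17 09:04:39\n"
            "tom,apple,0.3,2017-03-17 09:04:39\n"
            "tom,pear,0.9,2017-03-17 09:04:39\n")
with open(reffile, "w") as f:
    f.write("sam,apple\nsam,pear\ntom,apple\n")

# Load the small file into a set of (user, movie) pairs ...
pairs = set()
with open(reffile) as f:
    for user, movie in csv.reader(f):
        pairs.add((user, movie))

# ... then stream the big file once, keeping only unmatched pairs.
kept = []
with open(bigfile) as f:
    for user, movie, _, _ in csv.reader(f):
        if (user, movie) not in pairs:
            kept.append((user, movie))
print(kept)  # [('sam', 'banana'), ('tom', 'pear')]
```

This does one pass over the large file instead of rescanning it for every line of the small file, which is what made the original nested-loop approach quadratic.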