
Memory efficient way of keeping track of unique entries

I have a folder of files totalling around 50GB. Each file consists of line after line of JSON data, and in this JSON structure there is a field for user_id.

I need to count the number of unique User IDs across all of the files (and only need the total count). What is the most memory efficient and relatively quick way of counting these?

Of course, loading everything into a huge list maybe isn't the best option. I tried pandas but it took quite a while. I then tried to simply write the IDs to text files, but I thought I'd find out if I was missing something far simpler.

Since you only need the user_ids, load each .json file (as a data structure), extract any ids, then destroy all references to that structure and any of its parts so that it is garbage collected.

To speed up the process, you can do this in a few processes in parallel; take a look at multiprocessing.Pool.map.
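Here is a minimal sketch of that idea, assuming each file is in JSON Lines format (one JSON object per line); the folder path and the pool size are placeholders to adjust:

import json
import multiprocessing
from pathlib import Path

def extract_ids(path):
    # parse one file line by line and return the set of user_ids found in it
    ids = set()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if 'user_id' in record:
                ids.add(record['user_id'])
    return ids  # the parsed dicts go out of scope here and can be garbage collected

if __name__ == '__main__':
    files = list(Path('path/to/the/folder').glob('*.json'))
    unique_ids = set()
    with multiprocessing.Pool(processes=4) as pool:
        # each worker holds only one file's data at a time
        for ids in pool.map(extract_ids, files):
            unique_ids |= ids
    print(len(unique_ids))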

Since it was stated that the JSON context of user_id does not matter, we just treat the JSON files as the pure text files they are.

GNU tools solution

I'd not use Python at all for this, but rather rely on the tools provided by GNU, and pipes:

cat *.json | sed -nE 's/\s*\"user_id\"\s*\:\s*\"([0-9]+)\"\s*/\1/p' | sort -un --parallel=4 | wc -l
  • cat *.json : Output the contents of all files to stdout
  • sed -nE 's/\s*\"user_id\"\s*\:\s*\"([0-9]+)\"\s*/\1/p' : Look for lines containing "user_id": "{number}" and print only the number to stdout
  • sort -un --parallel=4 : Sort the output numerically, ignoring duplicates (i.e. output only unique values), using multiple (4) jobs, and output to stdout
  • wc -l : Count the number of lines, and output to stdout

To determine which values are unique, we simply sort them; sort -u discards duplicates during the sort, so wc -l ends up counting one line per unique ID. You can speed up the sorting by specifying a higher number of parallel jobs, depending on your core count.

Python solution

If you want to use Python nonetheless, I'd recommend using a set and re (regular expressions):

import fileinput
import re

# matches lines of the form: "user_id": "<number>"
r = re.compile(r'\s*"user_id"\s*:\s*"([0-9]+)"\s*')

s = set()
for line in fileinput.input():  # iterate over the lines of all files given on the command line
    m = r.match(line)
    if m:
        s.add(m.group(1))  # the set keeps each ID only once

print(len(s))

Run this using python3 <scriptname>.py *.json.

Try the simplest approach first.

Write a function get_user_ids(filepath) that returns the list of user_id values found in a JSON file.
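For example, a minimal sketch of such a function, assuming each file holds one JSON object per line (adjust the parsing if your files are single large JSON documents):

import json

def get_user_ids(filepath):
    # return the user_id values found in one file, one JSON object per line assumed
    ids = []
    with open(filepath) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if 'user_id' in record:
                ids.append(record['user_id'])
    return ids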

Then do:

from pathlib import Path
the_folder = Path("path/to/the/folder")
user_ids = set()
for jsonpath in the_folder.glob('*.json'):
    user_ids.update(get_user_ids(jsonpath))
print(len(user_ids))

If the list of user IDs is so large that it can't reasonably fit into a set in memory, an easy and memory-efficient way to de-duplicate is to simply create files named after user IDs in an empty directory, and then count the number of files in the directory. This works because most filesystems are efficient at indexing file names in a directory.

import os
os.mkdir('count_unique')   # create an empty working directory; its file names act as the set of seen IDs
os.chdir('count_unique')
# change the following demo tuple to a generator that reads your JSON files and yields user IDs
for user_id in 'b', 'c', 'b', 'a', 'c':
    open(user_id, 'w').close()  # re-creating an existing file leaves the directory entry count unchanged
print(sum(1 for _ in os.scandir('.')))

This outputs: 3
