Issue storing data into BigQuery using a multi-threaded approach in Python

I am implementing a Python script to fetch existing user data from a Google BigQuery database, then use a multi-threaded approach to do some web-scraping work for each user, and finally store the results in another table on BigQuery. There are around 3.6 million existing user records, and the scraping takes up to 40 seconds per user. My goal is to process 100,000 users per day; at up to 40 seconds each that is as much as 4,000,000 seconds (roughly 46 days) of serial work, which is why I need a concurrent processing approach.

I'm using ThreadPoolExecutor from the concurrent.futures module. After a given number of threads have finished their work, the script is supposed to store the corresponding batch of results back in BigQuery. I can see the threads continuing to perform their web-scraping work, but after a certain amount of time (or with a large number of threads), records stop being stored back in the database.

At first, I thought I was dealing with some race conditions around clearing the batch of results, but I've since implemented a BoundedSemaphore from the threading module as a locking mechanism, which I believe solved that original issue. However, the results are still not being reliably stored back in the database, so maybe I missed something?

I could use some help from someone with a lot of experience in concurrent processing in Python. Specifically, I'm running the script on a Heroku server, so Heroku experience might be helpful as well. Thanks! A snippet of my code is below:

from concurrent.futures import ThreadPoolExecutor, as_completed
from threading import BoundedSemaphore

# MIN_ID, MAX_ID, LIMIT, MAX_THREADS, and BATCH_SIZE are constants defined elsewhere

service = BigQueryService() # a custom class defined elsewhere

users = service.fetch_remaining_users(min_id=MIN_ID, max_id=MAX_ID, limit=LIMIT) # gets users from BigQuery
print("FETCHED UNIVERSE OF", len(users), "USERS")

with ThreadPoolExecutor(max_workers=MAX_THREADS, thread_name_prefix="THREAD") as executor:
    batch = []
    lock = BoundedSemaphore()
    futures = [executor.submit(user_with_friends, row) for row in users]
    print("FUTURE RESULTS", len(futures))
    for index, future in enumerate(as_completed(futures)):
        #print(index)
        result = future.result()

        # OK, so this locking business:
        # ... prevents random threads from clearing the batch, which was causing results to almost never get stored, and
        # ... restricts a thread's ability to acquire access to the batch until another one has released it
        lock.acquire()
        batch.append(result)
        if (len(batch) >= BATCH_SIZE) or (index + 1 >= len(futures)): # when batch is full or is last
            print("-------------------------")
            print(f"SAVING BATCH OF {len(batch)}...")
            print("-------------------------")
            service.append_user_friends(batch) # stores the results in another table on BigQuery
            batch = []
        lock.release()
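
(Side note: the same guard can also be written with threading.Lock used as a context manager, which guarantees the lock is released even if the BigQuery call raises. This is only a sketch of the batching loop above with that one substitution; futures, service, BATCH_SIZE, and append_user_friends are the same as in the snippet.)

from threading import Lock

lock = Lock()
batch = []
for index, future in enumerate(as_completed(futures)):
    result = future.result()
    with lock: # acquired here, released automatically even if append_user_friends() raises
        batch.append(result)
        if (len(batch) >= BATCH_SIZE) or (index + 1 >= len(futures)): # when batch is full or is last
            service.append_user_friends(batch)
            batch = []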

See also:

https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor

https://docs.python.org/3.7/library/threading.html#threading.BoundedSemaphore

So I ended up using a different approach (see below), which works more reliably. The old approach coordinated between threads to store results, while the new one processes and stores a batch per thread.

from concurrent.futures import ThreadPoolExecutor
from threading import current_thread

# BATCH_SIZE, MAX_THREADS, and generate_timestamp() are defined elsewhere

def split_into_batches(all_users, batch_size=BATCH_SIZE):
    """h/t: https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks"""
    for i in range(0, len(all_users), batch_size):
        yield all_users[i : i + batch_size]

def process_and_save_batch(user_rows, bq):
    print(generate_timestamp(), "|", current_thread().name, "|", "PROCESSING...")
    bq.append_user_friends([user_with_friends(user_row) for user_row in user_rows])
    print(generate_timestamp(), "|", current_thread().name, "|", "PROCESSED BATCH OF", len(user_rows))
    return True

service = BigQueryService() # a custom class defined elsewhere

users = service.fetch_remaining_users(min_id=MIN_ID, max_id=MAX_ID, limit=LIMIT)
print("FETCHED UNIVERSE OF", len(users), "USERS")

batches = list(split_into_batches(users))
print(f"ASSEMBLED {len(batches)} BATCHES OF {BATCH_SIZE}")

with ThreadPoolExecutor(max_workers=MAX_THREADS, thread_name_prefix="THREAD") as executor:

    for batch in batches:
        executor.submit(process_and_save_batch, batch, service)

When I significantly increase the thread count to a number like 2500, the script hardly stores any results at all (a behavior I'd still like to investigate further), but at relatively low thread counts it runs fine and does the job.
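
One thing that may help with that investigation: an exception raised inside a callable passed to executor.submit is captured on the returned Future and only re-raised when its result() is called, so the fire-and-forget loop above never surfaces failures. Below is a minimal sketch (assuming the same batches, service, and MAX_THREADS as above) that keeps the futures and reports any errors:

from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=MAX_THREADS, thread_name_prefix="THREAD") as executor:
    futures = [executor.submit(process_and_save_batch, batch, service) for batch in batches]
    for future in as_completed(futures):
        try:
            future.result() # re-raises any exception from the worker thread
        except Exception as err:
            print("BATCH FAILED:", err) # surface failures instead of losing them silently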
