简体   繁体   中英

In Python, should I use threading for this?

I'm working on a small program that collects certain data and processes it. Currently, the program runs constantly on my server, storing the data to the disk. Every once in a while I run another program that reads the stored data, processes it, sorts it, saves it to a new location, and clears the old data files.

I've never learned about threads, but it SOUNDS like this is a good place to use them? If threading works the way I think it does, I could set up a queue to hold the data, and have a separate thread that could pull data from the queue and process it as it's ready. If the queue is full, thread1 could sleep for a bit. If it's empty, thread2 could sleep for a bit

That would reduce disk writing, get rid of disk reading, and make the data collection run side-by-side with the data processing to save time.

Is any of this accurate? I'm a senior CS student and threads have never once come up (Surely that's a little odd?). I would appreciate any tips/knowledge/advice with using threads, as well as if this is the correct solution to my "problem".

Thanks!

That does sound like a situation where some form of parallelism might be useful. However, since this is Python, you might not want to actually use threads. Python, in the standard implementation, has something called the Global Interpreter Lock. Effectively, to allow the garbage collector to work, only one thread of a Python program can actually be running Python code at any time (modules written directly in C, or external operations such as disk IO or database queries, are not "running Python code" for this purpose, though you will have called them from Python).

Because of this, threading in Python is generally only a good idea if your Python code spends significant amounts of time waiting on responses from non-Python parts of the program, or external sources. If the data collection or processing is being done outside Python (collecting from a database or website, processing in numpy, etc.), that might be reasonable. If your code isn't in this situation often enough, your program ends up wasting more time switching between threads than it gains (because if two threads are both in Python code it still only runs one at a time)

If not, you should try the multiprocessing module instead. This is also a generally safer model, as the only things that can be shared between processes in multiprocessing are the things you explicitly share (while threads share all state, potentially allowing one thread to break another because you forgot to lock something).

Alternately, you might use subprocess . Effectively, this would be having your first program intermittently restart the second one every time it finishes a batch of data.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM