
Why is using global in python threading bad practice?

Original 2018-07-26 22:08:24 8 1 python/ multithreading/ global

I read all over various websites that using global is bad. I have an application where I am storing, say, 300 objects in an array. I want to have 8 threads working through these 300 objects. The objects are different sizes, say between 10 and 50,000 integers, and randomly distributed (think worst-case scenario here).

Basically, I want to start up 8 threads, have each one process an object, report or store the results, and pick up a new object, until all 300 objects are done.

The solution I can think of is to set a global lock and a global counter: lock the array, get the current object, increment the counter, release the lock.

There is 1 lock for 8 threads. There is 1 counter for 8 threads. I have 2 global objects. I store results in a dictionary, possibly also global, to make it visible to all threads while keeping it thread-safe. I am not bothering to do something stupid like subclassing thread and passing along 300/8 objects to each one, because multiprocessing.pool does that for me. So how would you do it? Also, convince me that using global in this situation is bad.

1 solution

Classifying approaches as either "good" or "bad" is a bit simplistic -- in practice, if a design makes sense to you and accomplishes the goals you set out to accomplish, then it doesn't matter whether other people (except possibly your boss) think it's "good" or not; it either works or it doesn't. On the other hand, if your design causes you a lot of pain and suffering, that's a sign that you might not be using the most suitable design for the task at hand.

That said, there are some valid reasons why a lot of people think that global variables are problematic, particularly when combined with multithreading.

The general problem with global variables (with or without multithreading) is that as your program grows larger, it becomes increasingly difficult to mentally keep track of which parts of your program might be reading and/or updating the global variables' values at which times. Since they are global, by definition all parts of your program have access to them, so when you're trying to trace through your program to figure out who set a global variable to some unexpected value, the list of suspects can become unmanageably large. (This isn't much of a problem for small programs, but the larger your program grows, the worse the problem becomes -- and a lot of programmers have learned, through painful experience, that it's better to nip the problem in the bud by avoiding globals wherever possible in the first place than to have to go back and rewrite a big, complicated, buggy program later on.)

In the specific use-case of a multithreaded program, the anybody-could-be-accessing-my-global-variable-at-any-time property becomes even more fraught with peril, since in a multithreaded scenario, any (non-immutable) data that is shared between threads can only be safely accessed with proper serialization (e.g. by locking a mutex before reading/writing the shared data, and unlocking it afterwards). Ideally programmers would never accidentally read or write any shared+mutable data without locking the mutex -- but programmers are human and will inevitably make mistakes; if given the ability to do so, sooner or later you (or someone else) will forget that access to a particular global variable needs to be serialized, and will just go ahead and read/write it. Then you're in for a lot of pain, because the symptoms will be rare and random, and the cause of the fault won't be obvious.
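To make "locking a mutex before reading/writing the shared data, and unlocking it afterwards" concrete, here is a minimal sketch using `threading.Lock`. The function name `run_locked_counter` and the counter dictionary are made up for illustration; they are not from the question.

```python
import threading


def run_locked_counter(num_threads=8, increments_per_thread=1000):
    """Increment one shared counter from several threads, serializing each update."""
    counter = {"value": 0}    # shared, mutable state (illustrative)
    lock = threading.Lock()   # mutex guarding `counter`

    def worker():
        for _ in range(increments_per_thread):
            # `with lock` acquires the mutex before the update and
            # releases it afterwards, even if an exception is raised.
            with lock:
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]
```

Every update goes through the lock, so the final count is always `num_threads * increments_per_thread`. The danger described above is the one code path that forgets the `with lock:` line.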

So smart programmers try to make it impossible to fall into that sort of trap, usually by limiting access to the shared state to a specific, small, carefully-written set of functions (aka an API) that implement the serialization correctly so that no other code has to. When doing that, you want to make sure that only the code in this particular API has access to the shared data, and that nobody else does -- something that is impossible to do with a global variable, as by definition everyone has direct access to it.
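As a rough sketch of that idea applied to the question's lock-plus-counter design, the shared state can live inside one small class whose methods are the only code that touches the lock. The names `WorkDispenser`, `take`, `store`, and `process_all` are hypothetical. Note that Python does not enforce privacy; the leading underscores are only a convention.

```python
import threading


class WorkDispenser:
    """Hypothetical API: the only code that touches the lock, index, and results."""

    def __init__(self, items):
        self._items = list(items)
        self._next_index = 0
        self._results = {}
        self._lock = threading.Lock()

    def take(self):
        """Return (index, item) for the next unclaimed item, or None when done."""
        with self._lock:
            if self._next_index >= len(self._items):
                return None
            index = self._next_index
            self._next_index += 1
            return index, self._items[index]

    def store(self, index, result):
        """Record the result for one item."""
        with self._lock:
            self._results[index] = result

    def results(self):
        """Return a copy of the results collected so far."""
        with self._lock:
            return dict(self._results)


def process_all(items, func, num_threads=8):
    """Run func over items with num_threads workers sharing one dispenser."""
    dispenser = WorkDispenser(items)  # local, handed only to the workers below

    def worker():
        while True:
            claimed = dispenser.take()
            if claimed is None:
                return
            index, item = claimed
            dispenser.store(index, func(item))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dispenser.results()
```

For example, `process_all(objects, sum)` would return a dictionary mapping each object's index to its sum. The worker code never sees the lock, so it cannot forget to use it.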

There is also one performance-related reason why people prefer not to mix global variables and multithreading: the more serialization you have to do, the less your program can exploit the power of multiple CPU cores. In particular, it does you no good to have an 8-core CPU if 7 of your 8 threads are spending most of their time blocked, waiting for a mutex to become available.

So how does that relate to globals? It's related in that in most cases it's difficult or impossible to prove that a global variable won't ever be accessed by another thread, which means all accesses to that global variable need to be serialized. With a non-global variable, on the other hand, you can make sure to give a reference to that variable to only a single thread -- at which point you have effectively guaranteed that only that one thread will ever access the variable (since the other threads have no references to it, you know they can't access it). Because you have that guarantee, you no longer need to serialize access to that data, and now your thread can run more efficiently because it never has to block waiting for a mutex.
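One way to sketch "give a reference to only a single thread" is to let a thread pool hand each object to exactly one worker call and collect the return values, so the per-object work needs no lock of its own. This uses `concurrent.futures.ThreadPoolExecutor`, which is not mentioned in the question. `summarize` is a stand-in for whatever processing is actually done.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize(numbers):
    """Stand-in for the real per-object work; touches only its argument and locals."""
    return sum(numbers), len(numbers)


def summarize_all(objects, num_threads=8):
    """Process each object in exactly one worker call; results keep input order."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Each call to summarize receives a reference to a single object,
        # and no other task is handed that same reference.
        return list(pool.map(summarize, objects))
```

The pool still synchronizes internally when it hands out tasks, but that serialization lives inside the library rather than in your own code, and there are no globals for anyone to touch by mistake.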

(Btw, note that CPython in particular suffers from a severe form of implicit serialization caused by Python's Global Interpreter Lock, which means that even the best multithreaded, CPU-bound Python code will be unlikely to use more than a single CPU core at a time. The usual ways to get around that are to use multiprocessing instead, or to do the bulk of your program's computations in a lower-level language such as C, so that it can execute without holding the GIL.)