Load whole tables including relationships into memory with JPA

Question

I have to process a huge amount of data spread over 20 tables (about 5 million records in total) and I need to load them efficiently.

I'm using Wildfly 14 and JPA/Hibernate.

Since every single record will eventually be used by the business logic (in the same transaction), I decided to pre-load the entire content of the required tables into memory simply via:

em.createQuery("SELECT e FROM Entity e", Entity.class).getResultList().size();

After that, every object should be available in the transaction and thus be retrievable via:

em.find(Entity.class, id);

But somehow this doesn't work: there are still a lot of calls to the DB, especially for the relationships.

How can I efficiently load the whole content of the required tables, including the relationships, and make sure I got everything, so that there will be no further DB calls?

What I already tried:

FetchType.EAGER: Still too many single selects / object graph too complex
EntityGraphs: Same as FetchType.EAGER
Join fetch statements: Best results so far, since they simultaneously populate the relationships to the referenced entities
2nd Level / Query Cache: Not working, probably the same problem as em.find

One thing to note is that the data is immutable (at least for a specific time) and could also be used in other transactions.

Edit:

My plan is to load and manage the entire data in a @Singleton bean. But I want to make sure I'm loading it in the most efficient way and that the entire data is loaded. No further queries should be necessary when the business logic is using the data. After a specific time (EJB timer), I'm going to discard the entire data and reload the current state from the DB (always whole tables).

Answer 1

Keep in mind that you'll likely need a 64-bit JVM and a large amount of memory. Take a look at the Hibernate 2nd Level Cache. Some things to check for, since we don't have your code:

The @Cacheable annotation tells Hibernate that the entity is cacheable.
Configure 2nd level caching to use something like ehcache, and set the maximum number of in-memory elements high enough to fit your working set.
Make sure you're not accidentally using multiple sessions in your code.

If you need to process things in this way, you may want to consider changing your design to not rely on having everything in memory, not using Hibernate/JPA, or not using an app server. This will give you more control over how things are executed. This may even be a better fit for something like Hadoop. Without more information it's hard to say which direction would be best for you.

Answer 2

I understand what you're asking, but JPA/Hibernate isn't going to want to cache that much data for you, or at least I wouldn't expect a guarantee from it. Consider that you described 5 million records. What is the average length per record? 100 bytes gives 500 megabytes of memory, which will just crash your untweaked JVM. Probably more like 5000 bytes on average, and that's 25 GB of memory. You need to think about what you're asking for.
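The arithmetic behind those two figures can be checked with a few lines of Java. This is only a back-of-the-envelope sketch: the class and method names are made up for illustration, the record sizes are the guesses from the paragraph above, and it counts raw payload only, so JVM object overhead would push the real footprint higher.

```java
// Back-of-the-envelope estimate; MemoryEstimate and estimateBytes are illustrative names.
class MemoryEstimate {

    /** Raw payload size in bytes; throws ArithmeticException on long overflow. */
    public static long estimateBytes(long records, long avgBytesPerRecord) {
        return Math.multiplyExact(records, avgBytesPerRecord);
    }

    public static void main(String[] args) {
        long records = 5_000_000L; // record count stated in the question
        for (long avgBytes : new long[] {100L, 5_000L}) { // guessed averages from the answer
            long bytes = estimateBytes(records, avgBytes);
            // Decimal megabytes (1 MB = 1,000,000 bytes), matching the figures in the text.
            System.out.printf("%d bytes/record -> %d MB%n", avgBytes, bytes / 1_000_000L);
        }
    }
}
```

With 100 bytes per record this comes to 500 MB, and with 5000 bytes per record to 25,000 MB, i.e. 25 GB.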

If you want it cached, you should do that yourself or, better yet, just use the results when you have them. If you want memory-based data access, you should look at a technology built specifically for that. http://www.ehcache.org/ seems popular, but it's up to you, and you should be sure you understand your use case first.

If you are trying to be database efficient, then you should just understand what you're doing, and design and test carefully.

Answer 3

Basically, it should be a pretty easy task to load entire tables with one query per table and link the objects, but JPA works differently, as shown in this example.

The biggest problem is @OneToMany / @ManyToMany relations:

@Entity
public class Employee {
    @Id
    @Column(name="EMP_ID")
    private long id;
    ...
    @OneToMany(mappedBy="owner")
    private List<Phone> phones;
    ...
}
@Entity
public class Phone {
    @Id
    private long id;    
    ...
    @ManyToOne
    @JoinColumn(name="OWNER_ID")
    private Employee owner;
    ...
}

FetchType.EAGER

If the relation is defined as FetchType.EAGER and the query is SELECT e FROM Employee e, Hibernate generates the SQL statement SELECT * FROM EMPLOYEE and right after it SELECT * FROM PHONE WHERE OWNER_ID=? for every single Employee loaded, commonly known as the 1+n problem.

I could avoid the 1+n problem by using the JPQL query SELECT e FROM Employee e JOIN FETCH e.phones, which results in something like SELECT * FROM EMPLOYEE INNER JOIN PHONE ON EMP_ID = OWNER_ID (a LEFT JOIN FETCH would produce a LEFT OUTER JOIN and also keep Employees without Phones).

The problem is that this won't work for a complex data model with ~20 tables involved.

FetchType.LAZY

If the relation is defined as FetchType.LAZY, the query SELECT e FROM Employee e will just load all Employees with uninitialized phones collections and load the related Phones only when phones is accessed, which in the end leads to the 1+n problem as well.

To avoid this, the obvious approach is to just load all the Phones into the same session with SELECT p FROM Phone p. But when accessing phones, Hibernate will still execute SELECT * FROM PHONE WHERE OWNER_ID=?, because Hibernate doesn't know that all Phones are already in its current session.

Even when using the 2nd level cache, the statement will be executed on the DB, because Phone is indexed by its primary key in the 2nd level cache and not by OWNER_ID.

Conclusion

There is no mechanism like "just load all data" in Hibernate.

It seems there is no other way than to keep the relationships transient and connect them manually, or even to just use plain old JDBC.
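As a rough illustration of what "connect them manually" could look like, here is a minimal plain-Java sketch. It assumes the rows of both tables have already been fetched with one query per table (via JPA with @Transient relationship fields, or via JDBC); the class names ManualLinker, EmployeeRow and PhoneRow are hypothetical and not part of the original mapping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: links already-loaded rows in memory without touching the DB.
class ManualLinker {

    static class EmployeeRow {
        final long id;                                   // EMP_ID
        final List<PhoneRow> phones = new ArrayList<>(); // filled by link()
        EmployeeRow(long id) { this.id = id; }
    }

    static class PhoneRow {
        final long id;
        final long ownerId; // value of the OWNER_ID column
        EmployeeRow owner;  // filled by link()
        PhoneRow(long id, long ownerId) { this.id = id; this.ownerId = ownerId; }
    }

    /** Wires both sides of the relationship in one pass; returns employees indexed by id. */
    static Map<Long, EmployeeRow> link(List<EmployeeRow> employees, List<PhoneRow> phones) {
        Map<Long, EmployeeRow> byId = new HashMap<>();
        for (EmployeeRow e : employees) {
            byId.put(e.id, e);
        }
        for (PhoneRow p : phones) {
            EmployeeRow owner = byId.get(p.ownerId);
            if (owner == null) {
                continue; // phone without a loaded owner: skipped in this sketch
            }
            p.owner = owner;
            owner.phones.add(p);
        }
        return byId;
    }

    public static void main(String[] args) {
        List<EmployeeRow> employees = Arrays.asList(new EmployeeRow(1), new EmployeeRow(2));
        List<PhoneRow> phones = Arrays.asList(
                new PhoneRow(10, 1), new PhoneRow(11, 1), new PhoneRow(12, 2));
        Map<Long, EmployeeRow> byId = link(employees, phones);
        System.out.println(byId.get(1L).phones.size()); // 2
        System.out.println(byId.get(2L).phones.size()); // 1
    }
}
```

The point of the index by EMP_ID is that each table is read once and each row is touched once, which is the "one query per table" behaviour that Hibernate does not offer out of the box.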

EDIT:

I just found a solution which works very well. I defined all relevant @ManyToMany and @OneToMany relations as FetchType.EAGER combined with @Fetch(FetchMode.SUBSELECT), and all @ManyToOne relations with @Fetch(FetchMode.JOIN), which results in an acceptable loading time. In addition to adding javax.persistence.Cacheable(true) to all entities, I added org.hibernate.annotations.Cache to every relevant collection, which enables collection caching in the 2nd level cache. I disabled the 2nd level cache timeout eviction and "warm up" the 2nd level cache via a @Singleton EJB combined with @Startup on server start / deploy. Now I have 100% control over the cache, and there are no further DB calls until I manually clear it.
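Applied to the Employee / Phone example, the mapping described above might look roughly like the following fragment. It is a sketch, not the exact code from this answer: each class would live in its own file, the READ_ONLY concurrency strategy and the CacheWarmer class name are assumptions, and the 2nd level cache is assumed to be already enabled for the persistence unit with its eviction timeout disabled. It needs the JPA, EJB and Hibernate APIs on the classpath.

```java
// Imports used across the three files sketched below.
import java.util.List;
import javax.annotation.PostConstruct;
import javax.ejb.Singleton;
import javax.ejb.Startup;
import javax.persistence.Cacheable;
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.FetchType;
import javax.persistence.Id;
import javax.persistence.JoinColumn;
import javax.persistence.ManyToOne;
import javax.persistence.OneToMany;
import javax.persistence.PersistenceContext;
import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;
import org.hibernate.annotations.Fetch;
import org.hibernate.annotations.FetchMode;

// --- Employee.java ---
@Entity
@Cacheable(true)
public class Employee {
    @Id
    @Column(name = "EMP_ID")
    private long id;

    // Eager collection, fetched with one additional subselect instead of one select per Employee.
    @OneToMany(mappedBy = "owner", fetch = FetchType.EAGER)
    @Fetch(FetchMode.SUBSELECT)
    // Enables collection caching; READ_ONLY is an assumption, the answer names no strategy.
    @Cache(usage = CacheConcurrencyStrategy.READ_ONLY)
    private List<Phone> phones;
}

// --- Phone.java ---
@Entity
@Cacheable(true)
public class Phone {
    @Id
    private long id;

    // The owning Employee is fetched via a join in the same statement.
    @ManyToOne
    @Fetch(FetchMode.JOIN)
    @JoinColumn(name = "OWNER_ID")
    private Employee owner;
}

// --- CacheWarmer.java (hypothetical name) ---
@Singleton
@Startup
public class CacheWarmer {
    @PersistenceContext
    private EntityManager em;

    // Runs once on server start / deploy; loading the root entities pulls in the eager relations.
    @PostConstruct
    void warmUp() {
        em.createQuery("SELECT e FROM Employee e", Employee.class).getResultList();
    }
}
```

In a model with ~20 tables, the warm-up method would issue one such query per root entity type; which entities count as roots depends on the actual object graph.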

3 answers

Answer 1
6 2018-11-12 18:49:02

Answer 2
5 2018-10-29 17:26:35

Answer 3
2 accepted 2018-11-16 07:03:14
