
Why is my Code Repo warning me not to use union and instead use unionByName?

My repository warns me against using union and suggests unionByName instead. Aren't these the same thing? Why should I care which one I use?

The PySpark docs note the following about union:

Also as standard in SQL, this function resolves columns by position (not by name).

This is dangerous in most cases: if your schemas have the same types but different names or purposes, you may silently merge incompatible data. For example, if schema1 is [('col1', T.IntegerType()), ('col2', T.StringType())] and schema2 is [('col3', T.IntegerType()), ('col4', T.StringType())], union merges them successfully even though col1 and col3 have fundamentally different meanings, as may col2 and col4.
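To make the positional behaviour concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark: a "table" is just a pair of column names and rows, and union_by_position is a hypothetical helper that mimics what the docs describe for union. The column names come from the example above; the row values are made up.

```python
# Toy model in plain Python, not Spark itself. A "table" is a
# (column_names, rows) pair; union_by_position is a hypothetical helper.

def union_by_position(left, right):
    """Append right's rows to left's, pairing columns by position only."""
    left_cols, left_rows = left
    right_cols, right_rows = right
    if len(left_cols) != len(right_cols):
        raise ValueError("tables must have the same number of columns")
    # right_cols is never consulted: the result keeps left's names.
    return list(left_cols), list(left_rows) + list(right_rows)


# Column names from the example above; the row values are made up.
table1 = (["col1", "col2"], [(1, "a")])
table2 = (["col3", "col4"], [(99, "z")])

merged_cols, merged_rows = union_by_position(table1, table2)
print(merged_cols)
print(merged_rows)
```

Nothing fails here: the col3 and col4 values end up filed under col1 and col2, which is the silent merge described above.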

This differs from unionByName, whose documentation states:

The difference between this function and union() is that this function resolves columns by name (not by position)

Because it matches columns by name rather than by position, unionByName is the safer way to combine DataFrames in most cases, which is why it is preferred.
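A name-based union can be sketched the same way. Again, this is a plain-Python toy model rather than Spark's implementation, and union_by_name is a hypothetical helper. It assumes both tables must carry exactly the same set of column names, which matches the default behaviour of unionByName as I understand it (an unresolvable column is an error).

```python
# Toy model in plain Python, not Spark itself. A "table" is a
# (column_names, rows) pair; union_by_name is a hypothetical helper.

def union_by_name(left, right):
    """Append right's rows to left's, pairing columns by name."""
    left_cols, left_rows = left
    right_cols, right_rows = right
    if sorted(left_cols) != sorted(right_cols):
        raise ValueError(
            f"column names differ: {left_cols} vs {right_cols}"
        )
    # For each left column, find where that name sits in the right table.
    positions = [right_cols.index(name) for name in left_cols]
    reordered = [tuple(row[i] for i in positions) for row in right_rows]
    return list(left_cols), list(left_rows) + reordered


table1 = (["col1", "col2"], [(1, "a")])
same_names_reordered = (["col2", "col1"], [("b", 2)])
different_names = (["col3", "col4"], [(99, "z")])

# Same names in a different order: values are realigned by name.
cols, rows = union_by_name(table1, same_names_reordered)
print(cols, rows)

# Different names: the mismatch is reported instead of merged silently.
try:
    union_by_name(table1, different_names)
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)
```

The two cases show why the name-based version is safer: reordered columns still line up correctly, and mismatched schemas fail loudly rather than producing mixed data.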
