Hive 执行动态插入分区时，在MapReduce日志中报“java.lang.OutOfMemoryError: GC overhead limit exceeded”错误

执行动态插入分区时，在MapReduce日志中报“java.lang.OutOfMemoryError: GC overhead limit exceeded”错误

在HiveServer服务正常的情况下，执行动态插入分区时，在MapReduce日志中报“java.lang.OutOfMemoryError: GC overhead limit exceeded”错误。

产生OOM的原因是单个任务处理的分区数过多，需要针对具体场景，减少单个task处理的分区数。

参照如下样例进行操作。

样例建表语句如下：

create table test(id int )partitioned by (dt int);

create table test1(id int, dt int);

正常的动态插入分区语句为：

insert overwrite table test partition (dt) select id, dt from test;

由于dt是分区字段，减少单个task处理分区数的办法是，将分区字段distribute到不同的task来处理。修改后的语句： insert overwrite table test partition (dt) select id, dt from test1 distribute by dt；
当distribute by的分区字段存在倾斜时，比如值为NULL的占了很大部分，那么还可以将其打散处理。存在倾斜字段为NULL时的优化后语句： insert overwrite table test partition (dt) select id, dt from test1 distribute by nvl(dt,round(rand()*50)); 说明： nvl函数是一个将null转换为需要的值的hive内置udf。内置udf的使用，可参考https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF。其中rand返回一个0-1之间的随机数，乘以一个常数50（也可以是其他数字，根据自己任务的并发度合理选取，以能在合理的时间处理完为宜）。然后通过round函数取整，就能够将值为NULL的分区，分散到多个不同的task中处理。