NullPointerException in joinedRow.isNullAt When Executing a Spark SQL Statement
Symptom
When a Spark SQL statement is executed, a NullPointerException is thrown from "joinedRow.isNullAt". The exception information is as follows:
6/09/08 11:04:11 WARN TaskSetManager: Lost task 1.0 in stage 7.0 (TID 10, vm1, 1): java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.JoinedRow.isNullAt(JoinedRow.scala:70)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown Source)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator$$anonfun$generateProcessRow$1.apply(TungstenAggregationIterator.scala:194)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator$$anonfun$generateProcessRow$1.apply(TungstenAggregationIterator.scala:192)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:372)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:626)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:135)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$3.apply(TungstenAggregate.scala:144)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$3.apply(TungstenAggregate.scala:144)
at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:267)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:267)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:75)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:42)
at org.apache.spark.scheduler.Task.run(Task.scala:90)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:253)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Possible Causes
The log information above indicates that the error is caused by insufficient memory: when the buffer requests memory, the allocation fails and returns null, and a subsequent operation on that null reference throws the NullPointerException.
This happens when the memory-related configuration items of the cluster are set to small values, for example:
spark.executor.cores = 8
spark.executor.memory = 512M
spark.buffer.pageSize = 16M
With these settings, the memory allocation fails and returns null during task execution. The key log is as follows:
6/09/08 11:04:11 WARN TaskSetManager: Lost task 1.0 in stage 7.0 (TID 10, vm1, 1): java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.JoinedRow.isNullAt(JoinedRow.scala:70)
Troubleshooting Approach
When Spark SQL is used, the following condition must be met:

spark.executor.memory * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction / (num * spark.executor.cores) > spark.buffer.pageSize

The default value of "spark.shuffle.memoryFraction" is 0.2, the default value of "spark.shuffle.safetyFraction" is 0.8, and the default value of "spark.buffer.pageSize" is 16M.
The empirical value of the constant num is 8, though the actual value depends on the SQL statement. Each task can request a pageSize-sized allocation at most 16 times, so the maximum value of num is 16. Setting num to 16 in the formula covers every scenario in which this Spark SQL problem can occur; in the vast majority of cases, however, 8 is sufficient.
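The condition above can be checked numerically. The sketch below is illustrative (the helper name and the choice num = 8 are assumptions, and the default fraction values are hard-coded); it plugs in the small configuration shown earlier (512M executor memory, 8 cores, 16M page size) and shows why the allocation fails:

```python
# Estimate the memory available to a single page allocation per core,
# using the condition from the troubleshooting approach above.
def page_memory_per_core_mb(executor_memory_mb, executor_cores,
                            num=8,
                            shuffle_memory_fraction=0.2,   # spark.shuffle.memoryFraction default
                            shuffle_safety_fraction=0.8):  # spark.shuffle.safetyFraction default
    """Memory (MB) available for one page allocation on one core."""
    return (executor_memory_mb * shuffle_memory_fraction
            * shuffle_safety_fraction) / (num * executor_cores)

PAGE_SIZE_MB = 16  # spark.buffer.pageSize default

# Failing configuration from "Possible Causes": 512M memory, 8 cores.
available = page_memory_per_core_mb(512, 8)   # about 1.28 MB per allocation
print(available > PAGE_SIZE_MB)               # False: the page request cannot be satisfied
```

A result below the page size means the buffer's memory request returns null, which is exactly the condition that triggers the NullPointerException in the stack trace.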
Procedure
Based on the message in the executor log, you can solve this problem by adjusting the following two parameters in the "spark-defaults.conf" configuration file on the client:
- spark.executor.memory: Increase the executor memory, that is, increase the value of "spark.executor.memory" based on the actual service volume. The following condition must be met: spark.executor.memory > spark.buffer.pageSize * (num * spark.executor.cores) / spark.shuffle.memoryFraction / spark.shuffle.safetyFraction
- spark.executor.cores: Decrease the number of executor cores, that is, decrease the value of "spark.executor.cores". The following condition must be met: spark.executor.cores < spark.executor.memory / spark.buffer.pageSize / num * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction
Whichever parameter you adjust, the condition spark.executor.memory * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction / (num * spark.executor.cores) > spark.buffer.pageSize must hold. If memory is sufficient, you are advised to set the constant num to 16 directly, which resolves the memory problem in all scenarios.
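As a hypothetical illustration (the values below are examples chosen to satisfy the condition under the default fractions, not tuning recommendations for any specific workload), a "spark-defaults.conf" fragment that meets the formula with num = 8 might look like:

```
spark.executor.cores = 4
spark.executor.memory = 4G
spark.buffer.pageSize = 16M
```

With these values, 4096M * 0.2 * 0.8 / (8 * 4) = 20.48M per allocation, which is greater than the 16M page size, so the buffer's memory request succeeds.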