Spark Accumulator throws a class cast exception when trying to count the number of records in the dataset

Question

I am trying to count the number of records in my dataset using an accumulator, with the logic below.

    import org.apache.spark.storage.StorageLevel

    val accum = sc.longAccumulator("My_Accum")
    // foreach is an action returning Unit; the accumulator is updated as a side effect
    val fRDD = tempDS.rdd.persist(StorageLevel.MEMORY_AND_DISK).foreach(x => {
      accum.add(1)   // the ClassCastException is thrown here
      x
    })

    val recordCount = accum.value

    println("record Count is : "+recordCount)

I am getting a class cast exception on the line accum.add(1): java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to packagename.CaseClassName.

With the same logic, I was able to get the accumulator value in a previous step.

Can someone please help me resolve this? Also, is there any other way to count the records besides count() and accumulators?

Best answer

I don't think the problem is the accumulator; it seems that Spark can't convert your DataFrame, i.e. Dataset[Row], into a Dataset[packagename.CaseClassName] (a case class you don't show in your code).
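
As a rough sketch of the conversion that is failing here, assuming a spark-shell style session with a SparkSession named spark in scope and a hypothetical case class Record standing in for packagename.CaseClassName: the typed conversion only works when the case class's field names and types match the DataFrame schema.

    // Hypothetical case class standing in for packagename.CaseClassName;
    // its field names and types must match the DataFrame schema exactly.
    case class Record(id: Long, name: String)

    import spark.implicits._  // brings the encoders into scope

    // A DataFrame, i.e. Dataset[Row], with a matching schema
    val df = Seq((1L, "a"), (2L, "b")).toDF("id", "name")

    // The typed conversion; mismatches between the schema and the case class
    // are what typically surface later as cast/encoder errors like the one above
    val ds = df.as[Record]

If your tempDS was built this way, it is worth checking that the case class really lines up with the underlying schema, because the cast only happens when the rows are deserialized (for example when you call .rdd or foreach).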

Besides, this is a really unconventional way to count the number of rows, and I would NOT recommend it. The fastest way is .count() on your Dataset; converting to an RDD first is slower in most cases.
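
For what it's worth, a minimal sketch of both options, reusing sc and tempDS from your snippet (names assumed from the question):

    // Simplest and usually fastest: count directly on the Dataset
    val recordCount: Long = tempDS.count()

    // If you still want an accumulator, foreach on the Dataset itself
    // works without the extra .rdd conversion or persist
    val accum = sc.longAccumulator("row_count")
    tempDS.foreach(_ => accum.add(1))
    println(s"record count: ${accum.value}")

Keep in mind that accumulator updates inside actions are only guaranteed to be applied once per task if the job succeeds, so .count() is both simpler and more reliable for this purpose.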