
Loading data from a delimited .dat file and transforming it into a columnar DataFrame in Scala



My .dat file contains a custom header of the format 'ABCYYYYMMDD' and a footer of the format 'A1234'. There is no column header row.

The records are delimited by "|" and have 12 fields. To remove the header and footer, I'm using the following code:

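// Read the file as an RDD of lines
val fileDF = sc.textFile(filedirectory)

val total = fileDF.count()
// Drop the first line (header) and the last line (footer) by index
val fileRdd = fileDF.zipWithIndex().filter(x => x._2 != 0).filter(x => x._2 != total - 1).map(x => x._1)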

After this, if I try to split the data using

.map(x => x.split("|"))

every character of the string in each column gets split too.
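
The per-character split is consistent with how String.split works: its argument is a regular expression, and an unescaped "|" is the alternation operator between two empty patterns, so it matches between every character. A minimal sketch of the difference, assuming Java 8 or later and using a made-up sample record (the object name and the sample line are illustrative only):

```scala
import java.util.regex.Pattern

object PipeSplitSketch {
  // Hypothetical sample record, not taken from the original .dat file.
  val line: String = "a1|b2||d4|"

  // Unescaped "|" is regex alternation of two empty patterns, so it matches
  // at every position and the string is broken into single characters.
  val broken: Array[String] = line.split("|")

  // Escaping the pipe makes it a literal delimiter. The limit -1 keeps
  // trailing empty fields, which matters for fixed-width records.
  val escaped: Array[String] = line.split("\\|", -1)
  // Array("a1", "b2", "", "d4", "")

  // Pattern.quote is an alternative when the delimiter comes from a variable.
  val quoted: Array[String] = line.split(Pattern.quote("|"), -1)
}
```

Without the -1 limit, trailing empty fields are dropped, so a record whose last columns are empty would come back with fewer than 12 elements.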

I ultimately want to convert the RDD into a DataFrame and then check for duplicates on the combination of the first and second columns.

Best answer
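
One possible approach, sketched under assumptions rather than given as a definitive solution: escape the pipe when splitting, build Rows against an explicit 12-column string schema, and group on the first two columns to find repeated key pairs. It assumes Spark 2.x or later (SparkSession). The column names c1 to c12 and the helper names load and duplicateKeys are hypothetical, since the file has no column header.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object DatToDataFrameSketch {
  // Hypothetical column names c1..c12; the source file defines none.
  val schema: StructType =
    StructType((1 to 12).map(i => StructField(s"c$i", StringType, nullable = true)))

  // Reads the file, drops the header (first line) and footer (last line),
  // splits each record on a literal pipe, and returns a 12-column DataFrame.
  def load(spark: SparkSession, path: String): DataFrame = {
    val lines = spark.sparkContext.textFile(path)
    val total = lines.count()
    val rows = lines.zipWithIndex()
      .filter { case (_, idx) => idx != 0 && idx != total - 1 }
      .map { case (line, _) => line.split("\\|", -1) } // -1 keeps trailing empty fields
      .filter(_.length == 12)                          // skip malformed records
      .map(fields => Row.fromSeq(fields.toSeq))
    spark.createDataFrame(rows, schema)
  }

  // Returns one row per (c1, c2) pair that occurs more than once,
  // together with the number of occurrences in a "count" column.
  def duplicateKeys(df: DataFrame): DataFrame =
    df.groupBy("c1", "c2").count().filter(col("count") > 1)
}
```

Calling DatToDataFrameSketch.duplicateKeys(df).show() lists the offending key pairs. If the goal is to remove the duplicates rather than report them, df.dropDuplicates("c1", "c2") keeps one row per pair, although which row survives is not guaranteed.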