Apache Spark: Convert CSV to RDD

Below is a simple Spark / Scala example showing how to convert a CSV file into an RDD and perform some simple filtering.
The example transforms each line of the CSV into a Map of the form header-name -> data-value: each key is a header name, and each value is that column's entry on the given line.
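The per-line transformation can be sketched in plain Scala, outside of Spark. The header names and sample line below are taken from the sample CSV further down:

```scala
// Header row and one data line from the sample CSV
val header = Array("user", "position", "age")
val line   = "me, software developer, 35"

// Split on commas and trim the surrounding whitespace
val values = line.split(",").map(_.trim)

// Pair each header name with the value in the same column
val row: Map[String, String] = header.zip(values).toMap
// row("position") == "software developer"
```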

This particular example also assumes that the header information is contained on the first line of the CSV file. Note, however, that many data files exclude the header row entirely.
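When the file has no header row, the column names can be supplied separately and zipped in exactly the same way; there is then no header line to filter out. A minimal sketch (the column names and sample lines are illustrative):

```scala
// Column names supplied outside the data file
val header = Array("user", "position", "age")

// With no header row in the file, every line is data
val lines = Seq("jimmy, baker, 22", "alice, ceo, 51")
val rows = lines.map(l => header.zip(l.split(",").map(_.trim)).toMap)
// rows.head("user") == "jimmy"
```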

Data File
Sample CSV contents
[code language="Bash"]
user, position, age
me, software developer, 35
jimmy, baker, 22
alice, ceo, 51
[/code]

Spark / Scala Transform
Convert the CSV to an RDD, then print the result. Here, csv, headerAndRows, data, maps, and result are all RDDs; header is a local Array.
[code language="Scala"]
import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("CsvToRdd").setMaster("local[*]")
  val sc = new SparkContext(conf)
  // Read the CSV file
  val csv = sc.textFile("/path/to/your/file.csv")
  // Split each line on commas and trim the whitespace
  val headerAndRows = csv.map(line => line.split(",").map(_.trim))
  // Grab the header row
  val header = headerAndRows.first
  // Filter out the header (i.e. keep rows whose first value differs from the first header name)
  val data = headerAndRows.filter(_(0) != header(0))
  // Zip each row with the header to build header -> value maps
  val maps = data.map(splits => header.zip(splits).toMap)
  // Filter out the user "me"
  val result = maps.filter(map => map("user") != "me")
  // Print the result
  result.foreach(println)
}
[/code]

Thank you!
