Apache Spark: Convert CSV to RDD

Below is a simple Spark / Scala example showing how to convert a CSV file into an RDD and perform some basic filtering.
The example transforms each line of the CSV into a Map of the form header-name -> data-value: each key is a header name, and each value is that column's entry on the given line.

This particular example also assumes that the header information is contained on the first line of the CSV file. Keep in mind, however, that headers are often excluded from data files entirely.
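If your file has no header row, one option is to define the headers yourself and zip them against each parsed line. A minimal sketch of that idea, shown here on a plain Scala collection so it runs without Spark (in a real job the same map would run on the RDD returned by sc.textFile; the header names below are hypothetical):

```scala
object NoHeaderExample {
  def main(args: Array[String]): Unit = {
    // Headers defined by hand because the file has none (hypothetical names)
    val header = Array("user", "position", "age")
    // Stand-in for the lines Spark would read from the data file
    val lines = Seq("jimmy, baker, 22", "alice, ceo, 51")
    // Same split/trim/zip/toMap logic as the RDD version below
    val maps = lines.map(line => header.zip(line.split(",").map(_.trim)).toMap)
    maps.foreach(println)
  }
}
```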

Data File

Sample CSV contents

user, position, age
me, software developer, 35
jimmy, baker, 22
alice, ceo, 51
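To make the header-name -> data-value idea concrete, here is what the transform produces for a single sample line (plain Scala, no Spark needed):

```scala
object LineToMapExample {
  def main(args: Array[String]): Unit = {
    // Split on commas and trim the surrounding whitespace
    val header = "user, position, age".split(",").map(_.trim)
    val row = "jimmy, baker, 22".split(",").map(_.trim)
    // zip pairs each header name with the value in the same column
    val map = header.zip(row).toMap
    println(map("position")) // baker
  }
}
```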

Spark / Scala Transform
Convert the CSV to an RDD, then print the result. In this example, each intermediate val except header (a plain Array[String]) refers to an RDD.

    import org.apache.spark.{SparkConf, SparkContext}

    object CsvToRdd {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("CsvToRdd").setMaster("local[*]")
        val sc = new SparkContext(conf)
        // Read the CSV file
        val csv = sc.textFile("/path/to/your/file.csv")
        // Split each line on commas and trim whitespace
        val headerAndRows = csv.map(line => line.split(",").map(_.trim))
        // Grab the header row
        val header = headerAndRows.first
        // Filter out the header (i.e. drop any row whose first value matches the first header name)
        val data = headerAndRows.filter(_(0) != header(0))
        // Zip each row with the header to build a Map of header-name -> value
        val maps = data.map(splits => header.zip(splits).toMap)
        // Filter out the user "me"
        val result = maps.filter(map => map("user") != "me")
        // Print the result
        result.foreach(println)
        sc.stop()
      }
    }
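For anyone curious about what the job prints, the same pipeline can be traced on a plain Scala collection (note that the real RDD's foreach may print in any order across partitions):

```scala
object PipelineWalkthrough {
  def main(args: Array[String]): Unit = {
    // Local stand-in for the lines Spark would read from the file
    val csv = Seq(
      "user, position, age",
      "me, software developer, 35",
      "jimmy, baker, 22",
      "alice, ceo, 51")
    val headerAndRows = csv.map(line => line.split(",").map(_.trim))
    val header = headerAndRows.head // stands in for RDD.first
    val data = headerAndRows.filter(_(0) != header(0))
    val maps = data.map(splits => header.zip(splits).toMap)
    // "me" is dropped, leaving the jimmy and alice rows
    val result = maps.filter(map => map("user") != "me")
    result.foreach(println)
  }
}
```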

Thank you!

3 Responses

  1. Aamir says:

    What should be done when the columns are not labeled? Say you want to filter column number 2 where the string does not contain ceo

  2. SparkNewbie! says:

    Output?