Apache Spark: Convert CSV to RDD

Below is a simple Spark / Scala example showing how to convert a CSV file to an RDD and perform some simple filtering.
The example transforms each line of the CSV into a Map from header name to data value: each key is a column header, and each value is that column's entry on the given line.

This particular example also assumes that the header information is contained on the first line of the CSV file. Keep in mind, however, that in many cases the headers are excluded from the data file entirely, in which case you would need to supply the column names yourself.
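When the file has no header row, one approach (a sketch only; the column names below are hypothetical) is to declare the headers in code and zip them against each split line, exactly as the main example does with the header read from the file:

```scala
// Sketch, assuming the CSV has NO header row: supply the column
// names yourself (these names are hypothetical) and zip them
// against each trimmed line to build the header -> value Map.
val header = Array("user", "position", "age")
val line = "me, software developer, 35"
val row: Map[String, String] =
  header.zip(line.split(",").map(_.trim)).toMap
// row("position") now holds "software developer"
```

With Spark you would simply map this over the RDD, e.g. `csv.map(l => header.zip(l.split(",").map(_.trim)).toMap)`.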

Data File
Sample CSV contents

user, position, age
me, software developer, 35
jimmy, baker, 22
alice, ceo, 51

Spark / Scala Transform
CSV to RDD, then print the result. In this example, csv, headerAndRows, data, maps and result are all RDDs; header is a plain Array, since first collects a single row back to the driver.

    def main(args: Array[String]): Unit = {
      // sc is an existing SparkContext
      // Read the CSV file
      val csv = sc.textFile("/path/to/your/file.csv")
      // Split each line on commas and trim the whitespace
      val headerAndRows = csv.map(line => line.split(",").map(_.trim))
      // Get the header row
      val header = headerAndRows.first
      // Filter out the header (i.e. just check whether the first value matches the first header name)
      val data = headerAndRows.filter(_(0) != header(0))
      // Zip each row with the header to build a Map of header/value pairs
      val maps = data.map(splits => header.zip(splits).toMap)
      // Filter out the user "me"
      val result = maps.filter(map => map("user") != "me")
      // Print the result
      result.foreach(println)
    }
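One caveat about the header filter above: comparing `_(0) != header(0)` will also drop any data row whose first column happens to equal the first header name ("user"). An alternative sketch is to drop the row by position using zipWithIndex, which exists on both Scala collections and Spark RDDs:

```scala
// Sketch: drop the header row by index rather than by value.
// zipWithIndex pairs each element with its index; keep index > 0.
val lines = Seq(
  "user, position, age",
  "me, software developer, 35",
  "jimmy, baker, 22"
)
val dataRows = lines.zipWithIndex
  .filter { case (_, idx) => idx > 0 }
  .map { case (line, _) => line.split(",").map(_.trim) }
// dataRows.head is Array("me", "software developer", "35")
```

On an RDD the same pattern reads `csv.zipWithIndex.filter(_._2 > 0).map(_._1)`, at the cost of an extra pass over the data to assign indices.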
Thank you!
