Apache Spark: Convert CSV to RDD
Below is a simple Spark / Scala example showing how to convert a CSV file to an RDD and perform some simple filtering.
This example transforms each line in the CSV into a Map of the form header-name -> data-value. Each map key corresponds to a header name, and each value corresponds to that column's value on the specific line.
This particular example also assumes that the header information is contained on the first line of the CSV file. In many cases, however, the header is excluded from the data file and has to be defined separately.
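As a minimal sketch of that per-line mapping (plain Scala, no Spark required; the header names and values are taken from the sample file below), zipping the header array with a row's values yields exactly such a Map:

val header = Array("user", "position", "age")
val row    = Array("jimmy", "baker", "22")
// Pair each header name with the value in the same column, then build a Map
val rowMap = header.zip(row).toMap
// rowMap: Map(user -> jimmy, position -> baker, age -> 22)
println(rowMap("position")) // prints: baker

If your data file does not include a header line, you can define the header array by hand and zip it the same way.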
Data File
Sample CSV contents
user, position, age
me, software developer, 35
jimmy, baker, 22
alice, ceo, 51
Spark / Scala Transform
CSV to RDD, then print the result. In this example, each val except header refers to an RDD; header is a plain Array[String], because first returns a single element to the driver rather than another RDD.
import org.apache.spark.{SparkConf, SparkContext}

object CsvToRdd {
  def main(args: Array[String]): Unit = {
    // Set up the SparkContext (the original snippet assumes an existing sc)
    val conf = new SparkConf().setAppName("CsvToRdd").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Read the CSV file
    val csv = sc.textFile("/path/to/your/file.csv")
    // Split each line on commas and trim the whitespace
    val headerAndRows = csv.map(line => line.split(",").map(_.trim))
    // Get the header
    val header = headerAndRows.first
    // Filter out the header (e.g. just check if the first value matches the first header name)
    val data = headerAndRows.filter(_(0) != header(0))
    // Zip each row with the header to build the header/value Maps
    val maps = data.map(splits => header.zip(splits).toMap)
    // Filter out the user "me"
    val result = maps.filter(map => map("user") != "me")
    // Print the result
    result.foreach(println)
    sc.stop()
  }
}
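Run against the sample file above, this should print something like the following for the two remaining rows (row order may vary, since foreach executes on the workers; on a real cluster you would typically call result.collect before printing on the driver):

Map(user -> jimmy, position -> baker, age -> 22)
Map(user -> alice, position -> ceo, age -> 51)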