When working with Apache Spark, I often like to test things out locally so I don’t have to worry about compiling the full project or running everything through a test suite. Apache Spark ships with the Spark Shell, which is essentially a Scala REPL with a ready-made SparkSession instance, making it easy to explore and quickly test out your code.
But can we do better?
Ammonite is just like the Scala REPL, but on steroids! It has some very nice features, like easier code navigation, importing external libraries straight from the REPL, and more. I’m not going to go over all of Ammonite’s features here; you can learn more on the Ammonite website.
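To give a taste of the library-import feature: the following, typed at an Ammonite prompt, fetches a library from Maven Central on the fly (upickle is just an arbitrary example here; any published artifact works):
import $ivy.`com.lihaoyi::upickle:1.4.0`
upickle.default.write(Map("hello" -> "world")) // returns the JSON string {"hello":"world"}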
Here are the instructions for setting up your local Spark with Ammonite. Note that the command below installs the Scala 2.12 build of Ammonite 2.4.0, matching the Scala version of the Spark artifacts we import next.
$ sudo sh -c '(echo "#!/usr/bin/env sh" && curl -L https://github.com/com-lihaoyi/Ammonite/releases/download/2.4.0/2.12-2.4.0) > /usr/local/bin/amm && chmod +x /usr/local/bin/amm' && amm
spark_session.sc
import ammonite.ops._
// Fetch the Spark artifacts from Maven Central via Ammonite's $ivy imports.
import $ivy.`org.apache.spark:spark-sql_2.12:2.4.4`
import $ivy.`org.apache.spark:spark-core_2.12:2.4.4`
import $ivy.`org.apache.spark:spark-avro_2.12:2.4.4`
import org.apache.spark.sql.SparkSession
// A local SparkSession for interactive exploration.
val spark: SparkSession = SparkSession.builder().master("local").getOrCreate()
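As a side note, the builder accepts standard Spark settings if the single-threaded local master is too limiting; a small sketch (the app name and partition count are just illustrative choices):
// Illustrative variant of the session above.
val spark: SparkSession = SparkSession.builder()
  .master("local[*]")                          // use all available cores
  .appName("ammonite-spark-playground")        // any name you like
  .config("spark.sql.shuffle.partitions", "4") // fewer shuffle partitions suit small local data
  .getOrCreate()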
$ amm
Execute this in the same directory as spark_session.sc, then load the script from the REPL:
import $exec.spark_session
That’s it! You can now start working with Spark.
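As a quick sanity check, you can try a throwaway DataFrame right at the prompt:
spark.version                     // should print 2.4.4, matching the $ivy imports above
val df = spark.range(5).toDF("n") // a tiny DataFrame with values 0..4
df.show()                         // renders the five rows in the console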
Let’s edit the spark_session.sc file and add some additional code, for example helper methods to read data in common formats.
def loadAvro(path: String) = spark.read.format("avro").load(path) // needs the spark-avro module imported above
def loadParquet(path: String) = spark.read.format("parquet").load(path)
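Following the same pattern, you could add helpers for other formats; a sketch (the header option is an assumption about your CSV files):
def loadCsv(path: String) = spark.read.format("csv").option("header", "true").load(path) // hypothetical helper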
And use them.
$ amm
import $exec.spark_session
val df = loadParquet("/tmp/parquet_data")
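From here the full DataFrame API is at your fingertips; for instance, writing results back out (the output path is illustrative):
df.write.mode("overwrite").format("avro").save("/tmp/avro_out") // works thanks to the spark-avro import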