Setting Up Eclipse to run Spark using Scala


All the steps given below are based on the Mac version of Eclipse.

As of this writing, these are the versions I am using:
Platform : macOS
OS       : 10.12.5 (macOS Sierra)
Eclipse  : Mars.2 Release (4.5.2), Eclipse Java EE IDE for Web Developers
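
Spark itself is not listed above; I was on 2.1.1 (see the jar path in step 7). You can check your version with:
> spark-submit --version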

Disclaimer: I am not describing any best practices here. This post is just to get your Eclipse ready to work with Spark and Scala.

1. Install Eclipse

2. Install the Scala IDE plugin.
    - Go to Help -> Eclipse Marketplace
    - Search for "Scala IDE" and install "Scala IDE <version>"
    - After a successful installation, you should see it listed under the 'Installed' tab.


3. Change the perspective to "Scala" (Window -> Perspective -> Open Perspective -> Other... -> Scala)


4. Create a new Scala Project and provide a name.
    File -> New -> Scala Project

5. Create a new Scala Object under the above project and provide a name.
    Right-click the package -> New -> Scala Object

6. You can use the sample code provided at the following link (thanks to meniluca):
https://github.com/meniluca/spark-scala-maven-boilerplate-project

I am copying it here for ease of use:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.log4j.Logger

object Hello {

  def main(args: Array[String]): Unit = {

    val logger = Logger.getLogger(this.getClass())

    if (args.length < 2) {
      logger.error("=> wrong parameters number")
      System.err.println("Usage: MainExample <path-to-files> <output-path>")
      System.exit(1)
    }

    val jobName = "MainExample"

    // Run Spark locally with two worker threads and 1g of executor memory.
    val conf = new SparkConf()
      .setAppName(jobName)
      .setMaster("local[2]")
      .set("spark.executor.memory", "1g")
    val sc = new SparkContext(conf)

    val pathToFiles = args(0)
    val outputPath = args(1)

    logger.info("=> jobName \"" + jobName + "\"")
    logger.info("=> pathToFiles \"" + pathToFiles + "\"")

    // Read the input, replace spaces with commas, and write the result.
    val files = sc.textFile(pathToFiles)
    val rowsWithoutSpaces = files.map(_.replaceAll(" ", ","))
    rowsWithoutSpaces.saveAsTextFile(outputPath)

    sc.stop()
  }
}
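
A note before moving on: properties set directly on SparkConf take precedence over spark-submit flags, so the hardcoded setMaster("local[2]") above will override the --master option used in step 10. If you want the master to be chosen at submit time instead, a minimal sketch (my own variant, not part of the boilerplate) drops setMaster from the code:

import org.apache.spark.{SparkConf, SparkContext}

object HelloSubmittable {
  def main(args: Array[String]): Unit = {
    // No setMaster here: the master URL comes from spark-submit's --master flag.
    val conf = new SparkConf().setAppName("MainExample")
    val sc = new SparkContext(conf)
    // ... same job body as in Hello above ...
    sc.stop()
  }
}

Keep in mind that running this version from inside Eclipse will fail with "A master URL must be set in your configuration", so for IDE runs the local[2] version is the convenient one.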

7. To fix the resulting build errors, you have to add the Spark jars as dependencies. To do so:
Right-click the project -> Properties -> Java Build Path -> Libraries -> Add External JARs... -> locate and add the Spark jars.
(In my case, the jars were located under /usr/local/Cellar/apache-spark/2.1.1/libexec/jars/.)
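
If you installed Spark with Homebrew (as my Cellar path suggests), something like this should point you to the jar directory:
> ls $(brew --prefix apache-spark)/libexec/jars/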

8. Right-click the object and select "Run As -> Run Configurations..."
Set the main class to "Hello" and fill in the program arguments (Arguments tab) as:
/tmp/input.txt (this input file should already exist)
/tmp/out (the output files will be generated under this directory)
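
If you don't have an input file yet, one way to create the sample used below:
> cat > /tmp/input.txt << 'EOF'
twinkle twinkle little star
how i wonder what you are
up above the world so high
like a diamond in the sky
EOF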

9. Run the code. That's it! Just check the output.

> cat /tmp/input.txt
twinkle twinkle little star
how i wonder what you are
up above the world so high
like a diamond in the sky

> ls /tmp/out/
_SUCCESS   part-00000 part-00001

> cat /tmp/out/*
twinkle,twinkle,little,star
how,i,wonder,what,you,are
up,above,the,world,so,high
like,a,diamond,in,the,sky
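
One gotcha: saveAsTextFile refuses to write to a directory that already exists, so delete the output directory before re-running the job:
> rm -rf /tmp/out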

10. If you want to run it as an executable jar, first export the jar:
  Right-click the Scala project -> Export -> Java -> JAR file
  Provide a jar name (e.g. /tmp/scala-test.jar) and finish.

     Then, to run the Spark job, use the following command:
> spark-submit --class Hello --master local /tmp/scala-test.jar /tmp/input.txt /tmp/out2
     Once this completes, you can check the output directory:
> ls /tmp/out2/
_SUCCESS   part-00000 part-00001

> cat /tmp/out2/*
twinkle,twinkle,little,star
how,i,wonder,what,you,are
up,above,the,world,so,high
like,a,diamond,in,the,sky

Good luck!

