Jul 14, 2015

Hello Spark - How to configure a Spark development environment on Windows with IntelliJ

1. Install prerequisites

a) JDK 1.7

Download it here

b) Scala 2.10.5

Download it here

c) Maven (latest version)

Download it here

d) SBT

Download it here

After installing or unzipping these components, add them to the system environment variables, for example:
JAVA_HOME -->  C:\Program Files\Java\jdk1.7.0_60
MAVEN_HOME --> D:\Program Files\Dev\apache-maven-3.2.3
SCALA_HOME --> C:\Program Files (x86)\scala
SBT_HOME --> C:\Program Files (x86)\sbt\

and append the following items to PATH:
%JAVA_HOME%\bin;%MAVEN_HOME%\bin;%SCALA_HOME%\bin;%SBT_HOME%\bin;
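
As a quick sanity check after setting the variables, a small Scala program along the following lines can report which of them are still missing. This is only a sketch: the object name CheckEnv and its missing helper are made up for illustration, and it should be run from a freshly opened command prompt so that the new variables are visible.

```scala
object CheckEnv {
  // The four variables this guide asks you to define.
  val required = Seq("JAVA_HOME", "MAVEN_HOME", "SCALA_HOME", "SBT_HOME")

  // Returns the names from `required` that are absent or blank in `env`.
  def missing(env: Map[String, String]): Seq[String] =
    required.filterNot(name => env.get(name).exists(_.trim.nonEmpty))

  def main(args: Array[String]): Unit = {
    val absent = missing(sys.env)
    if (absent.isEmpty) println("All four variables are set.")
    else println("Not set: " + absent.mkString(", "))
  }
}
```

Taking the environment as a parameter keeps the check easy to try with a hand-written Map before running it against the real sys.env.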


2. Download and set up IntelliJ IDE

a) Download IntelliJ here

b) Download the Scala plugin


Note: if you are behind a company firewall, you must enable the HTTP proxy.

c) Configure Maven settings if you are behind a proxy


If you are behind a proxy, copy settings.xml from %MAVEN_HOME%\conf\settings.xml to C:\Users\*****\.m2 and enable the proxy section in the copy.
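
The proxy section of the copied settings.xml could look roughly like the sketch below. The element names are the standard Maven ones, but the id, host, port and nonProxyHosts values are placeholders (assumptions, not values from this article); replace them with the details of your own company proxy.

```xml
<settings>
  <proxies>
    <proxy>
      <!-- Placeholder id; any unique name works -->
      <id>company-proxy</id>
      <active>true</active>
      <protocol>http</protocol>
      <!-- Placeholder host and port; use your proxy's values -->
      <host>proxy.example.com</host>
      <port>8080</port>
      <!-- Optional: hosts that should bypass the proxy -->
      <nonProxyHosts>localhost|127.0.0.1</nonProxyHosts>
    </proxy>
  </proxies>
</settings>
```

If the proxy requires authentication, the same proxy element also accepts username and password entries.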


3. Create project

Let's create a sample project called HelloSpark.

1) File --> New --> Project..., then choose Scala



After clicking Finish, a dialog pops up; choose "New Window".

2) Add Maven support

Right-click the project name and choose "Add Framework Support...", then scroll down and select "Maven".

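Double-click pom.xml and merge the following content into the existing content of pom.xml:


    <build>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>1.7</source>
                    <target>1.7</target>
                </configuration>
            </plugin>
        </plugins>
    </build>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>1.4.0</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.10</artifactId>
            <version>1.4.0</version>
            <scope>provided</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.10</artifactId>
            <version>1.4.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.10</artifactId>
            <version>1.4.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.10</artifactId>
            <version>1.4.0</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka_2.10</artifactId>
            <version>0.8.1</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka_2.10</artifactId>
            <version>1.4.0</version>
        </dependency>
    </dependencies>

    <repositories>
        <repository>
            <id>Maven</id>
            <url>http://repo1.maven.org/maven2</url>
        </repository>
        <repository>
            <id>clojars</id>
            <url>http://clojars.org/repo/</url>
        </repository>
        <repository>
            <id>m2.java.net</id>
            <name>Java.net Maven 2 Repository</name>
            <url>http://download.java.net/maven/2</url>
            <layout>default</layout>
        </repository>
    </repositories>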



A sample pom.xml can be viewed here

After pasting the content, a dialog pops up at the top right; choose Enable Auto-Import and Maven will start downloading the specified dependencies.


Alternatively, you can trigger the download manually:
right-click the project name --> Maven --> Reimport

3) Create a folder for Scala sources

Expand the project file structure to src --> main, right-click main, choose New --> Directory,
and name it Scala.

Then mark the new "Scala" folder as a source folder of the project:
File --> Project Structure (shortcut Ctrl+Alt+Shift+S)

Modules --> scala --> Sources. As the screenshot shows, click 1, 2 and 3; the result is displayed at 4.

4) Create a Scala class

Right-click the Scala folder and choose New --> Scala Class.

Then modify the content as the screenshot shows.
Please also note:
1) org.apache.spark.SparkContext needs to be imported.
2) Create a file called pagecounts.
3) The program reads the content of the file named pagecounts, prints the first 10 lines, and then prints the total number of lines in the file.
You can put arbitrary content in pagecounts; a sample file can be viewed here. If you place it in another folder, modify the file path accordingly.
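
In case the screenshot is not visible, here is a minimal sketch of what such a class could look like, written from the description above. It is not necessarily the exact code from the screenshot: the summarize helper, the application name, and the choice of the "local" master are assumptions made for illustration, and the file is expected in the working directory under the name pagecounts.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object HelloSpark {
  // Returns the first n lines of the file at `path` and its total line count.
  // (Hypothetical helper, not taken from the original screenshot.)
  def summarize(sc: SparkContext, path: String, n: Int): (Array[String], Long) = {
    val lines = sc.textFile(path)
    (lines.take(n), lines.count())
  }

  def main(args: Array[String]): Unit = {
    // Assumption: run everything inside the IDE process with a local master.
    val conf = new SparkConf().setAppName("HelloSpark").setMaster("local")
    val sc = new SparkContext(conf)
    try {
      val (firstLines, total) = summarize(sc, "pagecounts", 10)
      firstLines.foreach(println)
      println(s"Total lines: $total")
    } finally {
      sc.stop() // release the local Spark resources
    }
  }
}
```

Setting the master in code keeps the example self-contained; if you prefer to pass it through the run configuration instead, the setMaster call can be dropped.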



5) Add the Spark jar file

We need to download the latest Spark package and unzip it.
Go to: https://spark.apache.org/downloads.html
Download a Spark package; you can choose the build for Hadoop 2.4 or 2.6 based on your requirements. For example, a sample spark-1.4.0-bin-hadoop2.4.tgz can be downloaded here.
After unzipping, add the package to the project and click OK in the popup.

6) Set the run configuration

In the IntelliJ menu, choose Run --> Edit Configurations, select Application, and fill in the fields as the screenshot below shows.


Final: Run it!

Click the Run button on the toolbar, and the result looks good!


Please note that at the beginning the output will show an SLF4J multiple bindings warning and a winutils error such as java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. These can be ignored for now.
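
The "null" in null\bin\winutils.exe appears because Hadoop found neither the hadoop.home.dir system property nor the HADOOP_HOME environment variable. If you would rather address the message than ignore it, one commonly used approach is to point hadoop.home.dir at a folder that contains bin\winutils.exe before the SparkContext is created. The sketch below assumes a hypothetical folder C:\hadoop where you have placed winutils.exe yourself; the object name HadoopHome is also made up for illustration.

```scala
object HadoopHome {
  // Hypothetical location: a folder that contains bin\winutils.exe.
  val assumedHadoopHome = "C:\\hadoop"

  // Sets hadoop.home.dir only when neither it nor HADOOP_HOME is already defined,
  // so an existing configuration is never overwritten.
  def configure(): Unit =
    if (sys.props.get("hadoop.home.dir").isEmpty && sys.env.get("HADOOP_HOME").isEmpty)
      System.setProperty("hadoop.home.dir", assumedHadoopHome)
}
```

Calling HadoopHome.configure() as the first statement of main would apply the setting before Spark starts.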

Happy Spark!


