The goal of this multi-part guide is to walk the reader through setting up distributed data analysis on a remote server (AWS) with Apache Spark, using just Cygwin and Jupyter in your browser. On Windows!
Why? Everybody's talking about Jupyter, some sort of "notebook" thing that lets you see your code and graphs in one neat little page. You should try that out. Oh, and Spark! Spark lets you analyze data fast and easily! "Ok!" you say. "I can do easy." But wait, what you really need to try is AWS, it's /the/ cloud to put your data on. But you're on Windows, you don't have a lot of time, and despite being fairly tech capable, after a few environment path changes it still isn't quite working and you're frustrated. This tutorial is designed to hold your hand through the process so you (hopefully) don't have to Google every step.
Since there are different routes you can go from each step, I've decided to break it into parts. This one focuses on creating and connecting to an AWS instance.
Summary of Steps:
- Create AWS account and connect to the server
- Set up Anaconda and Jupyter
- Add Spark and Pyspark settings
- Test it out!
Voila! Click here to jump to quick instructions for accessing the instance after the first-time setup.
Key software for this goal:
1. Cygwin, OpenSSH
2. Amazon Web Services (AWS)
Other resources: This site gives a very informative overview of getting set up on EC2 https://sparkour.urizone.net/recipes/installing-ec2/.
Initial setup
Install Cygwin
First download Cygwin (https://cygwin.com/install.html) if you don't already have it. Cygwin gives you a Linux-like command line environment on Windows. It isn't necessary for using AWS and Spark, but it's handy and comes with an easy secure shell, OpenSSH, for connecting to your AWS instance. Since OpenSSH is not installed by default, use the search box in the installer to find it among the packages and change it from "skip" to a version to install. If you've previously installed Cygwin, just run the installer again to add OpenSSH.
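Once the installer finishes, a quick check from the Cygwin terminal confirms that OpenSSH actually landed on your PATH. This is only a minimal sketch; the variable name `ssh_status` is mine, purely for illustration.

```shell
# Look for an ssh client on the PATH (the OpenSSH package provides one).
if command -v ssh >/dev/null 2>&1; then
    ssh_status="found"
    # OpenSSH prints its version on stderr, so send that to stdout.
    ssh -V 2>&1 || true
else
    ssh_status="missing"
fi
echo "ssh client: $ssh_status"
```

If it reports "missing", run the Cygwin installer again and select the openssh package.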
Create Security Settings
Set up AWS account
Then, head over to Amazon Web Services to create an account. AWS is an umbrella for many nifty services, the most popular being its Elastic Compute Cloud, or EC2 (the 2 stands for the two Cs). EC2 lets you remotely access a server that's maintained by Amazon. Think of it as hardware as a service: if you happen to need a lot of servers for analyzing a large set of data this week, you don't have to purchase and set up the actual hardware yourself. The term elastic is used because what you are billed for stretches or shrinks to fit the amount of resources (space, computing time) you actually use. The free tier lets you use 750 hours a month, with various limitations. Since a credit card is required, pay attention to which services you select in the following steps, and at sign-up, opt for usage alerts via email.
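To put the 750 hours in perspective, a little shell arithmetic shows that the allowance covers one instance left running all month, but not two. This is a back-of-the-envelope sketch only: the variable names are mine, it assumes the hours are pooled across all your instances, and the other free tier limitations still apply.

```shell
free_hours=750                  # monthly free tier allowance mentioned above
hours_in_month=$((31 * 24))     # longest possible month: 744 hours

# Hours remaining if one or two instances run around the clock.
one_instance=$((free_hours - hours_in_month))
two_instances=$((free_hours - 2 * hours_in_month))

echo "Hours left with one instance always on:  $one_instance"
echo "Hours left with two instances always on: $two_instances"
```

A negative result means hours you would be billed for, which is a good reason to stop instances you aren't using.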
Navigating AWS: After you set up an account, you can select services from the management console, which you can generally reach via the golden cube on the upper left of the page. Settings and billing can be accessed on the upper right, where you'll also see the region of the servers you have access to.
We will use the circled services, starting with EC2.
Keys
Select EC2 from the services list, then on the left the "dashboard" should list Key Pairs under Network & Security. Click Create Key Pair and give it a relevant name for the project you'll be creating this Linux instance for. Doing so creates a .pem file which is automatically downloaded. Save this in a location easily accessed from the command line. For more information on creating keys, see the User Guide. Next, return to the services list and select IAM.
IAM
IAM, Identity and Access Management, helps manage the security of your instance and data. It allows you to easily set which types of security are required (access by specific ports for instance). It also lets you easily manage access for different services in case you are using AWS with a group or company.
Add a Role: A Role can specify which services an instance is able to use. It is a good idea to add one before starting an instance, since attaching a role afterwards is less convenient (and historically was not possible at all).
- Select the IAM service from under Security, Identity & Compliance.
- On the left, select Roles and Create New Role.
Security Group
Back in EC2, choose Security Groups from the side dashboard and then Create Security Group. Give the group a name (same as group name) and a description such as "For using Spark", then give it some rules like those displayed here:
When you add each rule you can select the type, protocol, port range and allowed source. If you want to be able to connect from any IP, leave the source as "Anywhere" (0.0.0.0/0), but if you know you'll mostly be using one IP you can select "My IP" (and edit the rules as needed).
Though these rules can also be set when you create an instance, having a saved group makes future instance creation easier.
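The "allowed source" of a rule is written in CIDR notation, which the console fills in for you. The sketch below shows the two cases described above. Everything named here is an assumption for illustration: `source_cidr` is a hypothetical helper, `203.0.113.7` is a documentation placeholder address, and `spark-group` is a made-up group name. The last command only prints an AWS command line call for an SSH rule on port 22 (the one rule you certainly need for connecting later), it does not run it.

```shell
# Hypothetical helper: turn "anywhere" or a single IPv4 address into a CIDR source.
source_cidr() {
    if [ "$1" = "anywhere" ]; then
        echo "0.0.0.0/0"    # every IPv4 address
    else
        echo "$1/32"        # exactly one address, like the "My IP" option
    fi
}

my_ip="203.0.113.7"         # placeholder; substitute your own public IP
echo "Open to all: $(source_cidr anywhere)"
echo "Only my IP:  $(source_cidr "$my_ip")"

# Print (do not execute) the equivalent AWS command line call for an SSH rule.
echo aws ec2 authorize-security-group-ingress \
    --group-name spark-group --protocol tcp --port 22 \
    --cidr "$(source_cidr "$my_ip")"
```

Restricting SSH to a single address is the safer choice whenever your IP is stable enough to make it practical.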
Start instance
Create a Linux Instance
Finally! Choose "Instances" from the EC2 side panel, and select the Launch Instance button. After each configuration step, you can either launch with defaults or continue to the next step to configure further. Be sure to always continue to the next step.
Steps:
- Here you can select which Amazon Machine Image (operating system + application server + applications) best suits your needs. I chose the Linux system (first choice), which includes Java, Python and the AWS command line tools if you want to try using those.
- Next, pick an instance type based on what is available with your account. On the free tier this is t2.micro, which includes 1 processor (vCPU) and 1 GB of memory. Select Next to keep configuring.
- If you made an IAM role above, you can select it here, about halfway down the settings. Also select "Protect against accidental termination" to keep the instance from being deleted by mistake.
- Keep storage default
- Add any tags
- Since you created a security group previously, you simply have to choose it here from the dropdown menu.
Now, hit the blue Review and Launch button and, after confirming your settings, launch your instance. It may ask whether you want to create a new key pair or use an existing one. Remember to save the .pem file in an easily accessed location with a short path.
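For reference, the same choices map onto a single AWS command line call. The sketch below only assembles and prints that call so you can see how the console steps line up with the flags; it does not launch anything. The AMI ID is deliberately left as a placeholder, and the key, group and role names (`sparkey`, `spark-group`, `spark-role`) are hypothetical stand-ins for whatever you created earlier.

```shell
ami_id="ami-xxxxxxxx"       # placeholder: the ID of the image you picked
instance_type="t2.micro"    # the free tier size from the steps above
key_name="sparkey"          # hypothetical key pair name (without .pem)
group_name="spark-group"    # hypothetical security group name
role_name="spark-role"      # hypothetical IAM role / instance profile name

# Assemble the command as text; --disable-api-termination corresponds to
# the "Protect against accidental termination" checkbox.
launch_cmd="aws ec2 run-instances --image-id $ami_id --instance-type $instance_type --key-name $key_name --security-groups $group_name --iam-instance-profile Name=$role_name --disable-api-termination"

echo "$launch_cmd"
```

The console remains the easier route for a first instance; the command is mostly useful once you want to repeat the setup.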
Connect to the Instance
Launching the instance simply turns it on on Amazon's side. We still need to connect to it from our side to use it. Interacting with the instance is done through the command line. Many tutorials have you use PuTTY or other means to log in and access the instance, but I found Cygwin's OpenSSH to be easy enough using the command offered by Amazon. Every time you want to connect to the instance, log in and select EC2 from the AWS console, then choose Instances from the left sidebar. You should see this on your screen:
Information about the selected instance, including the public IP, is shown at the bottom.
Right click the instance you launched and click Connect. A window pops up like so:
which tells you everything you need to connect. In Cygwin, navigate to the directory that contains your key (.pem) file from earlier and copy the example command provided.
Tip: You can use Ctrl+Shift+Insert to paste into Cygwin.
Example:
ssh -i "sparkey.pem" ec2-user@ec2-##-##-###-##.us-west-2.compute.amazonaws.com
- 'ssh' starts the OpenSSH secure shell client, '-i filename' points it at the identity file (the private key in your .pem file) used to authenticate, 'ec2-user' is the username, and after the '@' comes the address of the host you are connecting to.
- Be sure to alter the file path to your key if needed.
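One snag worth knowing about: OpenSSH refuses to use a private key file that other users can read, and warns about an unprotected private key file instead of connecting. Amazon's Connect window typically suggests restricting the key with `chmod 400` first. A small sketch, where `lock_key` is a hypothetical helper of mine and `sparkey.pem` is the key name from the example:

```shell
# Hypothetical helper: make a key file readable by its owner only.
lock_key() {
    if [ -f "$1" ]; then
        chmod 400 "$1"
        ls -l "$1"          # permissions should now begin with -r--------
    else
        echo "No key file at $1; adjust the path" >&2
        return 1
    fi
}

lock_key "sparkey.pem" || echo "Run this from the directory that holds your key."
```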
Type 'yes' when it asks if you still want to connect, despite its inability to establish authenticity of the host.
You should now be connected!
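To save retyping the full command on later visits, OpenSSH can read connection details from `~/.ssh/config`. The sketch below only prints an entry for you to review and paste into that file. The names are my own inventions: `ssh_config_entry` is a hypothetical helper, `spark` is a made-up alias, and the key path `~/sparkey.pem` is an assumption. The host is the masked address from the example, which you would replace with your instance's real one.

```shell
# Hypothetical helper: print an ssh config entry for an alias, host, user and key file.
ssh_config_entry() {
    printf 'Host %s\n' "$1"
    printf '    HostName %s\n' "$2"
    printf '    User %s\n' "$3"
    printf '    IdentityFile %s\n' "$4"
}

# Quoted so the shell leaves ~ alone; ssh expands it when reading the config.
ssh_config_entry spark \
    "ec2-##-##-###-##.us-west-2.compute.amazonaws.com" \
    ec2-user \
    "~/sparkey.pem"
```

With an entry like that saved, `ssh spark` should behave like the long command above. Keep in mind that the public address usually changes when an instance is stopped and started again, so the HostName line may need updating.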