Analyzing Astronomical Data in Apache Spark: Discussion

Why? After a seminar-style course on data science, my professor invited us to do our MS project (a non-verbose thesis) using Apache Spark, a new and popular engine for distributed computing. Since my previous degree was in physics, he suggested I look there for data. Having some background in astronomy, I knew the field had plenty of free, accessible data, and turned my nose in that direction. In my initial research, I found that very little astronomy research had been done with Spark. I then found a delightful new Python library, astroML, and a text on machine learning in astronomy with many very clear examples of analyzing data from the Sloan Digital Sky Survey (SDSS): Statistics, Data Mining, and Machine Learning in Astronomy.

With all the code available to download, it seemed like a straightforward plan: pick an algorithm from the many introduced in the text that could benefit from Spark's style of distribution, get the data, and implement it in Spark. After all, Spark supports Python and has its own machine learning library! In the end I was unable to implement the algorithm I selected, which I would have realized sooner had I communicated more with my professor. Lesson learned.

The main goal was to explore the potential of using Spark to process and analyze astronomical data. Specifically, the plan was to use PySpark to implement a modified Gaussian mixture model available from astroML, a library of Python-based machine learning code from the University of Washington, and use it to characterize photometric data from the Sloan Digital Sky Survey.
