Tales From the Data

~an informal portfolio~

SXSW Free Show Sorter

Every March the SXSW film/music/tech festival invades Austin for a week plus. There's even a job fair, which /used/ to be free and amazing (if mostly out-of-town companies hiring). But this year it required a trade show badge, and badges for the different parts range from $500 to $2000, which is pretty exclusivist if you ask me, but it's ok. What matters is that there are still a ton of free music shows, every day, with local bands and those from as far away as Germany.

Exploring Json Data With Pandas

Flattening JSONs with Pandas

You may recall the JSON view showed messy nested dictionaries. To make it more readable, I altered what level of the data it extracts, but I still need to do something to view it as a table. In the next post I'll take a different structuring approach for downloading the bulk data, but for now I just want to look some more at what I have.
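One common way to turn nested dictionaries into a table is pandas' `json_normalize`. Here is a minimal sketch of that idea. The records and field names below are made up for illustration and are not the actual data from the post; the post itself may flatten things differently.

```python
import pandas as pd

# Hypothetical nested records, loosely shaped like API output
# (assumed for illustration; not the post's real data).
posts = [
    {"id": "1", "from": {"name": "Ann"}, "likes": {"summary": {"total_count": 3}}},
    {"id": "2", "from": {"name": "Bob"}, "likes": {"summary": {"total_count": 7}}},
]

# json_normalize walks the nested dicts and builds one column per leaf,
# joining the key path with the chosen separator.
flat = pd.json_normalize(posts, sep="_")

print(flat.columns.tolist())
print(flat)
```

Each nested key path becomes its own column (for example `likes_summary_total_count`), so the result can be sorted, filtered, and viewed like any other DataFrame.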

Extracting Data with Facebook API (II)

Using Facebook-sdk for Python

Now that I've got the basic usage down for the Facebook API, I need to access it through a Python script that can gather years' worth of data and also grab the children (comments and replies to comments). I could just use the usual requests library, but there happens to be a lovely Facebook Graph API package, facebook-sdk.

Windows File Paths

Another day, another Windows quirk. Today I found an explanation and a trick (read: the proper way) to write a file path in Python (or probably any language). I am sure this has been the reason behind inconsistent "relative path" success many times in the past year or so of working with data.
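The excerpt doesn't say which trick the post settles on, so the following is only a sketch of two widely used options: raw strings and `pathlib`. The path `C:\new_data\table.csv` is a made-up example, chosen because it hides two escape sequences.

```python
from pathlib import PureWindowsPath

# In a normal string literal, "\n" and "\t" are escape sequences,
# so this "path" silently contains a newline and a tab.
broken = "C:\new_data\table.csv"

# A raw string keeps every backslash literally.
raw = r"C:\new_data\table.csv"

# pathlib builds the path from parts, so no separators are typed by hand.
# PureWindowsPath applies Windows rules on any operating system.
built = PureWindowsPath("C:/") / "new_data" / "table.csv"

print("\n" in broken)      # is a newline hiding in the broken path?
print(raw)
print(str(built) == raw)   # do the two safe spellings agree?
```

In real scripts `pathlib.Path` is usually the better choice, since it picks the right flavor for whatever system the code runs on; `PureWindowsPath` is used here only so the example behaves the same everywhere.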

The most recent head-against-wall experience:

Analyzing Astronomical Data in Apache Spark: Discussion

Why?

After a seminar-style course on data science, my professor invited us to do our MS project (a non-verbose thesis) using Apache Spark, a new and popular engine for distributed computing. Since my previous degree was in physics, he suggested I look there for data. Having some background in astronomy, I knew there was plenty of free, accessible data available there, and turned my nose in that direction. In my initial research, I found that very little work had been done with Spark in astronomy research, and then found this delightful new Python library, astroML.

Astronomical Data in Spark: GMM Models From Prepared Data

Gaussian Mixture Model

This notebook runs two Gaussian mixture model algorithms to find clusters in flux space on stellar data: 1) the Spark ML GMM module (on an RDD) and 2) the basic scikit-learn GMM (on a NumPy array).

  • Data has been preprocessed in Spark as dataframes and converted to numpy arrays
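For a sense of what the scikit-learn half looks like, here is a minimal sketch using `sklearn.mixture.GaussianMixture` on a NumPy array. The data is synthetic (two made-up 2-D blobs standing in for flux measurements), and the choice of two components is an assumption for the example, not a result from the notebook.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for preprocessed flux data (assumed, not real stellar data):
# two well-separated 2-D blobs of 200 points each.
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(200, 2)),
])

# Fit a two-component mixture and assign each point to its likeliest component.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)

print(gmm.means_)            # one fitted center per component
print(np.bincount(labels))   # how many points landed in each cluster
```

On real data the number of components is not known in advance; comparing fits with `gmm.bic(X)` across several values of `n_components` is one common way to choose it.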