Tales From the Data

~an informal portfolio~

Extracting Data with Facebook API (II)

Using Facebook-sdk for Python

Now that I've got the basic usage down for the Facebook API, I need to access it through a Python script that can gather years' worth of data and also grab the children (comments and replies to comments). I could just use the usual requests library, but there happens to be a lovely Facebook Graph API package, facebook-sdk, already tailored for Facebook's peculiarities.

All you have to do is create a connection, using the token you got à la the last post, and use the request method to retrieve the posts. Note: when using facebook-sdk it's important to specify the version; it can be glitchy otherwise. Also, the limit parameter must be set, or else it defaults to retrieving 25 posts/comments, and the dates refer to the last-updated date. Facebook doesn't provide the date the post was created :( (though it does for comments).

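import facebook

# fb_token and group_id are assumed to be defined already (e.g. loaded from a config file)
graph = facebook.GraphAPI(access_token=fb_token, version='3.1')
# Ask for the group's feed within a date range, with an explicit limit (the default is 25)
posts = graph.request(group_id + '/?fields=feed.since(2019-02-01).until(2019-02-08).limit(100)')
print(len(posts['feed']['data']))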
26

And voilà, you get the posts in a dictionary format! Every post is stored in the list under posts['feed']['data'], as a message with an id. Next, I added a function to grab comments and replies for each post, by searching for comments on the post id. Then all that's left is exporting the dictionary to JSON and checking, with a JSON viewer, that I sorted the replies/comments/posts properly.

[Image: sample data]
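The comment-and-reply step described above could be sketched roughly like this. It is not the actual script: the helper names (fetch_comments, attach_children, export_json) are made up for illustration, it assumes graph is a facebook.GraphAPI connection, and it ignores paging, so anything beyond the first page of comments would be missed.

```python
import json


def fetch_comments(graph, parent_id, limit=100):
    """Return the comments (or replies) attached to a post or comment id."""
    # get_connections is the facebook-sdk call for reading an edge such as
    # /{id}/comments; only the first page of results is read here.
    response = graph.get_connections(
        id=parent_id, connection_name="comments", limit=limit
    )
    return response.get("data", [])


def attach_children(graph, posts):
    """Add a 'comments' list to each post and a 'replies' list to each comment."""
    for post in posts:
        post["comments"] = fetch_comments(graph, post["id"])
        for comment in post["comments"]:
            # Replies are fetched the same way, using the comment id as parent.
            comment["replies"] = fetch_comments(graph, comment["id"])
    return posts


def export_json(posts, path):
    """Write the nested posts/comments/replies structure to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(posts, f, indent=2, ensure_ascii=False)
```

With the posts dictionary from earlier, this would be called as attach_children(graph, posts['feed']['data']) followed by export_json(posts['feed']['data'], 'posts.json'), where the file name is just a placeholder. Note that every post and every comment costs one API call.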

Note: it's important to have a good JSON viewer.

[Image: The Good Json]

I like this one.

One grown-up programmer thing I did in my code was use a config file to store the group_id and access_token. The script just imports it and reads config.access_token, and the config file is listed in my .gitignore so that it's never committed. "Config file" always sounds so intimidating to me, like if you touch the file wrong a space shuttle will blow up, but in this case it's just a couple of variables. I also tried using logger, and actually catching exceptions, but I don't think I quite have the hang of it. A task for another day.
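For anyone who finds config files as intimidating as I did, here is a minimal sketch of what such a file might look like. The file name config.py and both values are placeholders I made up for illustration; only the variable names access_token and group_id come from the description above.

```python
# config.py (hypothetical contents; keep this file out of version control)
access_token = "PASTE-YOUR-TOKEN-HERE"  # placeholder, not a real token
group_id = "0000000000"                 # placeholder, not a real group id

# The main script would then read the values with something like:
#   import config
#   graph = facebook.GraphAPI(access_token=config.access_token, version='3.1')
#
# And .gitignore would contain a line with just the file name:
#   config.py
```

The point of the split is that the script itself can be shared publicly while the token stays on your own machine.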

API Rate Limits

The next problem is avoiding errors from the 200-calls-an-hour limit Facebook imposes. As you can see, just one week's posts total 26. To retrieve the comments for each post I'll have to make an API call, and another call for the replies to each comment. Assuming each post has just 10 comments, each needing its own call for replies, that's already more than 200 calls.
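To make that estimate concrete, here is the arithmetic as a tiny sketch. It assumes one call for the feed, one call per post for its comments, one call per comment for its replies, and that every listing fits in a single page; the function name estimate_calls is made up for illustration.

```python
def estimate_calls(n_posts, comments_per_post):
    """Rough count of API calls needed for posts, comments and replies."""
    feed_calls = 1                               # one request for the feed
    comment_calls = n_posts                      # one request per post
    reply_calls = n_posts * comments_per_post    # one request per comment
    return feed_calls + comment_calls + reply_calls


total = estimate_calls(n_posts=26, comments_per_post=10)
print(total)  # 287, already past the 200-call limit for one week of posts
```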

A proper way to manage this is to read the X-App-Usage header on each response and check what percentage of the limit has been used, but facebook-sdk gives only limited access to the header data that the standard requests functions expose, and I'm anxious to explore the data, so I just used the old time.sleep() function in between API calls. I can also view the API calls in the Facebook App Dashboard, shown here before I added pausing between calls.

[Image: API calls]

After the last calls shown I got an error when trying to grab more data, because it had already passed 200. But after adding only a 2-second pause between calls, which still allows up to 1,800 calls an hour and so should exceed the published limit, the dashboard doesn't show any calls, so it seems Facebook's not too strict before throttling.
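The two approaches to rate limiting discussed above could be sketched as follows. This is not what my script does for the header check: it assumes you have the response headers available as a dict (for example from a plain requests call), and that X-App-Usage carries a JSON string with percentage fields such as call_count, total_time and total_cputime, which is my understanding of the Graph API docs. The function names, the 80% threshold and the 60-second pause are all made up for illustration.

```python
import json
import time


def parse_app_usage(headers):
    """Return the X-App-Usage values as a dict, or {} if the header is absent."""
    raw = headers.get("X-App-Usage") or headers.get("x-app-usage")
    return json.loads(raw) if raw else {}


def seconds_to_wait(usage, threshold=80, pause=60):
    """Suggest a pause when any usage percentage reaches the threshold."""
    worst = max(usage.values(), default=0)
    return pause if worst >= threshold else 0


def throttled(fetch, *args, delay=2, **kwargs):
    """The simple approach: call fetch, then sleep for a fixed delay."""
    result = fetch(*args, **kwargs)
    time.sleep(delay)
    return result
```

The fixed delay is what I actually used, wrapping each API call along the lines of throttled(graph.get_connections, id=post_id, connection_name='comments'), where post_id stands for whichever post is being fetched. The header-based version would be the more polite option if the headers are easy to get at.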

Before I invest too much in making an efficient script and getting #allthedata, I want to check out a sample in Pandas and make sure I'm getting things in the right format. If you're curious about the current state of this script, you can see it at the niffler repo.
