Flattening Jsons with Pandas You may recall the json view showed messy nested dictionaries. To make it more readable, I altered what level of the data it extracts, but still need to do something to view it as a table. In the next post I'll do a different structuring approach for downloading the bulk data, but for now I just want to look some more at what I have.
Out of the box, Pandas does not like the json.
pd.read_json(os.path.join(wdir,'output/full_02_03_2019.json'), orient='records').head(3)
Luckily, there's a tool to flatten the records further. First I loaded the records as 'nested_data' using the Python json module, and iterated through to make 'id' labels more specific. If you just run json_normalize(nested_data) it will show same as above. To break it up I'll create separate tables for different nodes.
from pandas.io.json import json_normalize
posts = json_normalize(nested_data)[['post_id','message','updated_time']]
comments = json_normalize(nested_data, record_path='comments', meta=['post_id'])
comments.head(3)

Observe the blacked out name under 'replies'. Although the Facebook Graph API doesn't grab user ID's or names on posts now, the replies and any tags will contain user names. This is something I will have to filter out.
Unfortunately, when I try to flatten the subrecord, replies, it can't handle the NaNs, and results in key errors even if I give them the keys with null values. For now I'll just look at posts and comments. I've also changed the time format to just being a date. (Don't ask me why it doesn't include the time the post is created, just updated..)
full_df = pd.merge(posts, comments, on='post_id')
full_df['post_updated']=pd.to_datetime(full_df['updated_time'])
full_df['message_created']=pd.to_datetime(full_df['created_time'])
full_df[['message_x', 'post_updated','message_y', 'message_created']].head(2)
This quick look showed me a few things:
- I need to distinguish keys, particularly the times which refer to different components
- I need to see if I can get a group member list of names to filter out from posts. I really wish Names or IDs /were/ included so that I could filter repeat answers (if 'Susan' always suggests same thing, it's not being uniquely recommended).
- Facebook's privacy policy is inconsistent and confusing!