Tales From the Data

~an informal portfolio~

Exploring Json Data With Pandas

Flattening Jsons with Pandas You may recall the json view showed messy nested dictionaries. To make it more readable, I altered what level of the data it extracts, but still need to do something to view it as a table. In the next post I'll do a different structuring approach for downloading the bulk data, but for now I just want to look some more at what I have.

Out of the box, Pandas does not like the json.

In [8]:
pd.read_json(os.path.join(wdir,'output/full_02_03_2019.json'), orient='records').head(3)
Out[8]:
comments id message updated_time
0 [{'message': 'Not ocular but yes migraines are... 6.456587e+27 Do any of you get ocular migraines? Any relati... 2019-02-02 19:04:06
1 [{'message': 'Make shure you air your rooms an... 6.456587e+27 Anyone know of any safe nebulizer treatments t... 2019-02-02 17:02:13
2 [{'message': 'I looked up the ingredients and ... 6.456587e+27 Has anyone had problems with Bruegger's plain ... 2019-02-01 21:37:49

Luckily, there's a tool to flatten the records further. First I loaded the records as 'nested_data' using the Python json module, and iterated through to make 'id' labels more specific. If you just run json_normalize(nested_data) it will show same as above. To break it up I'll create separate tables for different nodes.

In [ ]:
from pandas.io.json import json_normalize
posts = json_normalize(nested_data)[['post_id','message','updated_time']]
comments = json_normalize(nested_data, record_path='comments', meta=['post_id'])
comments.head(3)

Results Capture

Observe the blacked out name under 'replies'. Although the Facebook Graph API doesn't grab user ID's or names on posts now, the replies and any tags will contain user names. This is something I will have to filter out.

Unfortunately, when I try to flatten the subrecord, replies, it can't handle the NaNs, and results in key errors even if I give them the keys with null values. For now I'll just look at posts and comments. I've also changed the time format to just being a date. (Don't ask me why it doesn't include the time the post is created, just updated..)

In [21]:
full_df = pd.merge(posts, comments, on='post_id')
full_df['post_updated']=pd.to_datetime(full_df['updated_time'])
full_df['message_created']=pd.to_datetime(full_df['created_time'])
full_df[['message_x', 'post_updated','message_y', 'message_created']].head(2)
Out[21]:
message_x post_updated message_y message_created
0 Do any of you get ocular migraines? Any relati... 2019-02-02 19:04:06 Not ocular but yes migraines are my primary re... 2019-01-24 20:05:45
1 Do any of you get ocular migraines? Any relati... 2019-02-02 19:04:06 I have ocular migraines occasionally, but... 2019-01-24 22:58:51

This quick look showed me a few things:

  • I need to distinguish keys, particularly the times which refer to different components
  • I need to see if I can get a group member list of names to filter out from posts. I really wish Names or IDs /were/ included so that I could filter repeat answers (if 'Susan' always suggests same thing, it's not being uniquely recommended).
  • Facebook's privacy policy is inconsistent and confusing!

Comments