Data Wrangling

Part 1: Gathering, Assessing, and Cleaning

Step 1: Gather

In [1]:
import pandas as pd
import requests
import tweepy
import os
import json
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

pd.set_option('display.max_colwidth', -1) # Displays the full text in a column instead of truncating it
In [2]:
archive = pd.read_csv('twitter-archive-enhanced.csv')
In [3]:
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
r = requests.get(url)
with open('image_predictions.tsv', 'wb') as file:
    file.write(r.content)
In [4]:
predictions = pd.read_csv('image_predictions.tsv', sep='\t')
In [5]:
consumer_key = 'YOUR_CONSUMER_KEY_HERE'
consumer_secret = 'YOUR_CONSUMER_SECRET_HERE'
access_token = 'YOUR_ACCESS_TOKEN_HERE'
access_secret = 'YOUR_ACCESS_SECRET_HERE'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
In [6]:
#with open('tweet_json.txt', 'w') as file:
#    for tweet in archive.tweet_id:
#        try:
#            jsonstuff = api.get_status(tweet, tweet_mode='extended')._json
#            json.dump(jsonstuff, file)
#            file.write('\n')
#            print(tweet)
#        except tweepy.TweepError as err:
#            print(err.api_code)
In [7]:
wanted_keys = ['id', 'favorite_count', 'favorited', 'is_quote_status', 'retweet_count', 'retweeted']
tweet_json_list = []

with open('tweet_json.txt', 'r') as file:
    for line in file:
        json_line = json.loads(line)
        selected_keys = {k:v for k, v in json_line.items() if k in wanted_keys}
        tweet_json_list.append(selected_keys)
In [8]:
extended = pd.DataFrame(tweet_json_list)

Step 2: Assess

2.1: Visual assessment

In [9]:
archive.head()
Out[9]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU NaN NaN NaN https://twitter.com/dog_rates/status/892420643555336193/photo/1 13 10 Phineas None None None None
1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV NaN NaN NaN https://twitter.com/dog_rates/status/892177421306343426/photo/1 13 10 Tilly None None None None
2 891815181378084864 NaN NaN 2017-07-31 00:18:03 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB NaN NaN NaN https://twitter.com/dog_rates/status/891815181378084864/photo/1 12 10 Archie None None None None
3 891689557279858688 NaN NaN 2017-07-30 15:58:51 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ NaN NaN NaN https://twitter.com/dog_rates/status/891689557279858688/photo/1 13 10 Darla None None None None
4 891327558926688256 NaN NaN 2017-07-29 16:00:24 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f NaN NaN NaN https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1 12 10 Franklin None None None None

Issues spotted:

Dirtiness:
  • many NaN values in columns: in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp
Messiness:
  • doggo/floofer/pupper/puppo: four columns for one variable
In [10]:
archive.tail()
Out[10]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
2351 666049248165822465 NaN NaN 2015-11-16 00:24:50 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Here we have a 1949 1st generation vulpix. Enjoys sweat tea and Fox News. Cannot be phased. 5/10 https://t.co/4B7cOc1EDq NaN NaN NaN https://twitter.com/dog_rates/status/666049248165822465/photo/1 5 10 None None None None None
2352 666044226329800704 NaN NaN 2015-11-16 00:04:52 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is a purebred Piers Morgan. Loves to Netflix and chill. Always looks like he forgot to unplug the iron. 6/10 https://t.co/DWnyCjf2mx NaN NaN NaN https://twitter.com/dog_rates/status/666044226329800704/photo/1 6 10 a None None None None
2353 666033412701032449 NaN NaN 2015-11-15 23:21:54 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Here is a very happy pup. Big fan of well-maintained decks. Just look at that tongue. 9/10 would cuddle af https://t.co/y671yMhoiR NaN NaN NaN https://twitter.com/dog_rates/status/666033412701032449/photo/1 9 10 a None None None None
2354 666029285002620928 NaN NaN 2015-11-15 23:05:30 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is a western brown Mitsubishi terrier. Upset about leaf. Actually 2 dogs here. 7/10 would walk the shit out of https://t.co/r7mOb2m0UI NaN NaN NaN https://twitter.com/dog_rates/status/666029285002620928/photo/1 7 10 a None None None None
2355 666020888022790149 NaN NaN 2015-11-15 22:32:08 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Here we have a Japanese Irish Setter. Lost eye in Vietnam (?). Big fan of relaxing on stair. 8/10 would pet https://t.co/BLDqew2Ijj NaN NaN NaN https://twitter.com/dog_rates/status/666020888022790149/photo/1 8 10 None None None None None

Issues spotted:

Dirtiness:
  • Single lowercase letter ('a') in the name column instead of None.
In [11]:
predictions.head()
Out[11]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True

Issues spotted:

Dirtiness:
  • inconsistent cases (some breed names uppercase, some lowercase)
In [12]:
predictions.tail()
Out[12]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
2070 891327558926688256 https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg 2 basset 0.555712 True English_springer 0.225770 True German_short-haired_pointer 0.175219 True
2071 891689557279858688 https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg 1 paper_towel 0.170278 False Labrador_retriever 0.168086 True spatula 0.040836 False
2072 891815181378084864 https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg 1 Chihuahua 0.716012 True malamute 0.078253 True kelpie 0.031379 True
2073 892177421306343426 https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg 1 Chihuahua 0.323581 True Pekinese 0.090647 True papillon 0.068957 True
2074 892420643555336193 https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg 1 orange 0.097049 False bagel 0.085851 False banana 0.076110 False
In [13]:
extended.head()
Out[13]:
favorite_count favorited id is_quote_status retweet_count retweeted
0 38145 False 892420643555336193 False 8346 False
1 32716 False 892177421306343426 False 6168 False
2 24632 False 891815181378084864 False 4082 False
3 41490 False 891689557279858688 False 8488 False
4 39655 False 891327558926688256 False 9181 False
In [14]:
extended.tail()
Out[14]:
favorite_count favorited id is_quote_status retweet_count retweeted
2337 108 False 666049248165822465 False 41 False
2338 295 False 666044226329800704 False 139 False
2339 124 False 666033412701032449 False 43 False
2340 128 False 666029285002620928 False 47 False
2341 2535 False 666020888022790149 False 507 False

2.2: Programmatic assessment

In [15]:
archive.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), object(10)
memory usage: 313.0+ KB
In [16]:
archive.describe()
Out[16]:
tweet_id in_reply_to_status_id in_reply_to_user_id retweeted_status_id retweeted_status_user_id rating_numerator rating_denominator
count 2.356000e+03 7.800000e+01 7.800000e+01 1.810000e+02 1.810000e+02 2356.000000 2356.000000
mean 7.427716e+17 7.455079e+17 2.014171e+16 7.720400e+17 1.241698e+16 13.126486 10.455433
std 6.856705e+16 7.582492e+16 1.252797e+17 6.236928e+16 9.599254e+16 45.876648 6.745237
min 6.660209e+17 6.658147e+17 1.185634e+07 6.661041e+17 7.832140e+05 0.000000 0.000000
25% 6.783989e+17 6.757419e+17 3.086374e+08 7.186315e+17 4.196984e+09 10.000000 10.000000
50% 7.196279e+17 7.038708e+17 4.196984e+09 7.804657e+17 4.196984e+09 11.000000 10.000000
75% 7.993373e+17 8.257804e+17 4.196984e+09 8.203146e+17 4.196984e+09 12.000000 10.000000
max 8.924206e+17 8.862664e+17 8.405479e+17 8.874740e+17 7.874618e+17 1776.000000 170.000000

Issues spotted:

Dirtiness:
  • Many many null values in previously-mentioned columns; some nulls in expanded_urls

  • Invalid data types: timestamp & retweeted_status_timestamp

  • Invalid data: the project brief specified only original tweets, so the 78 replies and 181 retweets (and any others) are invalid for this project.

  • 0 values in rating_numerator and rating_denominator

In [17]:
predictions.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB
In [18]:
predictions.describe()
Out[18]:
tweet_id img_num p1_conf p2_conf p3_conf
count 2.075000e+03 2075.000000 2075.000000 2.075000e+03 2.075000e+03
mean 7.384514e+17 1.203855 0.594548 1.345886e-01 6.032417e-02
std 6.785203e+16 0.561875 0.271174 1.006657e-01 5.090593e-02
min 6.660209e+17 1.000000 0.044333 1.011300e-08 1.740170e-10
25% 6.764835e+17 1.000000 0.364412 5.388625e-02 1.622240e-02
50% 7.119988e+17 1.000000 0.588230 1.181810e-01 4.944380e-02
75% 7.932034e+17 1.000000 0.843855 1.955655e-01 9.180755e-02
max 8.924206e+17 4.000000 1.000000 4.880140e-01 2.734190e-01
In [19]:
extended.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2342 entries, 0 to 2341
Data columns (total 6 columns):
favorite_count     2342 non-null int64
favorited          2342 non-null bool
id                 2342 non-null int64
is_quote_status    2342 non-null bool
retweet_count      2342 non-null int64
retweeted          2342 non-null bool
dtypes: bool(3), int64(3)
memory usage: 61.8 KB
In [20]:
extended.describe()
Out[20]:
favorite_count id retweet_count
count 2342.000000 2.342000e+03 2342.000000
mean 7984.561913 7.422212e+17 2943.631939
std 12359.878226 6.832408e+16 4950.322617
min 0.000000 6.660209e+17 0.000000
25% 1378.250000 6.783509e+17 591.500000
50% 3473.500000 7.186224e+17 1374.500000
75% 9781.250000 7.986971e+17 3432.250000
max 164669.000000 8.924206e+17 83898.000000

Issues spotted:

Dirtiness:
  • 0 values in favorite_count and retweet_count seem unlikely
Messiness:
  • This information should be joined with the archive information

I want to pick out a few more variables/areas that I think might have issues...:

In [21]:
archive.name.value_counts()
Out[21]:
None            745
a               55 
Charlie         12 
Cooper          11 
Lucy            11 
Oliver          11 
Lola            10 
Tucker          10 
Penny           10 
Bo              9  
Winston         9  
Sadie           8  
the             8  
Buddy           7  
Toby            7  
an              7  
Bailey          7  
Daisy           7  
Leo             6  
Oscar           6  
Stanley         6  
Milo            6  
Scout           6  
Jax             6  
Bella           6  
Koda            6  
Dave            6  
Rusty           6  
Jack            6  
Oakley          5  
               ..  
Stefan          1  
all             1  
Shnuggles       1  
Huck            1  
Nugget          1  
Storkson        1  
Major           1  
Doobert         1  
Laika           1  
Angel           1  
Ralf            1  
Stella          1  
Dook            1  
Jazzy           1  
Kaiya           1  
Lance           1  
Leonidas        1  
Tedrick         1  
Mike            1  
Hector          1  
Stuart          1  
Billy           1  
Norman          1  
Emanuel         1  
Alexanderson    1  
Kallie          1  
such            1  
Bookstore       1  
Raphael         1  
Mairi           1  
Name: name, Length: 957, dtype: int64

Issues spotted:

Dirtiness:
  • Several names look invalid, including 'a', 'the', 'very,' 'an,' and 'unacceptable.'
In [22]:
archive[archive['rating_denominator'] != 10]
Out[22]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
313 835246439529840640 8.352460e+17 2.625958e+07 2017-02-24 21:54:03 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> @jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho NaN NaN NaN NaN 960 0 None None None None None
342 832088576586297345 8.320875e+17 3.058208e+07 2017-02-16 04:45:50 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> @docmisterio account started on 11/15/15 NaN NaN NaN NaN 11 15 None None None None None
433 820690176645140481 NaN NaN 2017-01-15 17:52:40 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd NaN NaN NaN https://twitter.com/dog_rates/status/820690176645140481/photo/1,https://twitter.com/dog_rates/status/820690176645140481/photo/1,https://twitter.com/dog_rates/status/820690176645140481/photo/1 84 70 None None None None None
516 810984652412424192 NaN NaN 2016-12-19 23:06:23 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t https://t.co/LouL5vdvxx NaN NaN NaN https://www.gofundme.com/sams-smile,https://twitter.com/dog_rates/status/810984652412424192/photo/1 24 7 Sam None None None None
784 775096608509886464 NaN NaN 2016-09-11 22:20:06 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> RT @dog_rates: After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https:/… 7.403732e+17 4.196984e+09 2016-06-08 02:41:38 +0000 https://twitter.com/dog_rates/status/740373189193256964/photo/1,https://twitter.com/dog_rates/status/740373189193256964/photo/1,https://twitter.com/dog_rates/status/740373189193256964/photo/1,https://twitter.com/dog_rates/status/740373189193256964/photo/1 9 11 None None None None None
902 758467244762497024 NaN NaN 2016-07-28 01:00:57 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE NaN NaN NaN https://twitter.com/dog_rates/status/758467244762497024/video/1 165 150 None None None None None
1068 740373189193256964 NaN NaN 2016-06-08 02:41:38 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ NaN NaN NaN https://twitter.com/dog_rates/status/740373189193256964/photo/1,https://twitter.com/dog_rates/status/740373189193256964/photo/1,https://twitter.com/dog_rates/status/740373189193256964/photo/1,https://twitter.com/dog_rates/status/740373189193256964/photo/1 9 11 None None None None None
1120 731156023742988288 NaN NaN 2016-05-13 16:15:54 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv NaN NaN NaN https://twitter.com/dog_rates/status/731156023742988288/photo/1 204 170 this None None None None
1165 722974582966214656 NaN NaN 2016-04-21 02:25:47 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a NaN NaN NaN https://twitter.com/dog_rates/status/722974582966214656/photo/1 4 20 None None None None None
1202 716439118184652801 NaN NaN 2016-04-03 01:36:11 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq NaN NaN NaN https://twitter.com/dog_rates/status/716439118184652801/photo/1 50 50 Bluebert None None None None
1228 713900603437621249 NaN NaN 2016-03-27 01:29:02 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody https://t.co/mpvaVxKmc1 NaN NaN NaN https://twitter.com/dog_rates/status/713900603437621249/photo/1 99 90 None None None None None
1254 710658690886586372 NaN NaN 2016-03-18 02:46:49 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Here's a brigade of puppers. All look very prepared for whatever happens next. 80/80 https://t.co/0eb7R1Om12 NaN NaN NaN https://twitter.com/dog_rates/status/710658690886586372/photo/1 80 80 None None None None None
1274 709198395643068416 NaN NaN 2016-03-14 02:04:08 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> From left to right:\nCletus, Jerome, Alejandro, Burp, &amp; Titson\nNone know where camera is. 45/50 would hug all at once https://t.co/sedre1ivTK NaN NaN NaN https://twitter.com/dog_rates/status/709198395643068416/photo/1 45 50 None None None None None
1351 704054845121142784 NaN NaN 2016-02-28 21:25:30 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Here is a whole flock of puppers. 60/50 I'll take the lot https://t.co/9dpcw6MdWa NaN NaN NaN https://twitter.com/dog_rates/status/704054845121142784/photo/1 60 50 a None None None None
1433 697463031882764288 NaN NaN 2016-02-10 16:51:59 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Happy Wednesday here's a bucket of pups. 44/40 would pet all at once https://t.co/HppvrYuamZ NaN NaN NaN https://twitter.com/dog_rates/status/697463031882764288/photo/1 44 40 None None None None None
1598 686035780142297088 6.860340e+17 4.196984e+09 2016-01-10 04:04:10 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Yes I do realize a rating of 4/20 would've been fitting. However, it would be unjust to give these cooperative pups that low of a rating NaN NaN NaN NaN 4 20 None None None None None
1634 684225744407494656 6.842229e+17 4.196984e+09 2016-01-05 04:11:44 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Two sneaky puppers were not initially seen, moving the rating to 143/130. Please forgive us. Thank you https://t.co/kRK51Y5ac3 NaN NaN NaN https://twitter.com/dog_rates/status/684225744407494656/photo/1,https://twitter.com/dog_rates/status/684225744407494656/photo/1 143 130 None None None None None
1635 684222868335505415 NaN NaN 2016-01-05 04:00:18 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Someone help the girl is being mugged. Several are distracting her while two steal her shoes. Clever puppers 121/110 https://t.co/1zfnTJLt55 NaN NaN NaN https://twitter.com/dog_rates/status/684222868335505415/photo/1 121 110 None None None None None
1662 682962037429899265 NaN NaN 2016-01-01 16:30:13 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Darrel. He just robbed a 7/11 and is in a high speed police chase. Was just spotted by the helicopter 10/10 https://t.co/7EsP8LmSp5 NaN NaN NaN https://twitter.com/dog_rates/status/682962037429899265/photo/1 7 11 Darrel None None None None
1663 682808988178739200 6.827884e+17 4.196984e+09 2016-01-01 06:22:03 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> I'm aware that I could've said 20/16, but here at WeRateDogs we are very professional. An inconsistent rating scale is simply irresponsible NaN NaN NaN NaN 20 16 None None None None None
1779 677716515794329600 NaN NaN 2015-12-18 05:06:23 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> IT'S PUPPERGEDDON. Total of 144/120 ...I think https://t.co/ZanVtAtvIq NaN NaN NaN https://twitter.com/dog_rates/status/677716515794329600/photo/1 144 120 None None None None None
1843 675853064436391936 NaN NaN 2015-12-13 01:41:41 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Here we have an entire platoon of puppers. Total score: 88/80 would pet all at once https://t.co/y93p6FLvVw NaN NaN NaN https://twitter.com/dog_rates/status/675853064436391936/photo/1,https://twitter.com/dog_rates/status/675853064436391936/photo/1 88 80 None None None None None
2335 666287406224695296 NaN NaN 2015-11-16 16:11:11 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is an Albanian 3 1/2 legged Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv NaN NaN NaN https://twitter.com/dog_rates/status/666287406224695296/photo/1 1 2 an None None None None

Issues spotted:

Dirtiness:
  • Several mistakes in denominators and their associated numerators (9/11, 4/20, etc.). The script appears to have pulled the wrong fraction from the tweet in those cases. Five of these are in original tweets: row 1068, 1165, 1202, 1662, 2335. I think they have to be fixed manually this time.
In [23]:
predictions[predictions['jpg_url'].duplicated()]
predictions[predictions['jpg_url'] == 'https://pbs.twimg.com/media/DA7iHL5U0AA1OQo.jpg']
Out[23]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
1970 868880397819494401 https://pbs.twimg.com/media/DA7iHL5U0AA1OQo.jpg 1 laptop 0.153718 False French_bulldog 0.099984 True printer 0.07713 False
1992 873697596434513921 https://pbs.twimg.com/media/DA7iHL5U0AA1OQo.jpg 1 laptop 0.153718 False French_bulldog 0.099984 True printer 0.07713 False

Issues spotted:

Dirtiness:
  • Duplicated jpg urls/predictions with different tweet ids
In [24]:
extended[extended['favorite_count'] == 0]
Out[24]:
favorite_count favorited id is_quote_status retweet_count retweeted
31 0 False 886054160059072513 True 105 False
35 0 False 885311592912609280 False 18237 False
67 0 False 879130579576475649 False 6721 False
72 0 False 878404777348136964 False 1270 False
73 0 False 878316110768087041 False 6558 False
77 0 False 877611172832227328 False 80 False
90 0 False 874434818259525634 False 14574 False
95 0 False 873337748698140672 False 1572 False
106 0 False 871166179821445120 False 5678 False
120 0 False 868639477480148993 False 2103 False
126 0 False 867072653475098625 False 121 False
132 0 False 866094527597207552 False 8448 False
141 0 False 863471782782697472 False 2486 False
153 0 False 860981674716409858 False 2195 False
154 0 False 860924035999428608 False 838 False
159 0 False 860177593139703809 False 31506 False
165 0 False 858860390427611136 False 8343 False
174 0 False 857062103051644929 False 177 False
176 0 False 856602993587888130 False 11011 False
179 0 False 856330835276025856 False 700 False
188 0 False 855245323840757760 False 6198 False
189 0 False 855138241867124737 False 47 False
198 0 False 852936405516943360 False 2104 False
205 0 False 851953902622658560 False 10112 False
206 0 False 851861385021730816 False 23 False
216 0 False 849668094696017920 False 5235 False
224 0 False 847978865427394560 False 3448 False
225 0 False 847971574464610304 False 452 False
243 0 False 845098359547420673 False 8151 False
258 0 False 841833993020538882 False 16400 False
... ... ... ... ... ... ...
762 0 False 776249906839351296 False 10005 False
767 0 False 775898661951791106 False 17949 False
782 0 False 773336787167145985 False 5556 False
788 0 False 772615324260794368 False 3680 False
799 0 False 771171053431250945 False 8242 False
803 0 False 771004394259247104 False 240 False
809 0 False 770093767776997377 False 3308 False
813 0 False 769335591808995329 False 8371 False
816 0 False 768909767477751808 False 2941 False
820 0 False 768554158521745409 False 6337 False
828 0 False 766864461642756096 False 6149 False
834 0 False 766078092750233600 False 2827 False
847 0 False 763167063695355904 False 3277 False
855 0 False 761750502866649088 False 4275 False
859 0 False 761371037149827077 False 19485 False
872 0 False 760153949710192640 False 34 False
877 0 False 759566828574212096 False 22897 False
882 0 False 759159934323924993 False 1262 False
895 0 False 757729163776290825 False 8779 False
898 0 False 757597904299253760 False 315 False
913 0 False 754874841593970688 False 8673 False
923 0 False 753298634498793472 False 6242 False
929 0 False 752701944171524096 False 3119 False
935 0 False 752309394570878976 False 18041 False
998 0 False 747242308580548608 False 3098 False
1009 0 False 746521445350707200 False 1057 False
1029 0 False 743835915802583040 False 2237 False
1228 0 False 711998809858043904 False 134 False
2245 0 False 667550904950915073 False 35 False
2246 0 False 667550882905632768 False 33 False

168 rows × 6 columns

In [25]:
extended[extended['retweet_count'] == 0]
Out[25]:
favorite_count favorited id is_quote_status retweet_count retweeted
283 149 False 838085839343206401 False 0 False

Issues spotted:

Dirtiness:
  • Not sure what to make of these 0 values for favorite and retweet count. They seem unlikely, but the other info appears valid.

Issues Summary:

Quality (Dirty Data):

Tweet archive:

  • many NaN values in columns: in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, expanded_urls
  • Invalid data types: timestamp, retweeted_status_timestamp
  • Invalid data in retweets/replies, according to the project brief
  • 0 values in rating_numerator and rating_denominator
  • Several names look invalid, including 'a', 'the', 'very,' 'an,' and 'unacceptable.'
  • Five tweets pulled the wrong fraction as their rating (4/20, 7/11, etc.)

Predictions:

  • Inconsistent cases (some breed names uppercase, some lowercase)
  • Duplicated jpg urls/predictions with different tweet ids

Extended tweet info (from the API):

  • 0 values in favorite_count and retweet_count seem unlikely

Tidiness (Messy Data):

Tweet archive:

  • doggo/floofer/pupper/puppo: four columns for one variable
  • Contains info about dogs and info about tweets in one table

Extended tweet info:

  • This information does not need to be in its own table

Step 3: Clean

  • Merge archive and extended tables to fill in some missing values (this will also address tidiness issue #3)
  • Correct invalid data types: timestamp and retweeted_status_timestamp to datetime
  • Several rating and name mistakes appear to be retweets or replies: delete retweets/replies and reassess (can delete retweet/reply-relevant columns after this)
  • Fill in 0 values in favorite_count and retweet_count (Spoiler alert: this gets taken care of by deleting retweets/replies)
  • Fix 0 value rating numerators/denominators
  • Manually fix five invalid ratings
  • Go through invalid names and change to correct name, if one exists, or None
  • Change inconsistent cases in predictions to all lowercase
  • Append extended_urls from archive to predictions and reassess duplicates
  • Create new table to separate dog info from tweet info
  • Melt doggo/floofer/pupper/puppo columns.

Define:

  • Merge archive and extended tables to fill in some missing values (this will also address tidiness issue #3)

Code:

In [26]:
archive_clean = archive.copy()
predictions_clean = predictions.copy()
extended_clean = extended.copy()
In [27]:
full_archive = pd.merge(archive_clean, extended_clean, how='left', left_on='tweet_id', right_on='id').drop('id', axis=1)

Test:

In [28]:
full_archive.head()
Out[28]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls ... name doggo floofer pupper puppo favorite_count favorited is_quote_status retweet_count retweeted
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU NaN NaN NaN https://twitter.com/dog_rates/status/892420643555336193/photo/1 ... Phineas None None None None 38145.0 False False 8346.0 False
1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV NaN NaN NaN https://twitter.com/dog_rates/status/892177421306343426/photo/1 ... Tilly None None None None 32716.0 False False 6168.0 False
2 891815181378084864 NaN NaN 2017-07-31 00:18:03 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB NaN NaN NaN https://twitter.com/dog_rates/status/891815181378084864/photo/1 ... Archie None None None None 24632.0 False False 4082.0 False
3 891689557279858688 NaN NaN 2017-07-30 15:58:51 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ NaN NaN NaN https://twitter.com/dog_rates/status/891689557279858688/photo/1 ... Darla None None None None 41490.0 False False 8488.0 False
4 891327558926688256 NaN NaN 2017-07-29 16:00:24 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f NaN NaN NaN https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1 ... Franklin None None None None 39655.0 False False 9181.0 False

5 rows × 22 columns

Define:

  • Correct invalid data types: timestamp and retweeted_status_timestamp to datetime
  • Merge appears to have also turned ints into floats and bools to str, turn to int and bool respectively
In [29]:
full_archive.timestamp = pd.to_datetime(full_archive.timestamp, errors='coerce', infer_datetime_format=True)
full_archive.retweeted_status_timestamp = pd.to_datetime(full_archive.retweeted_status_timestamp, errors='coerce', infer_datetime_format=True)
In [30]:
full_archive.favorite_count = full_archive.favorite_count.fillna(0).astype(int)
full_archive.retweet_count = full_archive.retweet_count.fillna(0).astype(int)
full_archive.favorited = full_archive.favorited.astype(bool)
full_archive.retweeted = full_archive.retweeted.astype(bool)
full_archive.is_quote_status = full_archive.is_quote_status.astype(bool)

Test:

In [31]:
full_archive.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2356 entries, 0 to 2355
Data columns (total 22 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null datetime64[ns]
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null datetime64[ns]
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
favorite_count                2356 non-null int64
favorited                     2356 non-null bool
is_quote_status               2356 non-null bool
retweet_count                 2356 non-null int64
retweeted                     2356 non-null bool
dtypes: bool(3), datetime64[ns](2), float64(4), int64(5), object(8)
memory usage: 375.0+ KB

Define:

  • Several rating and name mistakes appear to be retweets or replies: delete retweets/replies and reassess (can delete retweet/reply-relevant columns after this)

Code:

In [32]:
full_archive = full_archive[full_archive['in_reply_to_status_id'].isnull()]
full_archive = full_archive[full_archive['retweeted'] == False]
full_archive = full_archive[full_archive['is_quote_status'] == False]
full_archive = full_archive[full_archive['favorited'] == False]
In [33]:
full_archive = full_archive[full_archive['retweeted_status_id'].isnull()] # 'Retweeted' missed a few

Test:

In [34]:
full_archive.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2067 entries, 0 to 2355
Data columns (total 22 columns):
tweet_id                      2067 non-null int64
in_reply_to_status_id         0 non-null float64
in_reply_to_user_id           0 non-null float64
timestamp                     2067 non-null datetime64[ns]
source                        2067 non-null object
text                          2067 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null datetime64[ns]
expanded_urls                 2064 non-null object
rating_numerator              2067 non-null int64
rating_denominator            2067 non-null int64
name                          2067 non-null object
doggo                         2067 non-null object
floofer                       2067 non-null object
pupper                        2067 non-null object
puppo                         2067 non-null object
favorite_count                2067 non-null int64
favorited                     2067 non-null bool
is_quote_status               2067 non-null bool
retweet_count                 2067 non-null int64
retweeted                     2067 non-null bool
dtypes: bool(3), datetime64[ns](2), float64(4), int64(5), object(8)
memory usage: 329.0+ KB

Define:

  • Irrelevant columns (relating to retweets/replies) can now be dropped.

Code:

In [35]:
full_archive = full_archive.drop(['in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 
                                            'retweeted_status_user_id', 'retweeted_status_timestamp', 'favorited',
                                            'is_quote_status', 'retweeted'], axis=1)

Test:

In [36]:
full_archive.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2067 entries, 0 to 2355
Data columns (total 14 columns):
tweet_id              2067 non-null int64
timestamp             2067 non-null datetime64[ns]
source                2067 non-null object
text                  2067 non-null object
expanded_urls         2064 non-null object
rating_numerator      2067 non-null int64
rating_denominator    2067 non-null int64
name                  2067 non-null object
doggo                 2067 non-null object
floofer               2067 non-null object
pupper                2067 non-null object
puppo                 2067 non-null object
favorite_count        2067 non-null int64
retweet_count         2067 non-null int64
dtypes: datetime64[ns](1), int64(5), object(8)
memory usage: 242.2+ KB

Define:

  • Fill in 0 values in favorite_count and retweet_count

Code/Test:

In [37]:
full_archive[full_archive['favorite_count'] == 0]
full_archive[full_archive['retweet_count'] == 0] # It turns out these were taken care of by deleting retweets/replies!
Out[37]:
tweet_id timestamp source text expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo favorite_count retweet_count

Define:

  • Fill in 0s in rating numerators and denominators

Code:

In [38]:
full_archive[full_archive['rating_denominator'] == 0] # This must have also been taken care of by the replies/retweets purge.
Out[38]:
tweet_id timestamp source text expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo favorite_count retweet_count
In [39]:
full_archive[full_archive['rating_numerator'] == 0] # This one turns out to be a joke and can be deleted as well.
Out[39]:
tweet_id timestamp source text expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo favorite_count retweet_count
315 835152434251116546 2017-02-24 15:40:31 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> When you're so blinded by your systematic plagiarism that you forget what day it is. 0/10 https://t.co/YbEJPkg4Ag https://twitter.com/dog_rates/status/835152434251116546/photo/1,https://twitter.com/dog_rates/status/835152434251116546/photo/1,https://twitter.com/dog_rates/status/835152434251116546/photo/1 0 10 None None None None None 23644 3253
In [40]:
full_archive = full_archive.drop(315, axis=0)

Test:

In [41]:
full_archive[full_archive['rating_numerator'] == 0]
Out[41]:
tweet_id timestamp source text expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo favorite_count retweet_count

Define:

  • Manually fix five invalid ratings in rows 1068, 1165, 1202, 1662, 2335.

Code:

In [42]:
full_archive[full_archive['rating_denominator'] != 10]
Out[42]:
tweet_id timestamp source text expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo favorite_count retweet_count
433 820690176645140481 2017-01-15 17:52:40 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd https://twitter.com/dog_rates/status/820690176645140481/photo/1,https://twitter.com/dog_rates/status/820690176645140481/photo/1,https://twitter.com/dog_rates/status/820690176645140481/photo/1 84 70 None None None None None 12998 3530
516 810984652412424192 2016-12-19 23:06:23 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t https://t.co/LouL5vdvxx https://www.gofundme.com/sams-smile,https://twitter.com/dog_rates/status/810984652412424192/photo/1 24 7 Sam None None None None 5706 1577
902 758467244762497024 2016-07-28 01:00:57 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE https://twitter.com/dog_rates/status/758467244762497024/video/1 165 150 None None None None None 5097 2417
1068 740373189193256964 2016-06-08 02:41:38 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ https://twitter.com/dog_rates/status/740373189193256964/photo/1,https://twitter.com/dog_rates/status/740373189193256964/photo/1,https://twitter.com/dog_rates/status/740373189193256964/photo/1,https://twitter.com/dog_rates/status/740373189193256964/photo/1 9 11 None None None None None 36545 14237
1120 731156023742988288 2016-05-13 16:15:54 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv https://twitter.com/dog_rates/status/731156023742988288/photo/1 204 170 this None None None None 4018 1357
1165 722974582966214656 2016-04-21 02:25:47 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a https://twitter.com/dog_rates/status/722974582966214656/photo/1 4 20 None None None None None 4300 1674
1202 716439118184652801 2016-04-03 01:36:11 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq https://twitter.com/dog_rates/status/716439118184652801/photo/1 50 50 Bluebert None None None None 2474 227
1228 713900603437621249 2016-03-27 01:29:02 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody https://t.co/mpvaVxKmc1 https://twitter.com/dog_rates/status/713900603437621249/photo/1 99 90 None None None None None 2965 790
1254 710658690886586372 2016-03-18 02:46:49 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Here's a brigade of puppers. All look very prepared for whatever happens next. 80/80 https://t.co/0eb7R1Om12 https://twitter.com/dog_rates/status/710658690886586372/photo/1 80 80 None None None None None 2423 603
1274 709198395643068416 2016-03-14 02:04:08 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> From left to right:\nCletus, Jerome, Alejandro, Burp, &amp; Titson\nNone know where camera is. 45/50 would hug all at once https://t.co/sedre1ivTK https://twitter.com/dog_rates/status/709198395643068416/photo/1 45 50 None None None None None 2525 684
1351 704054845121142784 2016-02-28 21:25:30 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Here is a whole flock of puppers. 60/50 I'll take the lot https://t.co/9dpcw6MdWa https://twitter.com/dog_rates/status/704054845121142784/photo/1 60 50 a None None None None 3062 976
1433 697463031882764288 2016-02-10 16:51:59 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Happy Wednesday here's a bucket of pups. 44/40 would pet all at once https://t.co/HppvrYuamZ https://twitter.com/dog_rates/status/697463031882764288/photo/1 44 40 None None None None None 3588 1473
1635 684222868335505415 2016-01-05 04:00:18 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Someone help the girl is being mugged. Several are distracting her while two steal her shoes. Clever puppers 121/110 https://t.co/1zfnTJLt55 https://twitter.com/dog_rates/status/684222868335505415/photo/1 121 110 None None None None None 4048 1476
1662 682962037429899265 2016-01-01 16:30:13 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Darrel. He just robbed a 7/11 and is in a high speed police chase. Was just spotted by the helicopter 10/10 https://t.co/7EsP8LmSp5 https://twitter.com/dog_rates/status/682962037429899265/photo/1 7 11 Darrel None None None None 38882 17907
1779 677716515794329600 2015-12-18 05:06:23 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> IT'S PUPPERGEDDON. Total of 144/120 ...I think https://t.co/ZanVtAtvIq https://twitter.com/dog_rates/status/677716515794329600/photo/1 144 120 None None None None None 3181 1029
1843 675853064436391936 2015-12-13 01:41:41 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Here we have an entire platoon of puppers. Total score: 88/80 would pet all at once https://t.co/y93p6FLvVw https://twitter.com/dog_rates/status/675853064436391936/photo/1,https://twitter.com/dog_rates/status/675853064436391936/photo/1 88 80 None None None None None 2782 1367
2335 666287406224695296 2015-11-16 16:11:11 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is an Albanian 3 1/2 legged Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv https://twitter.com/dog_rates/status/666287406224695296/photo/1 1 2 an None None None None 145 64
In [43]:
full_archive.loc[1068, ['rating_numerator', 'rating_denominator']] = 14, 10
full_archive.loc[1165, ['rating_numerator', 'rating_denominator']] = 13, 10
full_archive.loc[1202, ['rating_numerator', 'rating_denominator']] = 11, 10
full_archive.loc[1662, ['rating_numerator', 'rating_denominator']] = 10, 10
full_archive.loc[2335, ['rating_numerator', 'rating_denominator']] = 9, 10

Test:

In [44]:
full_archive[full_archive['rating_denominator'] != 10]
Out[44]:
tweet_id timestamp source text expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo favorite_count retweet_count
433 820690176645140481 2017-01-15 17:52:40 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd https://twitter.com/dog_rates/status/820690176645140481/photo/1,https://twitter.com/dog_rates/status/820690176645140481/photo/1,https://twitter.com/dog_rates/status/820690176645140481/photo/1 84 70 None None None None None 12998 3530
516 810984652412424192 2016-12-19 23:06:23 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t https://t.co/LouL5vdvxx https://www.gofundme.com/sams-smile,https://twitter.com/dog_rates/status/810984652412424192/photo/1 24 7 Sam None None None None 5706 1577
902 758467244762497024 2016-07-28 01:00:57 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE https://twitter.com/dog_rates/status/758467244762497024/video/1 165 150 None None None None None 5097 2417
1120 731156023742988288 2016-05-13 16:15:54 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv https://twitter.com/dog_rates/status/731156023742988288/photo/1 204 170 this None None None None 4018 1357
1228 713900603437621249 2016-03-27 01:29:02 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody https://t.co/mpvaVxKmc1 https://twitter.com/dog_rates/status/713900603437621249/photo/1 99 90 None None None None None 2965 790
1254 710658690886586372 2016-03-18 02:46:49 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Here's a brigade of puppers. All look very prepared for whatever happens next. 80/80 https://t.co/0eb7R1Om12 https://twitter.com/dog_rates/status/710658690886586372/photo/1 80 80 None None None None None 2423 603
1274 709198395643068416 2016-03-14 02:04:08 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> From left to right:\nCletus, Jerome, Alejandro, Burp, &amp; Titson\nNone know where camera is. 45/50 would hug all at once https://t.co/sedre1ivTK https://twitter.com/dog_rates/status/709198395643068416/photo/1 45 50 None None None None None 2525 684
1351 704054845121142784 2016-02-28 21:25:30 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Here is a whole flock of puppers. 60/50 I'll take the lot https://t.co/9dpcw6MdWa https://twitter.com/dog_rates/status/704054845121142784/photo/1 60 50 a None None None None 3062 976
1433 697463031882764288 2016-02-10 16:51:59 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Happy Wednesday here's a bucket of pups. 44/40 would pet all at once https://t.co/HppvrYuamZ https://twitter.com/dog_rates/status/697463031882764288/photo/1 44 40 None None None None None 3588 1473
1635 684222868335505415 2016-01-05 04:00:18 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Someone help the girl is being mugged. Several are distracting her while two steal her shoes. Clever puppers 121/110 https://t.co/1zfnTJLt55 https://twitter.com/dog_rates/status/684222868335505415/photo/1 121 110 None None None None None 4048 1476
1779 677716515794329600 2015-12-18 05:06:23 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> IT'S PUPPERGEDDON. Total of 144/120 ...I think https://t.co/ZanVtAtvIq https://twitter.com/dog_rates/status/677716515794329600/photo/1 144 120 None None None None None 3181 1029
1843 675853064436391936 2015-12-13 01:41:41 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Here we have an entire platoon of puppers. Total score: 88/80 would pet all at once https://t.co/y93p6FLvVw https://twitter.com/dog_rates/status/675853064436391936/photo/1,https://twitter.com/dog_rates/status/675853064436391936/photo/1 88 80 None None None None None 2782 1367

Define:

  • Go through invalid names and change to correct name, if one exists, or None
In [45]:
def name_replace(df):
    '''Takes a dataframe. Finds dogs with words ('a', 'quite') as names, who are named in the text, 
    and corrects the name field accordingly. Returns the corrected dataframe.'''
    index = df[(df['name'].str.islower()) & (df['text'].str.contains('named'))].index.values
    for i in index:
        new_name = df.loc[i, 'text'].split('named ')[1].split(' ')[0]
        if new_name.endswith('.'):
            new_name = new_name[:-1]
        df.loc[i, 'name'] = new_name
    return df
In [46]:
full_archive = name_replace(full_archive)
In [47]:
full_archive[(full_archive['name'].str.islower()) & (full_archive['text'].str.contains('name'))] # Two orphan cases to fix manually; the rest can be changed to None.
Out[47]:
tweet_id timestamp source text expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo favorite_count retweet_count
852 765395769549590528 2016-08-16 03:52:26 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is my dog. Her name is Zoey. She knows I've been rating other dogs. She's not happy. 13/10 no bias at all https://t.co/ep1NkYoiwB https://twitter.com/dog_rates/status/765395769549590528/photo/1 13 10 my None None None None 28069 3687
2287 667177989038297088 2015-11-19 03:10:02 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is a Dasani Kingfisher from Maine. His name is Daryl. Daryl doesn't like being swallowed by a panda. 8/10 https://t.co/jpaeu6LNmW https://twitter.com/dog_rates/status/667177989038297088/photo/1 8 10 a None None None None 191 55
In [48]:
full_archive.loc[852, 'name'] = 'Zoey'
full_archive.loc[2287, 'name'] = 'Daryl'
In [49]:
full_archive.loc[full_archive['name'].str.islower(), 'name'] = None

Test:

In [50]:
full_archive.name.value_counts()
Out[50]:
None        574
Lucy        11 
Cooper      10 
Oliver      10 
Charlie     10 
Penny       9  
Tucker      9  
Winston     8  
Sadie       8  
Lola        8  
Daisy       7  
Toby        7  
Bo          6  
Bailey      6  
Koda        6  
Stanley     6  
Jax         6  
Bella       6  
Oscar       6  
Rusty       5  
Louis       5  
Chester     5  
Milo        5  
Scout       5  
Dave        5  
Buddy       5  
Bentley     5  
Leo         5  
Clarence    4  
Oakley      4  
           ..  
Tanner      1  
Jangle      1  
Cherokee    1  
Brutus      1  
Billy       1  
Mike        1  
Maks        1  
Laika       1  
Jeffrie     1  
Kramer      1  
Pavlov      1  
Rey         1  
Duddles     1  
Josep       1  
Siba        1  
Tebow       1  
Chuk        1  
Brandy      1  
Doobert     1  
Angel       1  
Ralf        1  
Stella      1  
Dook        1  
Jazzy       1  
Hemry       1  
Kaiya       1  
Lance       1  
Leonidas    1  
Tedrick     1  
Mairi       1  
Name: name, Length: 948, dtype: int64

Define:

  • Fix inconsistent cases in predictions to all lowercase
In [51]:
predictions_clean['p1'] = predictions_clean['p1'].str.lower()
predictions_clean['p2'] = predictions_clean['p2'].str.lower()
predictions_clean['p3'] = predictions_clean['p3'].str.lower()

Test:

In [52]:
predictions_clean.head()
Out[52]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 welsh_springer_spaniel 0.465074 True collie 0.156665 True shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 german_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True rottweiler 0.243682 True doberman 0.154629 True

Define:

  • Append extended_urls from archive to predictions and reassess duplicates

Code:

In [53]:
urls_only = full_archive[['tweet_id', 'expanded_urls']]
predictions_clean = pd.merge(predictions_clean, urls_only, how='left', left_on='tweet_id', right_on='tweet_id')
In [54]:
predictions_w_url = predictions_clean[predictions_clean['expanded_urls'].notnull()]

Test:

In [55]:
predictions_w_url[predictions_w_url['jpg_url'].duplicated()]
Out[55]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog expanded_urls
In [56]:
predictions_w_url.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1969 entries, 0 to 2074
Data columns (total 13 columns):
tweet_id         1969 non-null int64
jpg_url          1969 non-null object
img_num          1969 non-null int64
p1               1969 non-null object
p1_conf          1969 non-null float64
p1_dog           1969 non-null bool
p2               1969 non-null object
p2_conf          1969 non-null float64
p2_dog           1969 non-null bool
p3               1969 non-null object
p3_conf          1969 non-null float64
p3_dog           1969 non-null bool
expanded_urls    1969 non-null object
dtypes: bool(3), float64(3), int64(2), object(5)
memory usage: 175.0+ KB
In [57]:
predictions_clean[~predictions_clean['jpg_url'].duplicated()].sample(25) # Trying to figure out what the non-duplicate pictures with NaN URLs are. A lot appear to not be from dogrates. 
Out[57]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog expanded_urls
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True https://twitter.com/dog_rates/status/666044226329800704/photo/1
1819 834209720923721728 https://pbs.twimg.com/media/C5O1UAaWIAAMBMd.jpg 1 golden_retriever 0.754799 True pekinese 0.197861 True labrador_retriever 0.008654 True https://twitter.com/dog_rates/status/834209720923721728/photo/1,https://twitter.com/dog_rates/status/834209720923721728/photo/1
1923 857029823797047296 https://pbs.twimg.com/media/C-TIEwMW0AEjb55.jpg 2 golden_retriever 0.968623 True labrador_retriever 0.010325 True saluki 0.004148 True https://twitter.com/dog_rates/status/857029823797047296/photo/1,https://twitter.com/dog_rates/status/857029823797047296/photo/1
751 688064179421470721 https://pbs.twimg.com/media/CYx-tGaUoAAEXV8.jpg 1 eskimo_dog 0.240602 True norwegian_elkhound 0.180369 True siberian_husky 0.090739 True https://twitter.com/dog_rates/status/688064179421470721/photo/1
1280 750132105863102464 https://pbs.twimg.com/media/CmkBKuwWgAAamOI.jpg 1 toy_poodle 0.478018 True miniature_poodle 0.207458 True croquet_ball 0.085879 False https://twitter.com/dog_rates/status/750132105863102464/photo/1
256 670764103623966721 https://pbs.twimg.com/media/CU8IY0pWIAA2AJ-.jpg 1 norfolk_terrier 0.172850 True golden_retriever 0.072702 True television 0.037494 False https://twitter.com/dog_rates/status/670764103623966721/photo/1
1506 785533386513321988 https://pbs.twimg.com/media/CubGchjXEAA6gpw.jpg 2 miniature_pinscher 0.436023 True black-and-tan_coonhound 0.258049 True rottweiler 0.145231 True https://twitter.com/dog_rates/status/785533386513321988/photo/1,https://twitter.com/dog_rates/status/785533386513321988/photo/1
433 674318007229923329 https://pbs.twimg.com/media/CVuopr8WwAExw_T.jpg 1 porcupine 0.846628 False hamster 0.038136 False echidna 0.014680 False https://twitter.com/dog_rates/status/674318007229923329/photo/1
1705 817423860136083457 https://pbs.twimg.com/ext_tw_video_thumb/817423809049493505/pu/img/5OFW0yueFu9oTUiQ.jpg 1 ice_bear 0.336200 False samoyed 0.201358 True eskimo_dog 0.186789 True https://twitter.com/dog_rates/status/817423860136083457/video/1
1904 852189679701164033 https://pbs.twimg.com/media/C9OV99SXsAEmj1U.jpg 1 barrow 0.423150 False bernese_mountain_dog 0.415374 True entlebucher 0.067345 True https://twitter.com/dog_rates/status/852189679701164033/photo/1
945 704480331685040129 https://pbs.twimg.com/media/CcbRIAgXIAQaKHQ.jpg 1 samoyed 0.979206 True pomeranian 0.007185 True arctic_fox 0.006438 False https://twitter.com/dog_rates/status/704480331685040129/photo/1
708 685198997565345792 https://pbs.twimg.com/media/CYJQxvJW8AAkkws.jpg 1 dishwasher 0.888829 False stove 0.013412 False old_english_sheepdog 0.009671 True https://twitter.com/dog_rates/status/685198997565345792/photo/1
867 697575480820686848 https://pbs.twimg.com/media/Ca5JMvMUsAAGMll.jpg 1 siamese_cat 0.256698 False whippet 0.119805 True bull_mastiff 0.102595 True https://twitter.com/dog_rates/status/697575480820686848/photo/1
2050 887343217045368832 https://pbs.twimg.com/ext_tw_video_thumb/887343120832229379/pu/img/6HSuFrW1lzI_9Mht.jpg 1 mexican_hairless 0.330741 True sea_lion 0.275645 False weimaraner 0.134203 True https://twitter.com/dog_rates/status/887343217045368832/video/1
2045 886366144734445568 https://pbs.twimg.com/media/DE0BTnQUwAApKEH.jpg 1 french_bulldog 0.999201 True chihuahua 0.000361 True boston_bull 0.000076 True https://twitter.com/dog_rates/status/886366144734445568/photo/1,https://twitter.com/dog_rates/status/886366144734445568/photo/1
618 680221482581123072 https://pbs.twimg.com/media/CXCh0QZW8AALdXm.jpg 1 bubble 0.240173 False hen 0.146222 False abaya 0.139342 False https://twitter.com/dog_rates/status/680221482581123072/photo/1
1555 793195938047070209 https://pbs.twimg.com/media/CwH_foYWgAEvTyI.jpg 2 labrador_retriever 0.654762 True golden_retriever 0.074100 True chihuahua 0.042339 True https://twitter.com/dog_rates/status/793195938047070209/photo/1,https://twitter.com/dog_rates/status/793195938047070209/photo/1
810 692187005137076224 https://pbs.twimg.com/media/CZskaEIWIAUeTr5.jpg 2 siberian_husky 0.810592 True malamute 0.119745 True eskimo_dog 0.029265 True https://twitter.com/dog_rates/status/692187005137076224/photo/1,https://twitter.com/dog_rates/status/692187005137076224/photo/1,https://twitter.com/dog_rates/status/692187005137076224/photo/1
907 700462010979500032 https://pbs.twimg.com/media/CbiKe7-W0AIVNNr.jpg 1 hamster 0.678651 False pomeranian 0.110268 True angora 0.104139 False https://twitter.com/dog_rates/status/700462010979500032/photo/1
869 697616773278015490 https://pbs.twimg.com/media/Ca5uv7RVAAA_QEg.jpg 1 lhasa 0.521931 True shih-tzu 0.403451 True tibetan_terrier 0.039912 True https://twitter.com/dog_rates/status/697616773278015490/photo/1,https://twitter.com/dog_rates/status/697616773278015490/photo/1
1233 746131877086527488 https://pbs.twimg.com/media/ClrK-rGWAAENcAa.jpg 1 chow 0.575637 True pomeranian 0.195950 True norwich_terrier 0.141224 True https://twitter.com/dog_rates/status/746131877086527488/photo/1
1729 820749716845686786 https://pbs.twimg.com/media/C2PjgjQXcAAc4Uu.jpg 2 golden_retriever 0.838012 True pekinese 0.056733 True labrador_retriever 0.023944 True https://twitter.com/dog_rates/status/820749716845686786/photo/1,https://twitter.com/dog_rates/status/820749716845686786/photo/1
1266 749395845976588288 https://pbs.twimg.com/media/CmZjizYW8AA3FCN.jpg 1 pomeranian 0.973715 True chow 0.020758 True keeshond 0.003784 True https://twitter.com/dog_rates/status/749395845976588288/photo/1,https://twitter.com/dog_rates/status/749395845976588288/photo/1
1303 753026973505581056 https://pbs.twimg.com/media/CnNKCKKWEAASCMI.jpg 3 pembroke 0.868511 True cardigan 0.103708 True shetland_sheepdog 0.018142 True https://twitter.com/dog_rates/status/753026973505581056/photo/1,https://twitter.com/dog_rates/status/753026973505581056/photo/1,https://twitter.com/dog_rates/status/753026973505581056/photo/1,https://twitter.com/dog_rates/status/753026973505581056/photo/1
985 707610948723478529 https://pbs.twimg.com/media/CdHwZd0VIAA4792.jpg 1 golden_retriever 0.383223 True cocker_spaniel 0.165930 True chesapeake_bay_retriever 0.118199 True https://twitter.com/dog_rates/status/707610948723478529/photo/1

Define:

  • Reassign predictions_w_url to predictions_clean and drop url column (for tidiness)

Code:

In [58]:
predictions_clean = predictions_w_url.copy().drop('expanded_urls', axis=1)

Test:

In [59]:
predictions_clean[predictions_clean['jpg_url'].duplicated()]
Out[59]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog

Define:

  • Create new table to separate dog info from tweet info. There will be three tables in the end: tweets, dogs, and predictions.

Code:

In [60]:
tweets = full_archive[['tweet_id', 'timestamp', 'text', 'expanded_urls', 'favorite_count', 'retweet_count']]
dogs = full_archive[['tweet_id', 'name', 'rating_numerator', 'rating_denominator', 'doggo', 'floofer', 'pupper', 'puppo']]

Test:

In [61]:
tweets.head()
Out[61]:
tweet_id timestamp text expanded_urls favorite_count retweet_count
0 892420643555336193 2017-08-01 16:23:56 This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU https://twitter.com/dog_rates/status/892420643555336193/photo/1 38145 8346
1 892177421306343426 2017-08-01 00:17:27 This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV https://twitter.com/dog_rates/status/892177421306343426/photo/1 32716 6168
2 891815181378084864 2017-07-31 00:18:03 This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB https://twitter.com/dog_rates/status/891815181378084864/photo/1 24632 4082
3 891689557279858688 2017-07-30 15:58:51 This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ https://twitter.com/dog_rates/status/891689557279858688/photo/1 41490 8488
4 891327558926688256 2017-07-29 16:00:24 This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1 39655 9181
In [62]:
dogs.head()
Out[62]:
tweet_id name rating_numerator rating_denominator doggo floofer pupper puppo
0 892420643555336193 Phineas 13 10 None None None None
1 892177421306343426 Tilly 13 10 None None None None
2 891815181378084864 Archie 12 10 None None None None
3 891689557279858688 Darla 13 10 None None None None
4 891327558926688256 Franklin 12 10 None None None None

Define:

  • Melt doggo/floofer/pupper/puppo columns. Drop variable column, leaving column of labels and None values.
In [63]:
def stage_sum(df):
    '''
    Takes in a dataframe, turns all string-literal 'None's to 0s, creates a new dummy column called 'multiple', 
    then calculates the sum of all four stage columns to assign values to 'multiple' based on their boolean sum.
    Returns the new dataframe. Please feel free to suggest a better fix than this. I can't think of one that doesn't
    require some kind of None or Boolean, and replace() doesn't actually work with the null None.
    '''
    
    df = df.replace('None', 0)
    
    df['multiple'] = 0
    
    for i in df.index:
        doggo = bool(df.loc[i, 'doggo'])
        floofer = bool(df.loc[i, 'floofer'])
        pupper = bool(df.loc[i, 'pupper'])
        puppo = bool(df.loc[i, 'puppo'])
        stage_sum = doggo + floofer + pupper + puppo
        if stage_sum > 1:
            df.loc[i, 'multiple'] = 'multiple'
    
    df = df.replace(0, 'None') # If anyone has advice on how to avoid this double replacement call in particular, let me know!
    
    return df
            
dogs = stage_sum(dogs)
In [64]:
dogs = pd.melt(dogs, id_vars = ['tweet_id', 'name', 'rating_numerator', 'rating_denominator'], var_name="variable", value_name="stage").drop('variable', axis=1)
In [65]:
dogs['stage'] = dogs['stage'].astype('category')
dogs['stage'] = dogs['stage'].cat.set_categories(['None', 'doggo', 'floofer', 'pupper', 'puppo', 'multiple'], ordered=True)
dogs = dogs.sort_values(by='stage')
In [66]:
dogs = dogs.drop_duplicates(subset='tweet_id', keep='last') # Needed to prevent each tweet/dog having 2-4 redundant rows!

Test:

In [67]:
dogs.stage.value_counts()
Out[67]:
None        1735
pupper      219 
doggo       69  
puppo       23  
multiple    11  
floofer     9   
Name: stage, dtype: int64

Final results:

In [68]:
tweets.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2066 entries, 0 to 2355
Data columns (total 6 columns):
tweet_id          2066 non-null int64
timestamp         2066 non-null datetime64[ns]
text              2066 non-null object
expanded_urls     2063 non-null object
favorite_count    2066 non-null int64
retweet_count     2066 non-null int64
dtypes: datetime64[ns](1), int64(3), object(2)
memory usage: 193.0+ KB
In [69]:
dogs.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2066 entries, 4209 to 8803
Data columns (total 5 columns):
tweet_id              2066 non-null int64
name                  1985 non-null object
rating_numerator      2066 non-null int64
rating_denominator    2066 non-null int64
stage                 2066 non-null category
dtypes: category(1), int64(3), object(1)
memory usage: 82.9+ KB
In [70]:
predictions_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1969 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    1969 non-null int64
jpg_url     1969 non-null object
img_num     1969 non-null int64
p1          1969 non-null object
p1_conf     1969 non-null float64
p1_dog      1969 non-null bool
p2          1969 non-null object
p2_conf     1969 non-null float64
p2_dog      1969 non-null bool
p3          1969 non-null object
p3_conf     1969 non-null float64
p3_dog      1969 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 159.6+ KB
In [71]:
tweets.to_csv('twitter_archive_master.csv')
In [72]:
dogs.to_csv('dogs_archive_master.csv')
In [73]:
predictions_clean.to_csv('predictions_archive_master.csv')

Part 2: Analysis and Visualization

After cleaning all this data, I can think of a few things I could look at for analysis: retweet count and favorite count over time; dog rating by "type"; and predictions by breed all seem interesting.

Predictions by breed:

In [74]:
predictions_clean.describe() # The highest confidence levels are mostly found in p1, so I'll focus on that.
Out[74]:
tweet_id img_num p1_conf p2_conf p3_conf
count 1.969000e+03 1969.000000 1969.000000 1.969000e+03 1.969000e+03
mean 7.359824e+17 1.201117 0.594365 1.345765e-01 6.019426e-02
std 6.752421e+16 0.557815 0.272135 1.010184e-01 5.095331e-02
min 6.660209e+17 1.000000 0.044333 1.011300e-08 1.740170e-10
25% 6.758531e+17 1.000000 0.362925 5.351500e-02 1.610520e-02
50% 7.088109e+17 1.000000 0.587507 1.173970e-01 4.944380e-02
75% 7.880396e+17 1.000000 0.847292 1.955580e-01 9.157150e-02
max 8.924206e+17 4.000000 1.000000 4.880140e-01 2.734190e-01
In [75]:
predictions_clean.p1.value_counts()
predictions_clean[predictions_clean['p1_conf'] > 0.75].p1.value_counts()
Out[75]:
golden_retriever                  78
pembroke                          47
labrador_retriever                40
pug                               34
chihuahua                         25
pomeranian                        22
samoyed                           22
chow                              16
french_bulldog                    14
toy_poodle                        13
web_site                          11
malamute                          10
german_shepherd                   9 
maltese_dog                       8 
seat_belt                         8 
shetland_sheepdog                 8 
chesapeake_bay_retriever          7 
bernese_mountain_dog              7 
miniature_pinscher                6 
blenheim_spaniel                  6 
brittany_spaniel                  6 
great_pyrenees                    5 
rottweiler                        5 
pekinese                          5 
doberman                          5 
basset                            5 
cocker_spaniel                    5 
american_staffordshire_terrier    5 
dalmatian                         5 
tennis_ball                       5 
                                 .. 
mortarboard                       1 
platypus                          1 
wild_boar                         1 
papillon                          1 
space_heater                      1 
crash_helmet                      1 
briard                            1 
leopard                           1 
brown_bear                        1 
mousetrap                         1 
english_setter                    1 
slug                              1 
kelpie                            1 
goose                             1 
washer                            1 
scorpion                          1 
cash_machine                      1 
school_bus                        1 
afghan_hound                      1 
gordon_setter                     1 
prison                            1 
ibizan_hound                      1 
studio_couch                      1 
clumber                           1 
malinois                          1 
hog                               1 
leonberg                          1 
rhodesian_ridgeback               1 
box_turtle                        1 
conch                             1 
Name: p1, Length: 179, dtype: int64
In [76]:
137/1969 # Ratio of golden retrievers to all items in the full p1 column
Out[76]:
0.06957846622651091
In [77]:
78/690 # Ratio of golden retrievers to all items where p1_conf > .75 -- definitely higher
Out[77]:
0.11304347826086956
In [78]:
94/1969 # Ratio of labrador retrievers to all items in the full p1 column
Out[78]:
0.047739969527679026
In [79]:
40/690 # Ratio of labrador retrievers to all items where p1_conf > .75 -- slightly higher
Out[79]:
0.057971014492753624
In [80]:
88/1969 # Ratio of pembrokes to all items in the full p1 column
Out[80]:
0.0446927374301676
In [81]:
47/690 # Ratio of pembrokes to all items where p1_conf > .75 -- also slightly higher
Out[81]:
0.06811594202898551
In [82]:
# I also want to know about the ratio of non-dogs to dogs at high confidence levels:

predictions_clean.p1_dog.value_counts()
predictions_clean[predictions_clean['p1_conf'] > 0.75].p1_dog.value_counts()
Out[82]:
True     538
False    152
Name: p1_dog, dtype: int64
In [83]:
507/1969 # Ratio of non-dogs to all items in the full p1 column
Out[83]:
0.2574911122397156
In [84]:
152/690 # Ratio of non-dogs to all items where p1_conf > .75 -- I thought it would be higher!
Out[84]:
0.22028985507246376

What can I take away from this? Assuming confidence as a proxy for accuracy, it seems as though the neural network is especially good at predicting golden retrievers, fairly good at other common breeds, and not as good as I expected at picking out things that aren't dogs.

Dog rating by "stage":

In [85]:
dogs.groupby('stage').rating_numerator.mean()
Out[85]:
stage
None        12.380980
doggo       11.826087
floofer     11.888889
pupper      10.794521
puppo       12.000000
multiple    11.272727
Name: rating_numerator, dtype: float64

This one-line function tells us that of the dogs who have "stages," puppos tend to rate the highest -- while, oddly enough, puppers rate the lowest!

Retweet/favorite counts over time:

In [86]:
tweets.head()
Out[86]:
tweet_id timestamp text expanded_urls favorite_count retweet_count
0 892420643555336193 2017-08-01 16:23:56 This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU https://twitter.com/dog_rates/status/892420643555336193/photo/1 38145 8346
1 892177421306343426 2017-08-01 00:17:27 This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV https://twitter.com/dog_rates/status/892177421306343426/photo/1 32716 6168
2 891815181378084864 2017-07-31 00:18:03 This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB https://twitter.com/dog_rates/status/891815181378084864/photo/1 24632 4082
3 891689557279858688 2017-07-30 15:58:51 This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ https://twitter.com/dog_rates/status/891689557279858688/photo/1 41490 8488
4 891327558926688256 2017-07-29 16:00:24 This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1 39655 9181
In [87]:
tweets_by_time = tweets[['timestamp', 'favorite_count', 'retweet_count']]
tweets_by_time = tweets_by_time.set_index('timestamp')
In [88]:
tweets_by_time.head()
Out[88]:
favorite_count retweet_count
timestamp
2017-08-01 16:23:56 38145 8346
2017-08-01 00:17:27 32716 6168
2017-07-31 00:18:03 24632 4082
2017-07-30 15:58:51 41490 8488
2017-07-29 16:00:24 39655 9181
In [89]:
tweets_by_year = tweets_by_time.resample('A').median()
tweets_by_year
Out[89]:
favorite_count retweet_count
timestamp
2015-12-31 1214.5 477.0
2016-12-31 4742.5 1513.0
2017-12-31 20153.5 4101.5
In [90]:
tweets_by_quarter = tweets_by_time.resample('Q').median()
tweets_by_quarter
Out[90]:
favorite_count retweet_count
timestamp
2015-12-31 1214.5 477.0
2016-03-31 2946.0 976.0
2016-06-30 4590.5 1475.0
2016-09-30 6794.0 2094.0
2016-12-31 10831.0 3048.0
2017-03-31 14351.0 3236.0
2017-06-30 22373.0 4614.5
2017-09-30 27541.0 5289.0
In [91]:
tweets_by_month = tweets_by_time.resample('MS').median()
tweets_by_month
Out[91]:
favorite_count retweet_count
timestamp
2015-11-01 545.0 228.0
2015-12-01 1800.0 684.0
2016-01-01 2773.0 922.0
2016-02-01 2805.0 976.0
2016-03-01 3266.0 1021.5
2016-04-01 4191.0 1351.5
2016-05-01 4177.0 1300.0
2016-06-01 5722.0 1883.0
2016-07-01 5711.5 1769.5
2016-08-01 6704.0 1925.0
2016-09-01 9400.5 2949.0
2016-10-01 10629.0 3146.0
2016-11-01 10853.0 2918.0
2016-12-01 10831.0 2864.0
2017-01-01 12992.0 2876.0
2017-02-01 14126.0 3029.0
2017-03-01 18913.0 3925.0
2017-04-01 20857.5 4866.5
2017-05-01 22258.0 5110.0
2017-06-01 23479.0 4149.0
2017-07-01 27490.0 5278.0
2017-08-01 35430.5 7257.0
In [92]:
tweets_by_week = tweets_by_time.resample('W').median()
tweets_by_week
Out[92]:
favorite_count retweet_count
timestamp
2015-11-15 128.0 47.0
2015-11-22 405.0 175.0
2015-11-29 695.0 287.0
2015-12-06 1208.5 525.0
2015-12-13 1474.0 524.0
2015-12-20 2264.0 846.0
2015-12-27 2775.0 1018.0
2016-01-03 3340.0 1113.0
2016-01-10 3166.0 1157.0
2016-01-17 2451.5 771.0
2016-01-24 2562.0 845.0
2016-01-31 2802.5 875.0
2016-02-07 3044.5 997.0
2016-02-14 2654.5 985.5
2016-02-21 2758.0 770.0
2016-02-28 3359.0 1060.0
2016-03-06 3123.0 984.0
2016-03-13 2888.0 947.5
2016-03-20 3087.0 953.5
2016-03-27 4428.5 1335.0
2016-04-03 3913.0 1232.0
2016-04-10 3535.5 1122.0
2016-04-17 4316.0 1334.0
2016-04-24 4380.0 1678.0
2016-05-01 4360.0 1449.0
2016-05-08 3946.5 1209.0
2016-05-15 4200.0 1339.5
2016-05-22 3858.0 1046.0
2016-05-29 4204.0 1341.0
2016-06-05 5530.0 1771.0
... ... ...
2017-01-15 12998.0 3433.0
2017-01-22 13772.5 2898.0
2017-01-29 13685.5 3396.0
2017-02-05 13261.0 2811.0
2017-02-12 25918.0 6365.0
2017-02-19 12586.0 2484.0
2017-02-26 16940.5 3348.0
2017-03-05 17672.0 3287.0
2017-03-12 14071.0 2462.0
2017-03-19 17631.0 3925.0
2017-03-26 19762.5 4176.5
2017-04-02 21247.0 4555.0
2017-04-09 21138.0 4565.0
2017-04-16 16822.0 3731.0
2017-04-23 29266.0 6519.0
2017-04-30 17864.0 3554.5
2017-05-07 17886.0 4026.0
2017-05-14 19970.0 5102.5
2017-05-21 27284.5 6715.5
2017-05-28 26719.0 5393.0
2017-06-04 22472.0 4149.0
2017-06-11 27038.0 5364.0
2017-06-18 24981.0 4418.5
2017-06-25 22305.5 4360.5
2017-07-02 23066.0 4149.0
2017-07-09 27669.0 4863.0
2017-07-16 26563.0 5548.0
2017-07-23 29886.0 5573.5
2017-07-30 27357.0 5278.0
2017-08-06 32716.0 6168.0

91 rows × 2 columns

In [93]:
tweets_by_day = tweets_by_time.resample('D').median()
tweets_by_day
Out[93]:
favorite_count retweet_count
timestamp
2015-11-15 128.0 47.0
2015-11-16 185.0 84.0
2015-11-17 232.0 82.0
2015-11-18 334.0 128.0
2015-11-19 453.0 221.0
2015-11-20 481.5 246.5
2015-11-21 576.0 294.0
2015-11-22 494.5 227.0
2015-11-23 582.0 308.0
2015-11-24 720.0 338.0
2015-11-25 553.0 203.5
2015-11-26 372.5 111.0
2015-11-27 710.0 277.0
2015-11-28 812.0 377.0
2015-11-29 840.0 291.0
2015-11-30 890.0 301.0
2015-12-01 1208.5 561.0
2015-12-02 1294.5 553.0
2015-12-03 1273.0 590.0
2015-12-04 1190.0 470.0
2015-12-05 1447.0 650.0
2015-12-06 1357.0 569.0
2015-12-07 1284.5 433.5
2015-12-08 1450.0 519.5
2015-12-09 1203.5 486.0
2015-12-10 1299.0 490.5
2015-12-11 2008.0 818.0
2015-12-12 1580.0 499.0
2015-12-13 2367.0 1013.0
2015-12-14 1875.5 606.5
... ... ...
2017-07-03 36876.5 6916.5
2017-07-04 36588.0 8147.0
2017-07-05 27481.0 5988.0
2017-07-06 25721.0 4347.5
2017-07-07 29454.5 5096.5
2017-07-08 45191.0 9774.0
2017-07-09 20791.5 3161.5
2017-07-10 26563.0 5559.0
2017-07-11 25706.0 5097.0
2017-07-12 48912.0 11134.5
2017-07-13 35412.0 6319.0
2017-07-14 32131.0 6672.0
2017-07-15 24197.5 4659.0
2017-07-16 16971.0 3802.5
2017-07-17 34597.0 7630.0
2017-07-18 31614.0 8038.0
2017-07-19 45549.0 11463.0
2017-07-20 21416.0 3430.0
2017-07-21 NaN NaN
2017-07-22 22328.0 3853.0
2017-07-23 28630.0 4412.0
2017-07-24 19855.5 3742.0
2017-07-25 27346.0 4884.0
2017-07-26 30788.0 7225.0
2017-07-27 27357.0 4196.0
2017-07-28 37998.5 10275.0
2017-07-29 29782.0 6119.5
2017-07-30 41490.0 8488.0
2017-07-31 24632.0 4082.0
2017-08-01 35430.5 7257.0

626 rows × 2 columns

The above code shows the median values of favorite_count and retweet_count by year, quarter, month, week, and day. With this, we can see how these counts build up over time. The day-to-day counts are inconsistent, but start to show more consistent growth aggregated by week. Month, quarter, and year all show very consistent growth.

We can also see that favorites don't just happen more than retweets, their rate also grows faster. For example, looking at the quarterly aggregate, there are just under 3 times as many favorites as retweets in the first quarter, but over 5 times as many in the last quarter. We can visualize this data below:

In [94]:
tweets_by_year.plot();
In [95]:
tweets_by_quarter.plot();
In [96]:
tweets_by_month.plot();
In [97]:
tweets_by_week.plot();
In [98]:
tweets_by_day.plot();

Our tweets_by_week plot is (arguably) probably the most useful of these, so I'll use this one and pretty it up a bit.

In [99]:
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
plt.plot(tweets_by_week.index, tweets_by_week.favorite_count, linewidth=2.0, label='Favorites');
plt.plot(tweets_by_week.index, tweets_by_week.retweet_count, linewidth=2.0, label='Retweets');
ax.spines['top'].set_visible(False); # Getting rid of top spine
ax.spines['right'].set_visible(False); # Getting rid of right spine
ax.set_xlim('2015-11-15', '2017-08-15'); # Getting rid of whitespace
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.tick_params(axis='both', which='both', bottom=False, top=False, 
                labelbottom=True, left=False, right=False, labelleft=True); # Hiding unnecessary tick marks
plt.ylabel('Tweets', fontsize=16);
plt.title('Favorites and Retweets by Week', fontsize=22);
plt.legend();
plt.grid(True, alpha=0.25); # Showing a light gray grid to help visual alignment
plt.savefig('tweets_by_week.png', bbox_inches='tight')
plt.show()
In [ ]: