Tuesday 10 April 2012

I Thought I Saw A Twittering TweetDeck Monkey


Girl, Unallocated recently asked why I wasn't on Twitter. I'll admit that when Twitter first came out I was pretty skeptical - "Who would want to read about my over-ripe bananas?" and "140 characters isn't nearly enough to insult someone (creatively)". Not having a Twitter capable mobile device was just another nail in the coffin. However, after GU's recent questioning, I've since discovered that a lot of forensicky update notifications occur on Twitter and it's also a great way to stalk meet new people (I am obviously joking ... I don't really care about forensicky updates ;).
Anyhoo, I started looking into a couple of PC Twitter clients called TweetDeck and MetroTwit. A recent Youtube demo video showed MetroTwit using 600 Mb RAM just in normal ops. When you consider that the ancient WinXP PC I'm using has 2 GB RAM and I also like to run SIFT on VMWare, that made the decision pretty easy. Hello, TweetDeck! *I also noticed that Corey Harrell has used it - so I know at least one other person might find this post useful/semi-interesting ;)

I did a brief check after installation, and TweetDeck "only" uses ~100 Mb. A bit steep for 140 character messages, but methinks my trusty old PC can live with that. Also, as TweetDeck has been bought by Twitter, you kind of hope they have their functionality and security all sorted. Except ... recently there was an issue with TweetDeck granting a user access to other people's TweetDeck/Facebook profiles. This user wasn't actively trying to subvert the system, he just opened up TweetDeck and there it all was. So TweetDeck shut down for a while and they (presumably) fixed the issue(s). Not particularly inspiring stuff eh?

Which brings me to the topic of this post - I was initially interested to see what artifacts (if any) TweetDeck leaves behind. And it was while monkeying around with TweetDeck that I discovered that the Twitter web search functions also return some interesting information.

TweetDeck v1.3 (on WinXP)

The first thing TweetDeck wants you to do after installing, is register an email address and password for the TweetDeck interface. Chances are, most people will use the same login/password as their Twitter account. Tsk, tsk.

After launching TweetDeck and adding my Twitter profile, I ran SysInternals TCPView and found that TweetDeck connects via TCP over port 443 (HTTPS) to both "twitter.com" and "amazon.com" domains. Presumably, Twitter has outsourced their storage of tweets/images to Amazon's cloud storage. These connections seemed to timeout if nothing is browsed.
I wasn't going to go hardcore and network capture the traffic because a shortish time ago, in a University not so far away, Mike Wilkinson told my Network Forensics class that intercepting live network traffic is best left to the cats with warrants (in this part of The Empire anyway).

So then I ran SysInternals ProcMon (v2.96) and tried to see if TweetDeck was reading/writing any account info to the Windows Registry ... No such luck! Either I missed the entries (not uncommon) or they just weren't there. I did however, notice that the following file got accessed quite a lot:

C:\Documents and Settings\USER\Local Settings\Application Data\twitter\TweetDeck\localStorage\qrc__0.localstorage

I have obviously changed the profile name to protect the guilty ;)

So firing up our trusty WinHex hex editor revealed the following file header snippet in "qrc__0.localstorage":

SQLite3 DB

Look familiar? Those sneaky TweetDeck Developers!
So I copied the "qrc__0.localstorage" over to SANS SIFT to play around with it a little more (using the SQLite Manager Firefox Plugin) and here's what I found:
- The file contains one SQLite table called "ItemTable"
- "ItemTable" consists of a "rowid", "key" and "value" column fields
- "rowid" seems to increase with use - presumably as key/value fields are updated

Most of the "value" fields appear to be JSON encoded. More information about JSON is available here. But the 1 minute summary is:

- JSON is an acronym for "JavaScript Object Notation"
- JSON is a text based encoding system based on name/value pairs
- Possible value types are: strings (delimited by ""), numbers (base 10, can be negative/decimal fractions/base 10 exponentials), objects, arrays, true, false, null
- An object is defined using { } to enclose a name separated by a colon and then the value.
eg {object_name: object_value}
- An array is defined using [ ] to enclose a series of comma separated values.
eg [array_value1, array_value2]
- It is possible to nest values. So as we will see later, you can have an array of tweet objects each of which has an array of key/value objects. This can make it difficult to sort out all the different brackets and commas.

So what are the actual key/value fields in the "ItemTable" database?

Normally, there are 7 key/value pairs in the "ItemTable" database table:
"tweetdeck_account" = An "email" object representing the TweetDeck user's login (eg "user@email.com").
"columns" = Suspect it represents user column arrangement preferences for the GUI.
"hoard" = An object with it's name derived from the "email" object value (eg "user@email.com"). Used to store a 312 character alphanumeric string which I suspect is a hash/encrypted string (or maybe a group of such strings).
"clients" = Several objects representing account information plus the last 4 TweetDeck search terms.
"__version__" = 2.
"_session" = A 44 character alphanumeric string which probably represents a session key. It seems to change with each launch.
"feeds" = Suspect it has to do with the user defined columns each having its own data connection.

Additional testing revealed that if "Keep Me Signed In" is ticked on the TweetDeck login page and then TweetDeck is closed, these artifacts remain readable in "ItemTable" (using SQLite Manager).
If TweetDeck is closed and "Keep Me Signed In" is NOT ticked, then "ItemTable" appears empty (in SQLite Manager) . However, the data is still appears to be (partially) viewable with WinHex (looking at "qrc__0.localstorage").

By chance, I copied an active/open "qrc__0.localstorage" to SIFT and noticed that there were 2 extra key/value fields in "ItemTable" when TweetDeck is running. These field names are prepended with a ":" (indicating they're in shared memory) and are called:
":CACHED_HASHTAGS"  = Had no values
":CACHED_TWITTER_USERS_V2" = Appears to be a cache of Twitter user profiles that the TweetDeck user is "following". Example format follows:
{"lowercase_handle1":["123456781", "CapitalizedHandle1", "Proper Name1", "http link to profile1 pic"], "lowercase_handle2":["123456782", "CapitalizedHandle2", "Proper Name2", "http link to profile2 pic"]}

Note: The numerical fields turn out to be Twitter user ID numbers.

I was only able to see these additional fields after copying an active "qrc__0.localstorage". The weird thing was, after closing/reopening/closing (staying logged in) TweetDeck, I was still able to see these fields in both SQLite Manager and WinHex. After I subsequently opened/logged out/closed TweetDeck, these fields were no longer visible in SQLite Manager but some of the data still appears viewable with WinHex.
Perhaps copying the active file meant the contents of shared memory were also auto-magically written to the new file copy. The subsequent opening(s) of this database file then showed the previously recorded ":" fields in the database.
And perhaps TweetDeck/SQLite doesn't delete/overwrite all the data in a table when "deleting" a table's contents (eg after logging out). Perhaps it just deletes the addressing mechanism? Hence, while you can't see the table in SQLite Manager, you can still see some data using WinHex? To quote Doris Day et al: "Perhaps, perhaps, perhaps".

OK, so we've taken a look at "ItemTable" - now what?

One of the more interesting things I noticed in "ItemTable" was that the "columns", "clients" and "feeds" fields mention a recurring multiple digit number sequence throughout each. There were also similar looking numbers present in ":CACHED_TWITTER_USERS_V2".
Suspecting this number to be a Twitter user ID number, I went to the Twitter Developer site and found this article regarding the Twitter REST API (API = Application Programming Interface). It lists lots of searches that you can perform from the comfort of your favourite browser. These search queries are run against the Twitter database via a Twitter web interface. The number of searches may be limited per hour and may not be all-inclusive.
So to confirm that the multi-digit number I was seeing in my "ItemTable" table was a valid Twitter user ID number, I typed the following into Firefox on SIFT.

https://api.twitter.com/1/users/lookup.json?user_id=XXXXXXXXX
where XXXXXXXXX = Twitter user ID number

SIFT then auto-magically launched a save dialog for the returned JSON file. I then used gedit to display it and saw a bunch of JSON text. The first key/value field was "id":XXXXXXXXX followed later by the corresponding "name" and "screen_name" values of the TweetDeck user. The JSON also showed various other profile info which is described later in the testing section below.

So, given a viable TweetDeck "qrc__0.localstorage" database file, we can derive a TweetDeck user's Twitter user ID number. We can then use the Twitter REST API with that user ID number to determine that user's handle name. Subsequently, I also ran the lookup query using the numbers found in the ":CACHED_TWITTER_USERS_V2" user profiles. These numbers were found to correspond to their surrounding handle name.


More Twitter REST API Searches

We can also use the Twitter user ID number/handle to perform other queries with the Twitter REST API:

For retrieving a Twitter user's Profile info we can use:
https://api.twitter.com/1/users/show.json?user_id=XXXXXXXXX
OR
https://api.twitter.com/1/users/show.json?screen_name=TwitterHandle
Note: the "screen_name" argument is not case sensitive
For more info see here.

To list a Twitter user's "follower" ID numbers (eg who is following user_id=XXXXXXXXX?) we can use:
https://api.twitter.com/1/followers/ids.json?user_id=XXXXXXXXX
OR
https://api.twitter.com/1/followers/ids.json?screen_name=TwitterHandle
For more info see here.

To list a Twitter user's friend ID numbers (eg who is user_id=XXXXXXXXX following?) we can use:
https://api.twitter.com/1/friends/ids.json?user_id=XXXXXXXXX
OR
https://api.twitter.com/1/friends/ids.json?screen_name=TwitterHandle
For more info see here.

Given a list of Twitter user ID numbers (eg from a friend/follower lookup), you can use a comma separated list to retrieve each Twitter user's profile info via:
https://api.twitter.com/1/users/lookup.json?user_id=XXXXXXXXX, YYYYYYYYY, ZZZZZZZZZ
For more info see here.

You can also search Twitter by term:
http://search.twitter.com/search.json?q=TwitterHandle
This will return the last weeks worth of tweets containing the given TwitterHandle (eg tweets at/from/mentioning). It might not return 100% of all possible results. For more info see here.
Alternatively, you can perform a conventional web search at:
http://twitter.com/#!/search
This will return the results within the browser. If you need to record these results, having a JSON file (via the first search method) might make documentation easier rather than a screenshot/printout for the second method.

Additional Resources (Stuff I didn't know where else to put)

This article describes how each tweet has its own unique ID number.
This website allows you to retrieve a Twitter user's handle (eg Batman) from their Twitter user ID number or vice-versa. The website also states that the numeric user ID is like a primary key (unique number) that does not change. So if a user changes their Twitter handle, they can still be found via their unique ID number.
I also found out about these similar sites (here and here) which translate from a Twitter user's handle to a Twitter user's ID number (one way only).
From my limited testing, the results from these sites appear to be consistent with each other. Given a choice however, I would trust the Twitter REST API over these webpages. All that stuff about about straight from the horses mouth etc.
Whilst doing a search for TweetDeck artifacts, I came across this since superseded article. It describes the database schema for a previous version of TweetDeck. It's interesting that the developers have since changed to a more generic (and harder to understand) schema of key/JSON value pairs. An attempt at extra security through obscurity?

Enter the Twit-Monkey!

Manually parsing through all this JSON text was making me cross(eyed) and I vaguely remember Harlan Carvey mentioning he had written a Perl script that read JSONs. Just FYI, I'm going for a name drop record in this post :)
Anyhoo, I thought it would take me as long to read through some of these search results as it would to write a Perl script to print out JSONs in a more readable format (how's that for cocky?). So then I *strutted* to CPAN and found the aptly named JSON Perl package. This package has functions to both encode/decode JSONs and best of all, it has a function to "pretty print" JSONs in a more human readable format.

To install it on SIFT I typed:
"sudo cpan JSON"
and then said yes to any dependency installs.

Now we should be ready to write the script. As usual, my code is pretty hack-tacular but it seems to work OK for each of the search types listed above.

The Code

#CODE BEGINS ON LINE BELOW
#!/usr/bin/perl -w

# Perl script to pretty print JSON files

use strict;

use Getopt::Long;
use JSON;

my $version = "json-printer.pl v2012-04-08";
my $help = 0;
my $filename="";
my @jsonlist;

GetOptions('help|h' => \$help,
            'f=s' => \$filename);

if ($help || $filename eq "")
{
    print("\nHelp for $version\n\n");
    print("Perl script to pretty print JSON files.\n");
    print("Example: json-printer.pl -f search-5.json\n");
    exit;
}

open(my $tfile, "<", $filename) or die "Can't Open $filename JSON File!";
@jsonlist = <$tfile>; # extract each line to a list element
chomp(@jsonlist); # get rid of newlines
close($tfile);

my $json_string = join(' ', @jsonlist); # join list elements into one big happy scalar string

my $json = JSON->new->allow_nonref; # create a new JSON object which converts non-references into their values for encoding
my $perl_scalar = $json->decode($json_string); # converts JSON string into Perl hash(es)

# at this point (if you wanted to) you can add code to iterate thru the hash(es) to extract/use values.

my $pretty_printed = $json->pretty->encode( $perl_scalar ); # re-encode the hash just so we can then pretty print it (hack-tacular!)
print $pretty_printed;

#CODE ENDS ON LINE ABOVE


Code Summary

This short script ("json-printer.pl") lets the JSON Perl package do all the heavy lifting. First, it reads the user specified JSON file into a scalar string ("$json_string") and then it calls "$json->decode" to convert the string into a Perl hash data structure. The script then calls "$json->pretty->encode" to convert that hash data structure into a more human readable string format. It's a round-about way of doing things but it seems to work.

Note: "$perl_scalar" is a reference to a hash data structure. If you wanted to, you could use it to iterate through the hash structure and retrieve any of the individual values.

Code Testing

You'll just have to trust me on this and/or try out the code yourself. The script above is able to read the JSON data returned by the various Twitter REST API searches listed previously.
I don't feel comfortable displaying other people's tweeting information here (eg their user id, what method they use to tweet). Its probably no more detailed than you could get by searching www.twitter.com/search but I just don't think it's my place.

Anyhoo, as some of the REST API searches mentioned previously return information about individual tweets (eg in a user profile's "status" field), I thought it would be interesting to have a look at these returned tweet fields (without divulging real data). Each tweet in a JSON appears to have the following format:

"to_user_id" = Destination Twitter user ID number (in unquoted numeric format).
"source" = Source URL which can show how tweet was posted (eg via twitter.com, tweetdeck, tweethopper etc). Some URL arguments  may also list the device (eg iPad, Android).
"profile_image_url" = HTTP URL showing the source's avatar filename.
"profile_image_url_https" = HTTPS URL showing the source's avatar filename.
"created_at" = Date and Time of tweet creation.
"text" = Tweet text including @ addressing (eg "@Superman How do you keep your tights so clean?").
"id" = Unique tweet ID number (in "" quoted string format).
"in_reply_to_status_id" = When available, shows the tweet ID number this tweet is replying to (in "" quoted string format).
"from_user_id_str" = Source's Twitter user ID number (in "" quoted string format).
"from_user" = Source's Twitter user handle (eg "Batman").
"geo" = Source's(?) location information.
"from_user_id" = Source's Twitter user ID number (in unquoted numeric format).
"id_str" = Unique tweet ID number (in "" quoted string format).
"iso_language_code" = Presumably the language used to tweet (eg "en" = English).
"to_user_id_str" = Destination's Twitter user ID number (in "" quoted string format). Can also be null.
"to_user" = Destination's Twitter user handle string (eg "Superman"). Can also be null.
"from_user_name" = Source's Twitter user personal name string (eg "Bruce Wayne")
"to_user_name" = Destination's Twitter user personal name string (eg "Clark Kent"). Can also be null.

Note: This list is not all inclusive (eg there's some optional reply fields I've left out).

Twitter help uses the term "Username" to denote the addressing mechanism name (eg @Batman). It then uses "Name" to represent your personal name ID (eg "Bruce Wayne"). Got it? Now here's where it might get a little confusing ...
In the above tweet message, the "to_user_name" and "from_user_name" contain the personal ID Names (eg "Bruce Wayne"). The "to_user" and "from_user" contain the addressing mechanism names (eg "Batman").

Thanks to Frank McClain for sending me a "Mice" tweet to analyse/help confirm my findings. Quack-tacular!

Some of the REST API searches mentioned previously may return a Twitter user's Profile information. Here are some selected Profile fields I found interesting:

"friends_count" = How many Twitter users this user is following (in unquoted numeric format).
"profile_image_url" = HTTP URL string showing user's avatar image filename.
"profile_background_image_url_https" = HTTPS URL string showing user's profile background image filename.
"url" = User's website URL string (eg "www.blog.com").
"id" = User's Twitter user ID number (in unquoted numeric format).
"screen_name" = User's handle/screen name (eg "Batman").
"location" = User's location string (eg "Gotham").
"lang" = Presumably the user's tweet language (eg "en" for English).
"followers_count" = How many Twitter users are following this user (in unquoted numeric format).
"name" = User's personal name ID (eg "Bruce Wayne").
"description" = User's witty description of themselves (eg "The Dark Knight").
"status" = Object containing the user's most recent (unprotected) sent tweet (see above tweet format for more details).
"profile_image_url_https" = HTTPS URL string showing user's avatar image filename.
"created_at" = Date and Time when the user's account was created.
"utc_offset" = Appears to be the numeric timezone offset in seconds from GMT (eg -21600).
"profile_background_image_url" = HTTP URL string showing user's profile background image filename.
"protected" = Boolean presumably representing whether the user's tweets are protected (eg true/false).
"id_str" = User's Twitter user ID number (in "" quoted string format).
"time_zone" = Appears to be string based representation of time zone (eg "Central Time (US & Canada)").
"geo_enabled" = Boolean presumably representing if the user's location is disclosed.
"profile_use_background_image" = Boolean representing if the user is using their background profile picture.

Adding to the name confusion mentioned earlier, we now see the "screen_name" and "name" keys. In this case, "screen_name" seems to refer to the addressing mechanism name (eg "Batman") and "name" refers to the personal ID name (eg "Bruce Wayne"). And while not explicitly present in these fields, there's also the commonly used "handle" term (eg "Batman"). Also note that "id" in a tweet message refers to the unique tweet ID number where as "id" in a profile represents a Twitter user's ID number ... Consistency? Pffft! We don't need no stinkin' consistency! Everyone's a critic huh?

Summary

We've installed TweetDeck and seen that it can store/leave behind the last 4 user search terms and multiple Twitter user ID number artifacts. These user ID numbers can then be used to find corresponding Twitter handles/addressing mechanisms via a Twitter REST API web search. The Twitter REST API can also be used to retrieve other information about the user's followers/leaders/profile.
We have also coded a small Perl script ("json-printer.pl") to make it easier to read the JSON formatted results returned from the Twitter REST API.
Any comments/suggestions/requests for further name droppage (the record now stands at 5 / 6 if you count Doris) are welcome.
FYI This will probably be my last post for a few weeks (month?) as I have to swap my PC for some renovation work boots. I will still be contactable via email / Twitter though. Hopefully, I'll be back with some new ideas and all of my digits in the near future ;)