Calculating and Visualizing Offer Similarity

When a user visits a TrialPay touchpoint, they’re shown a variety of offers from our advertisement network they can complete in order to earn some virtual good for free (e.g., anti-virus software, Fandango movie tickets, or in-game virtual currency).  Our data science team has been working to determine whether certain offers are more similar than others, and whether offers can be sorted into “communities” of offers that tend to be completed by the same people, using only user clickstream data.

Visualization by Dave Holtz. Adapted from a visualization by Mike Bostock

The graphic above is a visual representation of the calculated similarity between (and subsequent clustering of) different offers in TrialPay’s advertisement network.  An interactive version can be found here.  Similarity between offers is calculated by constructing a completor-offer matrix. Each row in the matrix corresponds to a TrialPay offer completor, and each column corresponds to a TrialPay offer. The cosine similarity between each column is calculated to build an offer-offer similarity matrix.

A naive way to visualize this data is to construct a network diagram, wherein each node corresponds to an offer, and each (weighted) link corresponds to their calculated similarity. However, once data becomes large (as TrialPay’s is), network diagrams tend to become inscrutable hairballs — “hairballs” might make a great addition to the MOMA‘s latest collection of abstract art, but it won’t be help you make useful insights into user behavior!

A helpful alternative to the network diagram is representing the network as an adjacency matrix, where each cell Aij  represents a link from node i to node j. This is the approach we’ve taken above.  Again, in our case nodes correspond to offers, and the links correspond to the strength of their similarity in the offer-offer matrix. The more intense the coloring of the cell, the more similar the offers.

You might have noticed that the cells in this adjacency matrix visualization are colored differently. Color corresponds to different ‘clusters’ or subcommunities of offers. Clusters are calculated using the R package iGraph‘s implementation of the Walktrap community-detection algorithm [1], with nsteps = 4. Clustering offers allows us to answer the question, “What larger groups of offers tend to be completed in tandem?” A logical advancement in this line of thinking is to create user profiles corresponding to the offer subcommunities. A few prominent ones present in the above dataset include:

People with wanderlust:

  • These users complete offers like Orbitz Vacation PackagesSmart DestinationsOrbitz Hotels, and Funjet Vacations. With all of that traveling, its a marvel that these people have time to get free online goods through TrialPay!

People who love sharing the details of their medical history

  • These users complete offers like the Gout Survey, the Rheumatoid Arthritis Survey, the Pediatric Depression Survey, and the Crohn’s Disease Survey. These are the TrialPay equivalent of that one uncle at Thanksgiving who loves to show you those fresh stitches from his gnarly surgery in October.

People who spend a lot of time around the house, but still want to have fun

  • These users complete offers like Flirty ApronsCreative Girls ClubAnnie’s Creative Painting Kit Club, and Amora Coffee. They’re probably spending a lot of time at home with the kids (hence the need to stay fueled with some Amora Coffee!). But hey, just because you’re taking care of household chores doesn’t mean you don’t want to spice things up now and again with a Flirty Apron…right? If Danny Tanner of Full House completed offers with TrialPay (and we like to think he would’ve had the internet existed in the days of TGIF), he’d probably fall into this group.

The interactive version of this visualization allows you to re-order the matrix’s rows and columns.  Re-ordering the rows and columns of the adjacency matrix can surface fascinating insights into TrialPay user behavior. This visualization allows you to try different orderings via the drop-down menu. There are four ways to organize the rows and columns.

  • What’s in a name? This option sorts the rows and columns by the offer name.
  • By frequency: “What’s the frequency, Kenneth?” This option sorts the rows and columns by how many offers a node has some connection to (i.e., how many offers have non-zero similarity to this one?).
  • By cluster: “Ain’t nobody fresher than my mother***in’ clique!” This option sorts the rows by their cluster / subcommunity.
  • By offer type: What’s your type? Here at TrialPay, we assign offers a type (such as cost per lead or cost per sale). This option sorts the rows by type.

It’s noticeable that sorting by characteristics with stronger correlation to offer similarity (e.g., cluster) produce more highly diagonalized adjacency matrices (i.e., most non-zero values appearing near the diagonal). Characteristics that don’t correlate with similarity (e.g., name) show no such trend.

Even though the adjacency matrix is an effective way to visualize this information, it still doesn’t scale well to data as large as ours (thousands of offers). In order to procure a more digestible data set, the above visualization only shows offers meeting certain selection criteria pertaining to value, similarity, and reward currency.

[1] Pascal Pons, Matthieu Latapy: Computing communities in large networks using random walks

Posted in Data Science | Leave a comment

Interviewing At TrialPay 101

I’m thinking about applying for an engineering job at TrialPay. What are TrialPay’s engineering interviews like?

Our interviews include phone interviews– sometimes with shared Google Doc for writing pseudo-code– and in-person sessions at our Palo Alto office. You’ll meet our wonderful technical recruiter Karen Tjhan  and members of our elite engineering squad.

Karen T- Technical Recruiter

Please excuse her manic grin.

We’ve assembled a collection of interactive interview questions that show you how we work and provide you the opportunity to demonstrate your skills. We’ve selected our questions to test five different areas. Some questions focus just on one area; other questions cover multiple (or all!) areas.

We test for:

  • Basic Algorithms and Time/Space Complexity knowledge. Here’s a little secret: Just about nobody six months out of school remembers how to implement a Bloom Filter or a splay tree without looking it up online. This includes your interview panel. So you probably won’t get asked about these. That said, you should be comfortable with basic data structures and algorithms that show up regularly in the profession.
  • Do you like computers? If you do, then the practical knowledge you’ve acquired through playing around with the OS, tinkering on the web, etc, will shine through during your interviews.
  • Architecture a.k.a. open-ended problems. When you’ve got some architectural options, show us that you can weigh out the pros and cons.
  • Coding and Attention to Detail. Keep your cool when getting into the details.
  • Communication. Because our interviews are highly interactive, all the questions provide an opportunity for you to showcase your brilliant communication ability!

We do not test for:

  • Brainteaser solving ability. We do not like questions whose crux is a single clever flash of insight. While brainteasers can be fun, there’s too much randomness and instability in solving these for them to be a reliable indicator of professional success.

We have a high bar, and we pride ourselves on our rigorous standards. If you’ve got the right talent and passion, we can’t wait for you to join our team!

TP High

Come join our team of said esteemed engineers
PS. We don’t actually dress like this on a normal basis… this was for a themed day at work. We promise.

Curious to find out more? Come interview with us! Apply here or send a shoutout to jobs-eng[at]trialpay[dot]com

Posted in Engineering | Leave a comment

TrialPay Engineering Challenge: The Snake Cube Puzzle

At TrialPay, our engineers love puzzles. When we aren’t busy solving interesting problems building a robust, feature-rich platform, we like to stay sharp by challenging one another with unique puzzles that allow us to think about our technical challenges in different ways.

If you’re up for the challenge below, you can submit your solution to eng-puzzles@trialpay.com.

If you submit a good solution and send along a GitHub/StackOverflow/LinkedIn profile or resume, we will contact you for an interview.

Snake Cube
Recently, one of our engineers introduced us to the snake cube. A snake cube is a puzzle composed of a chain of cubelets, connected by an elastic band running through each cubelet. Each cubelets can rotate 360°about the elastic band, allowing various structures to be built depending upon the way in which the chain is initially constructed, with the ultimate goal of arranging the cubes in such a way to create a cube.

Consider the 27-cubelet snake, which assembles into a 3*3*3 cube, shown below.

This particular arrangement contains 17 groups of cubelets, composed of 8 groups of two cubelets and 9 groups of three cubelets. This arrangement can be expressed in a variety of ways, but for the purposes of this exercise, let ’0′ denote a pieces whose rotation does not change the orientation of the puzzle, or may be considered a “straight” piece, while ’1′ will denote pieces whose rotation changes the puzzle configuration, or “bend” the snake. Using that schema, the snake puzzle above could be described as

001110110111010111101010100

Your challenge is to write a program, in any language of your choosing, that takes the cube dimensions (X, Y, Z) and a binary string as input, and outputs ’1′ (without quotes) if it is possible to solve the puzzle, i.e. construct a proper X*Y*Z cube given the cubelet orientation, and ’0′ if the current arrangement cannot be solved.

snake_cube_solution(3, 3, 3, “001110110111010111101010100″) == 1

Bonus points are given for printing out the final x,y,z coordinates for each cubelet, in order, assuming that the first block of the cube always begins at (1,1,1) and the final cubelet always ends with (X,Y,Z).

Posted in Uncategorized | Leave a comment

Security at TrialPay

Our CTO, Eddie Lim, recently gave a tech talk to the Stanford ACM club about security. BTW, if you have the opportunity, you should definitely come speak at one of their sessions — we were really impressed by the depth of the questions that the students were asking, especially on a topic that isn’t necessarily something you might deal with every day!

For context, TrialPay works on driving traffic to advertisers by giving users something that they already want for free. Our platform currently gets over 100 million views per day, and in the last 6 years, over 200 million people have gone through TrialPay to get free stuff. Security is a huge deal for us for a few reasons:

  1. Some subset of our users is always trying to game the system and get free stuff.  Just google “hacking trialpay”, and you’ll see what we mean.
  2. We have a new product we’re working on that aims to connect the online and offline worlds, and part of the implementation involves storing credit card numbers.  Take a look at this demo to get an idea of what the product is; basically users get instant rewards online for offline actions, and with our help, brick and mortar businesses can draw causal links between their online advertising campaigns and offline sales. With our help, companies will finally be able to answer burning questions like, “Did that old-timey looking product photo we posted last week actually get people in our store?” Social media experts beware.

In our talk we focused on 2 main topics: securing user authentication (specifically, two-factor for both VPN and SSHD) and the architecture surrounding our credit card vault. You can find the video of Eddie’s talk below, as well as slides that you can use to follow along!  If you want more info about two-factor authentication or want to see some code samples, please see our previous posts about VPN and SSHD.

Posted in Uncategorized | Leave a comment

TrialPay’s world of monitors — the tech inside

When TrialPay moved into iphoto 2ts new office in June 2012, we wanted to provide digital signage throughout the office for employees and visitors to see ever-changing information about our office and business.

We’ve built a number of internal graphs and tickers, as we call them, ranging from showing transaction volume and locations to server and bandwidth loads. There’s some fun stuff in there too — like what’s being served for lunch and dinner during the rest of the month, new employee highlights, and the all important fitness program calendar. Visitors never seem to be able to get enough enjoyment out of them!

photo 6With 23 LCD displays mounted on the walls ranging from 42″ to 55″ in our ~27,000 square foot office, we needed to find a way to find a way to get various video sources to all the displays. Our office was being built from scratch, so our contractors installed CAT5e cable to each television location from the AV closet and used commercial grade HDMI baluns on each end to support long cable runs (the longest run is probably 150 feet from our AV closet.) Installed neatly on a two-post rack in that closet, we patch each HDMI cable to Mac minis. Each Mac mini has a mini-DisplayPort out and HDMI out, we adapt the former back to HDMI, and we’ve got two 1920×1080 outputs from each Mac.

photo 5

But what if we want to display the same content on more than one TV, and not have to manage another Mac? We’re using a number of generic HDMI splitters and they do the job. There are fancier matrix switches around too if you want to be able to plug all sources and outputs into one box, but that would have added thousands to our costs. Since we aren’t changing what display shows what Mac often, there wasn’t any need. The Mac minis aren’t too far from the IT area, but we can remotely manage them via Apple Remote Desktop (or go low tech with a bluetooth keyboard). I wasn’t able to justify the expense of a networked KVM or console server given these aren’t running in an environment where five nines are needed (though sometimes I think our CEO would like that).

Nonetheless, the Macs are pretty stable. After turning off all the energy saver functions, finding the right screen resolution (1080i NTSC, which worked best with the baluns), configuring ARD/screen sharing, and installing Firefox, we were pretty much ready to go. Along the way, we’ve found some interesting tools, such as a tab rotator for Firefox to make showing content easier.

Here’s a picture of our marquee 55″ screens right by our social area — these really get lots of foot traffic so we reserve these for some of the coolest stuff TrialPay’s got going!

photo 7

Posted in Uncategorized | Leave a comment

Leveraging Google Apps email to set up two-factor authentication for VPN

Two-factor authentication/2FA has become a necessary first step in keeping our network secure. As somone who VPNs a few times daily, using traditional 2FA solutions with hardware tokens or phone tokens like Google Authenticator would have been a major hassle. Plus having to work with every user to register yet another token would have been another burden on IT.

I wanted to explore options that could provide the same level of 2FA security without the hassle factor. Since all our users were already using Google Apps for email and had two-factor enabled on it, wouldn’t it be nice to just leverage that!

Normal VPN Auth flow

  1. User -> VPN Device
  2. VPN Device —radius—> Auth Server
  3. Auth Server OK -> VPN Device

VPN Auth flow w/ Google Apps 2FA

  1. User 1click visits a Google App Engine hosted site https://xxxxxxxx.trialpay.com which auto logs you in on browsers where you already read your Google Apps email.  This page just shows a 60 second timer for the user to complete the rest of the VPN login process.
  2. User -> VPN Device
  3. VPN Device —radius—> Auth Server
  4. Auth Server –curl–> https://xxxxxxxx.trialpay.com/?didulogin=$user
  5. Auth Server OK -> VPN Device


Auth Server config:

You must have a working freeradius2 server Authentication setup already. We’re just going to add an extra Authorization layer. Finally get to use one more A in AAA!

yum install perl-WWW-Curl freeradius2-perl

1. In  /etc/radb/sites-enabled/default  add “perl” to the authorize section

 authorize {
     perl

2. In /etc/raddb/modules/perl define the Perl file

perl {
      module = ${confdir}/perlauth.pl

3. perlauth.pl

use strict;
use warnings;
use WWW::Curl::Easy;
use vars qw(%RAD_REQUEST);
use constant RLM_MODULE_REJECT=>0;
use constant RLM_MODULE_OK=>2;
sub authorize {

    my $curl = WWW::Curl::Easy->new;
    $curl->setopt(CURLOPT_TIMEOUT, 5);
    $curl->setopt(CURLOPT_URL, 'http://xxxxxxxx.trialpay.com/didulogin?name='.$RAD_REQUEST{'User-Name'});
    my $response_body = '';
    open(my $fileb, ">", \$response_body);
    $curl->setopt(CURLOPT_WRITEDATA,$fileb);
    my $retcode = $curl->perform;

    #user didn't do 2FA first
    if ($response_body =~ /negative/) { return RLM_MODULE_REJECT; }

    # optional ip matching logic to allow and email InfoSec team of ip mismatch allowed access.
    #if ($response_body =~ /$RAD_REQUEST{'Calling-Station-Id'}/) { return RLM_MODULE_OK; }

    #allow all other errors to pass through as ok
    return RLM_MODULE_OK;
}

Google App Engine setup for https://xxxxxxxx.trialpay.com

Set up App Engine to require and only accept logins from your domain, and upload the Python code and template html file below. The code simply gets the username from google and creates a 60 sec memcache key with the user’s remote IP as value.

  • main.py
#!/usr/bin/env python
import os
os.environ['DJANGO_SETTINGS_MODULE'] = 'settings'
from google.appengine.dist import use_library
use_library('django', '1.2')
from google.appengine.api import users
from google.appengine.ext import webapp
from google.appengine.api import memcache
from google.appengine.ext.webapp import template
from google.appengine.ext.webapp.util import run_wsgi_app
from google.appengine.ext.webapp.util import login_required

class MainHandler(webapp.RequestHandler):
  @login_required
  def get(self):
    tmout = 60
    ip = self.request.remote_addr
    user = users.get_current_user()
    template_values = { 'tmout': tmout }
    path = os.path.join(os.path.dirname(__file__), 'index.html')
    self.response.headers.add_header("Set-Cookie", "SACSID=deleted; Expires=Thu, 01-Jan-2001 00:00:00 GMT")
    self.response.out.write(template.render(path, template_values))
    memcache.add(key=user.nickname(), value=ip, time=tmout)

class diduloginHandler(webapp.RequestHandler):
  def get(self):
    name = self.request.get("name")
    data = memcache.get(name)
    if data is not None:
        self.response.out.write(data)
    else:
       self.response.out.write("negative") 

def main():
  application = webapp.WSGIApplication([('/', MainHandler),
                                        ('/didulogin', diduloginHandler)],
                                       debug=True)
  run_wsgi_app(application)

if __name__ == '__main__':
  main()
  • Sample index.html file
<head>
<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js"></script>
<script type="text/javascript">
$(document).ready(function(){
    var count = {{ tmout }} -1 ;
    countdown = setInterval(function(){
    $(".countdown").html(count);
    count--;
    }, 1000);
});
</script>
</head> <body> <span class="countdown">{{ tmout }}</span> </body>

FAQ & Considerations

  • IP Matching  –  An optional security enhancement.  IP used at xxxxxxxx website occasionally won’t match the vpn client ip – so it’s better to email user and infosec instead of denying access.
  • High Availability – Google App Engine being a PaaS should in theory be highly available, but as with any service it can go down (once in 6 months so far). The Perl auth layer should be designed to be fail-safe. Even then partial failures, such as only memcache layer being unavailable, could happen, so you should have a planned way to disable this.
  • Cost – We choose the SNI SSL cert. option which kept our total cost at $9/month.
Posted in Engineering | Leave a comment

Dual-factor authentication for SSHD

For our production entry-point servers, we wanted to enable dual-factor authentication – meaning that users have to use an SSH key and a password to authenticate to the servers.

The Why
Most of our developers are using SSH keys without a passphrase so that they don’t have to type a password every time they sign in. This poses a security risk: if a laptop is stolen (which has happened to one of our developers), anyone can use the key on the machine to connect to our servers. We looked into using passphrase protected SSH keys, but SSHD does not (and cannot) know whether or not the key used is passphrase protected. This means that passphrases as dual factor authentication would have to be human-enforced. Instead of creating more management overhead, we wanted something that would be enforced systematically by the machine.

The How
SSHD can be configured to support multiple authentication schemes (i.e. key based, password, challenge and response). Although each scheme can be enabled/disabled separately, a user has to pass just one of these mechanisms to authenticate. To enforce a second authentication, it has to be performed after the SSHD authentication step is done.

Our first idea was to make the login shell ask users for their password. Since bash doesn’t use PAM, we’d have to either make a patch to bash or rely on the init script (/etc/profile). The init script setup would cover normal use cases, but it would still be possible to circumvent the 2nd authentication.

Then we found the ForceCommand configuration option and an example of a similar feature.

Now all we needed was a way to check the user’s password. This turned out to be a non-trivial task. We were hoping that there was a command line tool to ask for the user’s password, check the keyboard input against system password database, and return the result.

Here’s what we found.

  • Linux doesn’t have a system command to do this.
  • It can be done via a simple Python script, but we don’t want to support another language.
  • Mac OSX has a chkpasswd command, and Apple open sourced it, but nobody ported it Linux.

So we decided to write a simple version in C.

The Whole Shebang
First, compile and install the chkpasswd command

chkpasswd.c

#include <stdlib.h>
#include <unistd.h>
#include <crypt.h>
#include <stdio.h>
#include <shadow.h>

int main (int argc, char *argv[]) {
  char *plain_passwd;
  char *crypt_passwd;
  char *username;
  struct spwd *shadow;

  if (argc != 2) {
    exit(-1);
  }

  username = argv[1];
  plain_passwd = getpass("password:");

  if ((shadow = getspnam(username)) == NULL) {
    exit(-1);
  }

  crypt_passwd = shadow->sp_pwdp;

  exit(strcmp(crypt(plain_passwd, crypt_passwd), crypt_passwd));
}

Makefile

all: chkpasswd.c
        gcc -o chkpasswd -l crypt chkpasswd.c
install: all
        install -s -m 4755 chkpasswd /usr/local/bin

And configure sshd to force the gate-keeper script.

/etc/ssh/sshd_config

PasswordAuthentication no
ChallengeResponseAuthentication no
ForceCommand /etc/ssh/sshd_gatekeeper.sh

/etc/ssh/sshd_gatekeepr.sh

#!/bin/bash
trap "exit 0" SIGINT

CLIENT_IP=`echo $SSH_CLIENT | awk '{print $1}'`

if [ -n "$SSH_ORIGINAL_COMMAND" ]; then
  /bin/grep -qs "^$CLIENT_IP\$" /etc/ssh/sshd_whitelist
  if [ "$?" -eq "0" ]; then
    $SHELL -c "$SSH_ORIGINAL_COMMAND"
    exit 0
  fi

  CMD=`echo $SSH_ORIGINAL_COMMAND | awk '{print $1}'`
  echo "'$CMD' is not supported"
  logger "'$CMD' failed for user=$USER, IP=$CLIENT_IP"
  exit 0
fi

echo "logging in as $USER"
CNT=0
while [ $CNT -lt 3 ]; do
  /usr/local/bin/chkpasswd $USER
  if [ "$?" -eq "0" ]; then
    $SHELL -l
    exit 0
  fi
  let CNT=CNT+1
done

logger "too many failed attempts: user=$USER, IP=$CLIENT_IP"

* You can white-list certain IPs by adding them to /etc/ssh/sshd_whitelist

Now restart SSHD and try to avoid users for a couple of days (or until they get used to entering a password to sign in)…

Afterthoughts
An SSH key is actually considered “something the user knows“, not “something the user has“, so the above approach is not dual-factor authentication in the strict sense. In reality, however, it’s very close to “something the user has” since few people will be able to memorize it. Most users will keep theirs in a physical storage device, making it behave like a physical object.

Before settling on our current approach, we did try RSA SecureId. It’s a true dual-factor authentication system with “something the user has”, but we decided against it in the end because it was simply too much of a hassle to use the device every time we needed to log into the system.

Another (free) alternative is Google authenticator. It can be integrated with SSHD using a PAM module, but it still has the same requirement that each time we log into the servers, we have to enter the generated code.

Posted in Uncategorized | Leave a comment