Hackathon 3.0 Project - Overlaying maps with data from Splunk

Our Big (Data Visualization) Idea

Heat map showing sales volume in the Northeast

Every day, Cayan amasses gigabytes of machine-generated data – data about our merchants, their transactions, and what items they sold. For our hackathon project, our team wanted to add context to this raw data. Our goal was to help our sales engineering colleagues get useful, high-level information at a glance by looking at a map. We wanted to give our users a way to view their merchants’ activity over time.

We decided that, to make the best use of the huge amount of data we have, we would need to look at not just a merchant’s location but also time. The plan was to create two maps: a standard map that used markers to denote points of interest, and a heatmap.

For the first map, our approach was to try to show the spread of a Fortune 1000 company’s 1,100+ locations over time. As each store made its inaugural transaction, we put another pushpin on the map. Zoomed out, this would show how their rollout with Cayan moved across North America, state by state. By zooming in on the map, you’d get to see each individual location and details about it (name, ZIP code, date of first transaction).

The first handful of pilot locations go live
Our second map would be a heatmap showing the merchant’s busy season, contrasted with a less busy period. We started with the Christmas season but soon moved to Halloween 2015, as this was a better fit for this particular merchant’s business.

To do all this, we needed to:
 
  • Pull sales transaction details from our data warehouse
  • Match merchants to transactions to identify their ZIP code
  • Find latitude/longitude data from ZIP codes via Google’s Geocoding API
  • Use Google’s Map and heatmap APIs to visualize the data
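As a rough sketch of the data shape those steps build up (the class and property names here are illustrative, not our actual schema), each record ended up looking something like this:

```csharp
using System;

// Illustrative shape of one enriched transaction record as it moves
// through the pipeline steps above; all names here are hypothetical.
public class EnrichedTransaction
{
    // Step 1: pulled from the data warehouse
    public string MerchantId { get; set; }
    public DateTime TransactionDate { get; set; }
    public decimal Amount { get; set; }

    // Step 2: matched to the merchant's location
    public string ZipCode { get; set; }

    // Step 3: resolved via the Geocoding API
    public double Latitude { get; set; }
    public double Longitude { get; set; }
}
```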

Splunk Is Your Friend…

All of Cayan’s application logs are indexed in Splunk. You can think of this as a search index or mini data warehouse. Splunk has a RESTful API, so we decided to use this as our data source. Our plan was to build a search in Splunk that brought back everything we needed, then use the API to pull the result set out and manipulate it in C# code. (Splunk searches can be shared with a special GUID that we can use from our code to request the results.) Then, using http://json2csharp.com/, we created classes to parse the results into in-memory objects with the help of JSON.Net. We planned to use Dapper.Net to get the useful information into our own database table, allowing us to further pre-process the data set without needing to call the RESTful API every time.
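As a minimal sketch of that flow – the result classes, staging table, and the exact results URL are illustrative assumptions, not our production code:

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Net.Http;
using System.Threading.Tasks;
using Dapper;
using Newtonsoft.Json;

// Classes generated with json2csharp from a sample response;
// names here are illustrative, not the real Splunk payload.
public class SplunkResult
{
    public string MerchantId { get; set; }
    public string ZipCode { get; set; }
    public string TransactionDate { get; set; }
}

public class SplunkResponse
{
    public List<SplunkResult> Results { get; set; }
}

public static class SplunkImporter
{
    private static readonly HttpClient Http = new HttpClient();

    // Pulls a shared search's results as JSON and stages them in our own table.
    public static async Task ImportAsync(string resultsUrl, string connectionString)
    {
        // resultsUrl is the shared search's results endpoint, e.g. something like
        // https://splunk.example.com:8089/services/search/jobs/<sid>/results?output_mode=json
        var json = await Http.GetStringAsync(resultsUrl);
        var response = JsonConvert.DeserializeObject<SplunkResponse>(json);

        using (var connection = new SqlConnection(connectionString))
        {
            // Dapper runs the insert once per item in the collection.
            connection.Execute(
                "INSERT INTO StagedTransactions (MerchantId, ZipCode, TransactionDate) " +
                "VALUES (@MerchantId, @ZipCode, @TransactionDate)",
                response.Results);
        }
    }
}
```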

…Except When It's Not

By the end of Day One we hit a roadblock with this approach - the Splunk API has a hard limit of only returning 100 results per call. Given that our ideal data set was made up of nearly 5 million records, this approach wasn't going to work. We tried a number of different approaches to work around this limitation. Ultimately, we pivoted to directly accessing a database used for performance testing, which contained enough data for our purposes. This meant we could continue as planned, albeit with a little more SQL work on our side.

Optimizing Our Google Maps API Calls

Although we had ZIP codes for all of the transactions, the sheer volume of data meant it wouldn’t be viable to interact with the Google Maps API on a per-transaction basis. For each transaction, the API would have to look up the ZIP code and work out the latitude and longitude before plotting it on the map – processing time we simply didn’t have if we wanted to give our demo the seamless user experience we were after.

To speed this process up, we created a temporary table of each unique ZIP code contained in the transactions. We then looked up the longitude and latitude from the Google Maps Geocoding API and stored these locally. (The API has certain limits, such as no more than 10 requests per second and no more than 2,500 requests per day, and we didn’t want to exceed these for a hackathon project. By storing the values locally, we didn’t need to hit the API each time we needed the data.)
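A minimal sketch of that lookup, assuming a hypothetical ZipCoordinates cache table (the request format is Google’s Geocoding API; the table name and the delay value are our own assumptions):

```csharp
using System.Data.SqlClient;
using System.Net.Http;
using System.Threading.Tasks;
using Dapper;
using Newtonsoft.Json.Linq;

public static class ZipGeocoder
{
    private static readonly HttpClient Http = new HttpClient();

    // Resolves one ZIP code to latitude/longitude and caches it locally,
    // so repeated runs never hit the API for the same ZIP again.
    public static async Task CacheZipAsync(string zip, string apiKey, string connectionString)
    {
        var url = "https://maps.googleapis.com/maps/api/geocode/json" +
                  $"?address={zip}&key={apiKey}";
        var json = JObject.Parse(await Http.GetStringAsync(url));

        // The first result's geometry.location holds the lat/lng pair.
        var location = json["results"][0]["geometry"]["location"];
        var lat = (double)location["lat"];
        var lng = (double)location["lng"];

        using (var connection = new SqlConnection(connectionString))
        {
            // ZipCoordinates is our hypothetical local cache table.
            connection.Execute(
                "INSERT INTO ZipCoordinates (ZipCode, Latitude, Longitude) " +
                "VALUES (@zip, @lat, @lng)",
                new { zip, lat, lng });
        }

        // Pause briefly to stay under the per-second request limit.
        await Task.Delay(150);
    }
}
```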

Data Aggregation

The heatmap demo only covered a couple of days’ worth of activity, but even then the number of transactions was enormous. We couldn’t plot 5 million individual data points without crippling the browser. Instead, we aggregated the data in SQL, totalling the transactions per merchant per day. We could then use the heatmap API’s weighting properties to show the volume without having to show each individual transaction as it happened. And because we had a limited number of merchant locations (1,100+), we knew we would only ever be dealing with that many longitude and latitude pairs.
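A minimal sketch of that aggregation, with hypothetical staging table and column names:

```csharp
using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using Dapper;

// One weighted heatmap point: a location's total transactions for a day.
public class HeatmapPoint
{
    public double Latitude { get; set; }
    public double Longitude { get; set; }
    public DateTime Day { get; set; }
    public int Weight { get; set; }
}

public static class HeatmapAggregator
{
    // Collapses millions of raw transactions into one row per location per day;
    // the Weight column feeds the heatmap API's weighting property.
    public static IEnumerable<HeatmapPoint> Aggregate(string connectionString)
    {
        const string sql = @"
            SELECT z.Latitude,
                   z.Longitude,
                   CAST(t.TransactionDate AS DATE) AS [Day],
                   COUNT(*) AS Weight
            FROM   StagedTransactions t
            JOIN   ZipCoordinates z ON z.ZipCode = t.ZipCode
            GROUP  BY z.Latitude, z.Longitude, CAST(t.TransactionDate AS DATE)
            ORDER  BY [Day];";

        using (var connection = new SqlConnection(connectionString))
        {
            return connection.Query<HeatmapPoint>(sql);
        }
    }
}
```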

Don’t Judge Us, Judges

The data on our web page was simply stored in a very large JavaScript array. Given more time, we would have built a backend to feed the front end with the data, which would have greatly aided optimization. But hey, this was a hackathon and time was against us.

To build the array, we took our SQL and added an extra, dynamically-generated column which essentially wrote the JavaScript code of the array. Ugly? Yes! But it let us tweak the SQL, date ranges and parameters and, in seconds, get a fully regenerated array. We were also able to order the array, which helped in optimizing the front end, as we knew that at certain dates we could stop looping over it.
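As a rough sketch of that trick (assuming the aggregated rows live in a hypothetical DailyHeatmapPoints table, with made-up column names), the SQL builds each JavaScript array element as a string and the C# side just dumps the rows between the brackets:

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;
using System.IO;
using System.Linq;
using Dapper;

public static class JsArrayGenerator
{
    // Each row of the query is one pre-formatted element of the JavaScript array,
    // e.g. "  [42.3601, -71.0589, '2015-10-31', 137],"
    public static void WriteHeatmapArray(string connectionString, string outputPath)
    {
        const string sql = @"
            SELECT '  [' + CAST(Latitude AS VARCHAR(20)) + ', '
                         + CAST(Longitude AS VARCHAR(20)) + ', '''
                         + CONVERT(VARCHAR(10), [Day], 120) + ''', '
                         + CAST(Weight AS VARCHAR(10)) + '],' AS JsLine
            FROM   DailyHeatmapPoints
            ORDER  BY [Day];";

        IEnumerable<string> lines;
        using (var connection = new SqlConnection(connectionString))
        {
            lines = connection.Query<string>(sql).ToList();
        }

        // Wrap the generated rows in a JavaScript array literal the page can load.
        File.WriteAllLines(outputPath,
            new[] { "var heatmapData = [" }.Concat(lines).Concat(new[] { "];" }));
    }
}
```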

Takeaways

So what did we learn?
 
  1. It’s amazing what cool stuff you can produce in a hurry with a good team and a few shortcuts.
  2. API limits can really spoil your day.
  3. Google has some very cool mapping APIs.
  4. Providing a visual representation of large volumes of data is an incredibly powerful way to communicate the information hidden inside.

We hope you found something interesting to take away from our project.

Video Resources

The screenshots and writeup can't do justice to how cool this data is. Check out these short videos to see our project in action!