Developing a data product

I wanted to develop a product end to end. One idea was a service that identifies the web technologies used by popular sites. Building such a product end to end requires crawling URLs, parsing raw website data, data processing, server-side web development, UI work, and serving the final product online, all of which I wanted to tinker with. The final product lets a user enter a JavaScript library name and see a sample of URLs purportedly using that library.

The top 1 million Alexa sites seemed like a good place to find a list of URLs to crawl.

Most of the development was done using AWS services. The final web application, though, is served using Google Cloud’s Cloud Run service.

There are a few high-level moving parts worth highlighting:

1) Crawler
2) Data processing
3) Web service/application

CRAWLER
Simply put, crawlers are programs that extract information from URLs. URLs found while parsing the information extracted from previous URLs are crawled in turn, a process popularly known as “spidering”. Though the idea is simple, implementing a high-performance crawler is challenging. For my purposes, I created a simple crawler as shown in the diagram below:

1) A Simple Queue Service (SQS) queue is seeded with the top 1M Alexa URLs.
2) Each EC2 instance runs the crawler process. The crawler process (a) requests a URL to crawl from SQS, (b) fetches that URL, and (c) submits the received response to a Kinesis Firehose stream.
3) The Kinesis Firehose stream transports data to S3.
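
The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual crawler: the function names, the JSON record layout, and the plain `urllib` fetch are assumptions made for the example (the real crawler fetches pages through Splash). The `sqs` and `firehose` arguments are expected to behave like boto3 clients, using the `receive_message`, `delete_message`, and `put_record` calls.

```python
import json
import urllib.request


def fetch_page(url, timeout=30):
    """Fetch a URL and return the raw response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def crawl_one(sqs, firehose, queue_url, stream_name, fetch=fetch_page):
    """Process one URL from the queue; return it, or None if the queue is empty."""
    # Ask SQS for a single URL to crawl (long-poll for up to 10 seconds).
    resp = sqs.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=10
    )
    messages = resp.get("Messages", [])
    if not messages:
        return None
    msg = messages[0]
    url = msg["Body"]

    # Request the page; record the failure instead of crashing the worker.
    try:
        body = fetch(url)
        record = {"url": url, "body": body.decode("utf-8", errors="replace")}
    except Exception as exc:
        record = {"url": url, "error": str(exc)}

    # Submit the response to the Firehose stream, one JSON document per line.
    firehose.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )

    # Delete the message so the URL is not handed to another worker.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
    return url
```

With boto3 installed, the clients would typically come from `boto3.client("sqs")` and `boto3.client("firehose")`, and each EC2 worker would call `crawl_one` in a loop until it returns `None`.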

The crawler uses Splash, a headless browser service that requests the URL. A powerful feature of Splash is the ability to write Lua scripts to perform customizations.
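
As a rough illustration of how a Lua script is handed to Splash, the sketch below builds a POST request for Splash's `execute` endpoint. The Lua script, the `localhost:8050` address, the `wait` value, and the helper names are assumptions for the example; the script used in this project is not shown here and may well differ.

```python
import json
import urllib.request

# A small Lua script: load the page, wait briefly, return the rendered HTML.
LUA_SCRIPT = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  return {html = splash:html()}
end
"""


def build_splash_request(target_url, splash_base="http://localhost:8050", wait=1.0):
    """Build (but do not send) a JSON POST request for Splash's /execute endpoint."""
    payload = {"lua_source": LUA_SCRIPT, "url": target_url, "wait": wait}
    return urllib.request.Request(
        splash_base + "/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def render(target_url, **kwargs):
    """Send the request to a running Splash instance and decode its JSON reply."""
    req = build_splash_request(target_url, **kwargs)
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())
```

Calling `render("http://example.com")` requires a Splash instance listening at the assumed address; the values passed in the JSON payload are what the script reads as `args.url` and `args.wait`.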

It is important to rate-limit your crawling so that you do not overload websites with requests. For this project, I visited each URL only once and did not crawl any outbound links. Additionally, always respect a site's robots.txt directives when you crawl.
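
Both precautions can be sketched with the Python standard library. The user-agent string, the one-second default delay, and the helper names below are hypothetical and chosen only for illustration; they are not taken from the project.

```python
import time
import urllib.robotparser
from urllib.parse import urlsplit

USER_AGENT = "example-crawler"  # hypothetical user-agent name


def is_allowed(robots_txt, url, user_agent=USER_AGENT):
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class HostRateLimiter:
    """Enforce a minimum delay between consecutive requests to the same host."""

    def __init__(self, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last = {}  # host -> time of the most recent request

    def wait(self, url):
        """Block until it is polite to request this URL's host again."""
        host = urlsplit(url).netloc
        now = self._clock()
        last = self._last.get(host)
        if last is not None:
            remaining = self.min_delay - (now - last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last[host] = now
```

A crawler would fetch each site's `/robots.txt` once, pass its text to `is_allowed` before requesting a page, and call `HostRateLimiter.wait` immediately before every request.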

DATA PROCESSING
The raw data collected from the 1M Alexa URLs is stored in S3 and processed using Spark running on an Elastic MapReduce (EMR) cluster. The result of data processing is a cleaned-up dataset that maps a URL to a JavaScript variable, and the JavaScript variable to its parent JavaScript library, as shown in the table below. The final application allows the user to type in a JavaScript library name, such as “New Relic”, and see a sample of websites that are using it.

URL                          JavaScript Variable Name    Library Name
http://apmterminals.com      GoogleAnalyticsObject       Google Analytics
http://viavisolutions.com    NREUM                       New Relic
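
The mapping from variable to library can be illustrated in plain Python. This sketch assumes the raw records have already been reduced to (URL, variable name) pairs, and that a lookup table from variable to library is available; only the two lookup entries shown in the table above come from the project. In Spark, the same logic would roughly correspond to a join against the lookup table followed by de-duplication.

```python
# Lookup from a global JavaScript variable to the library that defines it.
VARIABLE_TO_LIBRARY = {
    "GoogleAnalyticsObject": "Google Analytics",
    "NREUM": "New Relic",
}


def map_to_libraries(observations, lookup=VARIABLE_TO_LIBRARY):
    """Turn (url, variable) pairs into unique, sorted (url, variable, library) rows.

    Variables that are not in the lookup (for example site-specific globals)
    are dropped, and repeated observations are collapsed.
    """
    rows = set()
    for url, variable in observations:
        library = lookup.get(variable)
        if library is not None:
            rows.add((url, variable, library))
    return sorted(rows)
```

For example, feeding in the pairs `("http://viavisolutions.com", "NREUM")` and `("http://viavisolutions.com", "someLocalVar")` keeps only the first, since the second variable (a made-up name) has no known parent library.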

WEB APPLICATION
The final results are stored in a SQLite database and served by a Python Flask web application, which you can play with in the iframe below.

Tattleweb
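
The lookup behind such an application could look roughly like the sketch below. The table name `site_library`, its columns, and the function names are assumptions made for the example rather than the application's actual schema; a Flask view could call `sample_urls` with the library name the user typed in.

```python
import sqlite3


def create_schema(conn):
    """Create the (hypothetical) table holding the processed results."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS site_library ("
        "url TEXT NOT NULL, variable TEXT NOT NULL, library TEXT NOT NULL)"
    )


def sample_urls(conn, library, limit=10):
    """Return up to `limit` distinct URLs using the given library.

    The comparison is case-insensitive, and the library name is passed as a
    bound parameter so user input is never interpolated into the SQL string.
    """
    cur = conn.execute(
        "SELECT DISTINCT url FROM site_library "
        "WHERE library = ? COLLATE NOCASE ORDER BY url LIMIT ?",
        (library, limit),
    )
    return [row[0] for row in cur]
```

Opening the database with `sqlite3.connect(...)` once per request and passing the connection to `sample_urls` keeps the query logic separate from the web framework.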