Presenting your skills with real examples

For those graduating with a degree in Analytics or Data Science, there are many resources to work with when it comes to presenting your skills to hiring managers. Some examples are below.

Search for filetype:pdf along with case-study keywords from your field and identify the examples that stand out. Then go over to Google Dataset Search and find a dataset that seems relevant to your field. Break out your spreadsheet, SQL, R or Python chops, extract a few key insights and useful visualizations, and put together a 1 to 2 page document in the style of a case study. A great way to discover what others have done with open datasets is to head over to Kaggle and check out the various kernels that folks have created while exploring a dataset. See this as an example.
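As a minimal sketch of that workflow, the snippet below loads a small open dataset (the same iris CSV used later on this blog) with pandas, prints a quick summary and saves a simple chart you could drop into the case-study document:

import pandas as pd
import matplotlib.pyplot as plt

# Load a small open dataset (the same iris CSV used elsewhere on this blog)
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

# A couple of quick summary statistics for the write-up
print(iris.groupby('species')[['sepal_length', 'petal_length']].mean())

# A simple visualization to include in the 1 to 2 page document
iris.plot.scatter(x='petal_length', y='petal_width', title='Petal length vs width')
plt.savefig('petal_scatter.png')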

Tableau, PowerBI and Google Data Studio all allow users to post public dashboards. Find a dataset, develop dashboards and post them publicly. Take care to polish these dashboards and apply the same rigour you would if this were a course submission. The public dashboards you develop are a great way to showcase your skills to hiring managers.

Shiny contests are a great way to showcase your skills and work on an interesting project. Even if you do not wish to participate in a contest, developing a Shiny app and hosting it online should give you something substantial to point to in conversations with hiring managers.

Your SQL skills can be put to use by analyzing the public datasets available in Google BigQuery. Identify a dataset of interest and analyze it with SQL. Not only will you learn a popular cloud service, but you will also get access to real-world data. Colab is a great way to access BigQuery and develop your analysis.
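As a minimal sketch, the snippet below queries a BigQuery public dataset from a Colab notebook. It assumes you have a Google Cloud project with BigQuery enabled; the project id here is a placeholder you would replace with your own.

# Run inside a Colab notebook
from google.colab import auth
auth.authenticate_user()

from google.cloud import bigquery

client = bigquery.Client(project='my-gcp-project')  # placeholder project id

query = """
SELECT name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY name
ORDER BY total DESC
LIMIT 10
"""
print(client.query(query).to_dataframe())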

If you’d like to showcase your ML chops, participating in Kaggle competitions is a great way to get started. You can also develop an end-to-end data product. Use Dash to develop a web application with a simple front-end for users to interact with your model.
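To give a flavor of what that could look like, here is a minimal Dash sketch with a text input wired to a placeholder predict function; you would swap in your own trained model.

from dash import Dash, html, dcc, Input, Output

def predict(text):
    # Placeholder for a real model; returns the character count of the input
    return 'Predicted score: {}'.format(len(text or ''))

app = Dash(__name__)
app.layout = html.Div([
    dcc.Input(id='user-input', type='text', placeholder='Type something'),
    html.Div(id='prediction'),
])

@app.callback(Output('prediction', 'children'), Input('user-input', 'value'))
def update_prediction(value):
    return predict(value)

if __name__ == '__main__':
    app.run_server(debug=True)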

Looking for your first job can be made easier by showcasing your skills with some of the examples from the list above. This may not guarantee anything, but it definitely helps hiring managers learn more about your skills and capabilities. It also shows your resolve to improve your chances of success by doing more than is usually expected. Best of luck!

Thoughts on the use of AI models distributed online

AI model marketplaces come in various shapes and sizes. Outside of the conventional marketplace approach, large Internet companies have made open-source contributions that include not just libraries but also models. BERT is an example of a model distributed online by Google that has gained popularity in recent times amongst industry practitioners and researchers alike.

The distribution of pre-trained NLP models by huggingface, and efforts like the model zoo and the Open Neural Network Exchange, make pre-trained models easily accessible to a wide audience. huggingface in particular has developed libraries that have gained significant traction in the NLP community for their ease of use and the wide variety of models they make available, often accessible with just a few lines of code.
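For example, a sentiment classifier can be loaded with the transformers library in a couple of lines (the default pre-trained model is downloaded on first use):

from transformers import pipeline

classifier = pipeline('sentiment-analysis')
print(classifier('Pre-trained models make NLP much more accessible.'))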

An example of a traditional marketplace for AI models is the AWS Model Marketplace. An extensive survey by M. Xiu, Z. M. J. Jiang and B. Adams, “An Exploratory Study on Machine-Learning Model Stores,” in IEEE Software, explores features of model marketplaces while comparing them with popular app stores. The comparison is informative and identifies how model stores are organized along the following dimensions;

  • Product Information
  • Technical Documentation
  • Delivery
  • Business
  • Product Submission & Store Review
  • Legal Information

The table below is from this survey paper, accessible on arxiv.

With AI marketplaces proliferating and pre-trained models being widely adopted, it has become imperative for practitioners to explore any underlying biases that models might be prone to. Many consumers of models may overlook a model’s antecedents in much the same way that many consumers don’t rigorously investigate the apps they install on their smartphones. While this may sound like an unforgiving indictment of AI practitioners, it isn’t hard to imagine that model bias is overlooked and that many models are used in a plug-and-play manner without much investigation of their origins.

A recent pre-print, Fairness in Deep Learning: A Computational Perspective, describes many aspects of fairness in AI models and develops a framework with which to measure and mitigate bias. Some examples of model bias are presented in the table below, referenced from the pre-print.

A comprehensive collection of resources related to fairness in AI can be found here. It is a matter of time before marketplaces adopt fairness testing infrastructure as a gate for models to pass before they can be released or distributed.

Another very important aspect that practitioners need to care about as they consume models from marketplaces is model interpretability. Model interpretability, especially in the context of deep neural networks, is a critical piece in the deployment and usage of AI models. As models become more complex in structure and permeate our day-to-day lives, knowing which levers affect specific individual predictions can go a long way in interpreting model decisions. Model false positives may sound clinical and dry, but their manifestation in the real world can upend the lives of individuals at the receiving end of AI errors. Such situations may demand looking under the hood of black-box models to discern their inner mechanisms.

The diagram below is from a recent pre-print, Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges toward Responsible AI, and describes the complexity involved with the explainability of AI models.

The use of AI models, whether consumed from a marketplace or developed in-house, requires practitioners to be considerate of a large number of aspects. The figure below, from the same paper referenced above, puts this in perspective under the rubric of Responsible AI.

The use of AI models that are distributed online poses many challenges but also offers a large number of advantages. Models distributed by large organizations are often difficult to develop in-house without expending significant resources, and doing so is often financially prohibitive. The use of AI is only increasing, and practitioners are responsible for developing models that have passed through critical checks and balances before they are allowed to operate in production. Just as sophisticated libraries and tools have made it increasingly easy to train complex models, there are now many resources available to ease the adoption of Responsible AI practices. While data cleaning and feature engineering consumed the bulk of the model lifecycle until a few years ago, it is increasingly obvious that the bulk of the lifecycle will involve the tasks that need to be performed under the umbrella of responsible and sustainable AI.

Crowdsourcing IP reputation data from online forums

In this post, I discuss the topic of crowdsourcing IP reputation data from online forums. The post is inspired by a paper I read recently, Gharibshah J., Papalexakis E.E., Faloutsos M. (2018) RIPEx: Extracting Malicious IP Addresses from Security Forums Using Cross-Forum Learning. In: Phung D., Tseng V., Webb G., Ho B., Ganji M., Rashidi L. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science, vol 10939. Springer, Cham. Read it on arxiv.

I describe some concerns I see with crowdsourcing of security-related data while also discussing potential solutions.

In fighting fraud and malicious behavior, organizations are better suited to using internal data sources and institutional knowledge to quantify risk. Trust and Safety groups in many organizations that deal with user-generated content, process payments or serve visitors online often have access to data science models that derive risk scores from a variety of internal data sources. These risk scores are then utilized to apply friction such as throttling services/APIs, rate-limiting of user behavior and in egregious cases, banning users from the platform or from accessing their service.

The importance of internal data sources for assessing risk is obvious. One compelling argument in favor of incorporating external data sources is the deterrence they provide against bad actors who have not yet visited your service but may in the future. Additionally, increasing your inventory of known high-risk entities is a reasonable endeavor, if not a necessary one. For organizations that wish to harden their services and systems from the outset (a new product or service release), a blocklist is essential and can often be derived from freely available online data.

Popular crowdsourced blocklists exist to block ads, throwaway email services and hosting providers, and these can be leveraged by organizations to improve their defenses. These lists often exist in easily consumable text formats with accompanying code for easy integration, and should perhaps be considered a first step in consolidating and utilizing security information available online. Many of the concerns that may plague sourcing information from online forums are mitigated when consuming the kind of lists mentioned here. Some of these sources are public repositories on GitHub, which reveal the level of activity, tenure and maturity of the repository. This is invaluable when sourcing online data to incorporate into internal decision-making processes, especially considering that real users can be impacted. Denying services to real users due to false positives not only affects businesses adversely with real revenue loss, but also denies users access to essential online services that are critical to professional workflows and personal conveniences.
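As a rough sketch of how easily such a list can be consumed, the snippet below fetches a plain-text blocklist and checks an address against it. The URL is a hypothetical placeholder; real lists follow a similar one-entry-per-line format.

import requests

BLOCKLIST_URL = 'https://example.com/ip-blocklist.txt'  # hypothetical placeholder

resp = requests.get(BLOCKLIST_URL, timeout=10)
blocked = {line.strip() for line in resp.text.splitlines()
           if line.strip() and not line.startswith('#')}

def is_blocked(ip):
    return ip in blocked

print(is_blocked('203.0.113.7'))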

Web crawling, a necessity for crowdsourcing data from online forums, isn’t a trivial undertaking. With commercial scraping services available, some of the challenges associated with web scraping can be mitigated. The tradeoff is finding (and then crawling) enough forums to generate reliable and varied data, while also managing the cost associated with maintaining and updating crawlers.

Parsing crawled data is another challenge. It requires careful quality control, such as IP address verification, tuning the crawl frequency of target sites, dealing with the staleness of forum data, and defending against adversarial attacks wherein forum data is poisoned to spoil the integrity of anti-fraud models or blocklists. These challenges are compounded when dealing with a large number of forums that vary in structure and web markup semantics, which can complicate and overwhelm the development of forum-specific parsers.

Many of these challenges are expensive to solve but not insurmountable. IP addresses are well understood, and most programming languages have mature libraries to verify IP address strings against the standards. Forum integrity is largely based on the popularity of a forum in the security community, the volume and recency of activity and the quality of discussions. For security professionals, it may be relatively easy to identify which forums have a high signal-to-noise ratio and which forums to ignore. This reduces parse complexity by reducing the number of forums to crawl and parse.
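For instance, in Python the standard-library ipaddress module is enough to validate candidate strings scraped from forum posts:

import ipaddress

def valid_ip(candidate):
    # Returns True if the string parses as a valid IPv4 or IPv6 address
    try:
        ipaddress.ip_address(candidate)
        return True
    except ValueError:
        return False

print(valid_ip('198.51.100.23'))   # True
print(valid_ip('198.51.100.999'))  # False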

Other concerns related to fake or poisoned data can be addressed by cross-referencing against other reliable security forums and against the well-formed lists available on GitHub. The absence of any overlap may not be a sure giveaway of bad data, but it can be used as a signal to downsample those entities or discard them altogether. Additionally, the usernames associated with the forum posts being crawled can be used to develop a user reputation over time, which can help filter out forum data from users who post infrequently, share information without any overlap with other reliable forums, or post messy data.

While the cost of crowdsourcing data from online security forums may seem intractable, there are some clear advantages over IP reputation data available from blocklists or IP databases such as Maxmind. Some advantages are;

  • Availability of information in the early phases of the attack life-cycle. A user or group of users may have posted details that may otherwise take much longer to percolate into more conventional data repositories.
  • Forums may contain rich details posted by the user, providing additional context around the IP addresses shared. E.g., knowing if the IP belongs to a botnet, hosting provider or anonymous proxy can help augment downstream systems with additional features to aid in decision making. Additional details, such as the specific botnet name, can provide useful tags for downstream analytics.
  • Access to multiple, highly reliable oracles. Security researchers and white-hat hackers are always on the lookout for threats and threat intelligence. A large number of such individuals share their insights freely and openly online. The information shared is very valuable, not to mention multi-faceted, and can provide additional levers with which to improve defenses.

Crowdsourcing data from online security forums is recommended for teams that have access to cross-functional professionals, given the variety of technologies that need to be stitched together to create and maintain such a program. Alternatively, threat intelligence vendors consolidate data from a variety of sources and make consumable feeds available as commercial offerings. Organizations are also known to come together to share knowledge and threat intelligence. While the idea of crowdsourcing data is appealing, its effectiveness should be weighed against sourcing threat intelligence data from more conventional sources. Nevertheless, I believe it is a worthy endeavor to identify reliable online data sources that can be assimilated into the overall threat intelligence strategy.

Developing a data product

I wanted to develop a product end to end. One idea was a service that would identify the web technologies used by popular sites. The “end to end” development of such a product would require crawling URLs, parsing raw website data, data processing, server-side web development, UI work and serving the final product online, all aspects I wanted to tinker with. The final product would allow a user to input a JavaScript library name and see a sample of URLs purportedly using that library.

The top 1 million Alexa sites seemed like a good place to find a list of URLs to crawl.

Most of the development was done using AWS services. The final web application, though, is served using Google Cloud’s Cloud Run service.

There are a few high-level moving parts that need to be highlighted;

1) Crawler
2) Data processing
3) Web service/application

CRAWLER
Simply put, crawlers are programs that extract information from URLs. URLs found while parsing the information extracted from previously crawled URLs are also crawled. This process is popularly described as “spidering”. Though simple in concept, implementing a high-performance crawler is challenging. For my purposes, I created a simple crawler as shown in the diagram below;

1) A Simple Queue Service (SQS) queue is seeded with the top 1M Alexa URLs.
2) Each EC2 instance runs the crawler process. The crawler process 1) pulls a URL to crawl from SQS, 2) requests the URL and 3) submits the received response to a Kinesis Firehose stream.
3) The Kinesis Firehose stream transports the data to S3.

The crawler uses Splash, a headless browser service that requests the URL. A powerful feature of Splash is the ability to write Lua scripts to perform customizations.
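The Lua customizations aren’t shown here, but as a minimal sketch, a crawler process can ask a locally running Splash instance to render a URL over its HTTP API and pass the rendered HTML downstream:

import requests

# Assumes a Splash instance is running locally, e.g. docker run -p 8050:8050 scrapinghub/splash
SPLASH_URL = 'http://localhost:8050/render.html'

resp = requests.get(SPLASH_URL, params={'url': 'http://example.com', 'wait': 2})
html = resp.text  # rendered page, ready to be submitted to the Firehose stream
print(len(html))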

It is important to rate-limit your crawling and not overload websites with your requests. For this project, I only visited a URL once and did not crawl any outlinks. Additionally, always respect robots.txt directives when you crawl.
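A simple way to honor robots.txt from Python is the standard-library robotparser; the user agent string below is just an illustrative placeholder:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://example.com/robots.txt')
rp.read()

# Only fetch the page if robots.txt allows it for our user agent
print(rp.can_fetch('my-crawler', 'http://example.com/'))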

DATA PROCESSING
Raw data collected from the 1M Alexa URLs is stored in S3 and processed using Spark running on an Elastic MapReduce (EMR) cluster. The result of data processing is a cleaned-up dataset that maps a URL to a JavaScript variable, and the JavaScript variable to its parent JavaScript library, as shown in the table below. The final application allows the user to type in a JavaScript library name, such as “New Relic”, and see a sample of websites that are using it.

URL | JavaScript Variable Name | Library Name
http://apmterminals.com | GoogleAnalyticsObject | Google Analytics
http://viavisolutions.com | NREUM | New Relic
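The EMR job itself isn’t reproduced in this post, but a rough PySpark sketch of the mapping step might look like the following. The input path, record schema and the variable-to-library mapping are hypothetical placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('js-library-mapping').getOrCreate()

# Hypothetical: crawl records stored as JSON lines with 'url' and 'html' fields
raw = spark.read.json('s3://my-crawl-bucket/raw/')

# Hypothetical mapping of tell-tale JavaScript globals to their parent libraries
js_globals = {'GoogleAnalyticsObject': 'Google Analytics', 'NREUM': 'New Relic'}

matches = None
for var_name, library in js_globals.items():
    hits = (raw.filter(F.col('html').contains(var_name))
               .select('url',
                       F.lit(var_name).alias('js_variable'),
                       F.lit(library).alias('library')))
    matches = hits if matches is None else matches.unionByName(hits)

matches.write.mode('overwrite').parquet('s3://my-crawl-bucket/processed/')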

WEB APPLICATION
The final results are stored in a SQLite database and served using a Python Flask web application, which you can play with in the iframe below.
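The application code isn’t included in this post, but a stripped-down Flask endpoint over a SQLite table could look like the sketch below; the database file and table schema are hypothetical stand-ins for the real ones.

from flask import Flask, jsonify, request
import sqlite3

app = Flask(__name__)
DB_PATH = 'tattleweb.db'  # hypothetical database with a url_library(url, js_variable, library) table

@app.route('/lookup')
def lookup():
    library = request.args.get('library', '')
    conn = sqlite3.connect(DB_PATH)
    rows = conn.execute('SELECT url FROM url_library WHERE library = ? LIMIT 20',
                        (library,)).fetchall()
    conn.close()
    return jsonify([r[0] for r in rows])

if __name__ == '__main__':
    app.run(port=5000)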

Tattleweb

Label images with a deep neural network

It isn’t often that you need to label lots of images. But when the need arises, you can use one of the many pre-trained deep neural network models available for image labeling tasks. One of the most popular is ResNet50, and keras provides a convenient way of using it.

SETUP

I’m a big fan of Docker and will use the gw000/keras Docker image, which comes with the necessary libraries needed to use ResNet50 for image labeling. A few additional tweaks are still needed, so I’ll build a custom Docker image on top of gw000/keras.

First, create a directory in your workspace and cd into it.

mkdir resnet50labeling
cd resnet50labeling

Create a Dockerfile.

touch Dockerfile

Paste the following into your Dockerfile.

FROM gw000/keras:2.1.4-py2-tf-cpu

# install dependencies from debian packages
RUN apt-get update -qq \
 && apt-get install --no-install-recommends -y \
    python-matplotlib \
    python-pillow \
    wget

WORKDIR /

Build the image.

docker build -t resnet50labeling:v1 .

Create a container from the image and shell into it.

docker run -it resnet50labeling:v1 /bin/bash

LABELING IMAGES

Once inside the container, download an image to label. The command below downloads an image of a golden retriever.

wget https://i.ytimg.com/vi/SfLV8hD7zX4/maxresdefault.jpg

Next, start a Python session inside the container and get ready to label the image downloaded above. The Python snippet below labels the downloaded image. While executing it, a large file download will take place; this file contains the pre-trained weights of the ResNet50 deep neural network needed to label the image.

import keras
from keras.applications.resnet50 import ResNet50
from keras.preprocessing import image
from keras.applications.resnet50 import preprocess_input, decode_predictions
import numpy as np

model = ResNet50()

input_img_path = 'maxresdefault.jpg'
img = image.load_img(input_img_path, target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)

predictions = model.predict(x)
print('Predicted:', decode_predictions(predictions, top=3)[0])

If all goes well, you should see the top three predicted labels and their probabilities printed.

REFERENCES
ResNet50 is a large deep neural network architecture trained on more than a million images from ImageNet. This paper and this one have more details on the original ResNet models.

ResNet50 in Keras defines the deep neural network architecture and provides a convenient way to use it as I’ve shown in this post.

Deploy a scalable image labeling service

If you’ve developed a deep neural network model that takes an image and outputs a set of labels, bounding boxes or any other piece of information, then you may have wondered how to make your model available as a service. If so, the post below may provide some answers.

Building blocks discussed in the post are;

1) Docker to containerize the web application.
2) ResNet50 provides the pre-trained deep neural network that labels an image.
3) Python CherryPy is the web framework used to develop the web application.
4) Google Cloud Platform’s Cloudrun is the service that is used to deploy the containerized web application to the cloud.

SETUP

Create a directory in your workspace and cd into it.

mkdir resnet50service
cd resnet50service

Create a Dockerfile.

touch Dockerfile

Copy the following into the Dockerfile.

FROM gw000/keras:2.1.4-py2-tf-cpu

# install dependencies from debian packages
RUN apt-get update -qq \
 && apt-get install --no-install-recommends -y \
    python-matplotlib \
    python-pillow \
    wget

# install dependencies from python packages
RUN pip --no-cache-dir install \
    simplejson \
    cherrypy

WORKDIR /
COPY resnet50_service.py /
COPY resnet50_weights_tf_dim_ordering_tf_kernels.h5 /
ENTRYPOINT ["python", "resnet50_service.py"]

The Dockerfile describes the “recipe” to create a suitable environment for your application. The base image already comes with the keras library. A few additional libraries are installed. cherrypy is used to develop the web application.

resnet50_service.py is the Python program that creates the image labeling service, and resnet50_weights_tf_dim_ordering_tf_kernels.h5 contains the pre-trained model weights that are used to predict labels for an image.

CREATE IMAGE LABELING SERVICE

Copy the Python code below into a file named resnet50_service.py and save it in the resnet50service directory.

from keras.applications.resnet50 import ResNet50
from keras.preprocessing import image
from keras.applications.resnet50 import preprocess_input, decode_predictions
import numpy as np
import os
import cherrypy
import tensorflow as tf
import uuid
import simplejson


print("ResNet50 service starting..")

# Initialize model
model = ResNet50(weights='resnet50_weights_tf_dim_ordering_tf_kernels.h5')
graph = tf.get_default_graph()  # See https://github.com/keras-team/keras/issues/2397#issuecomment-254919212 explaining need to save the tf graph


def classify_with_resnet50(img_file):
    label_results = []
    img = image.load_img(img_file, target_size=(224, 224))
    os.remove(img_file)
    # Convert to array
    img_arr = image.img_to_array(img)
    img_arr = np.expand_dims(img_arr, axis=0)
    img_arr = preprocess_input(img_arr)
    # Make prediction and extract top 3 predicted labels
    # see https://github.com/keras-team/keras/issues/2397#issuecomment-254919212 for additional details on using global graph
    global graph
    with graph.as_default():
        predictions = model.predict(img_arr)
    predictions = decode_predictions(predictions, top=3)[0]
    for each_pred in predictions:
        label_results.append({'label': each_pred[1], 'prob': str(each_pred[2])})
    return simplejson.dumps(label_results)


class ResNet50Service(object):
    @cherrypy.expose
    def index(self):
        return """
        <html>
        <head>
        <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.4.0/jquery.js"></script>
        </head>
        <body>
        <script>
        // ref https://codepen.io/mobifreaks/pen/LIbca
        function readURL(input) {
            if (input.files && input.files[0]) {
                var reader = new FileReader();
                reader.onload = function (e) {
                    $('#img_upload')
                        .attr('src', e.target.result);
                };
                reader.readAsDataURL(input.files[0]);
            }
        }
        </script>
        <form method="post" action="/classify" enctype="multipart/form-data">
        <input type="file" name="img_file" onchange="readURL(this);"/>
        <input type="submit" />
        </form>
        <img id="img_upload" src=""/>
        </body>
        </html>
        """

    @cherrypy.expose
    def classify(self, img_file):
        upload_path = os.path.dirname(__file__)
        upload_filename = str(uuid.uuid4())
        upload_file = os.path.normpath(os.path.join(upload_path, upload_filename))
        with open(upload_file, 'wb') as out:
            while True:
                data = img_file.file.read(8192)
                if not data:
                    break
                out.write(data)
        return classify_with_resnet50(upload_file)


if __name__ == '__main__':
    cherrypy.server.socket_host = '0.0.0.0'
    cherrypy.server.socket_port = 8080
    cherrypy.quickstart(ResNet50Service())

There are two endpoints defined, and both respond to an image by returning the top 3 predicted labels with their probabilities. One endpoint, /, provides a simple UI to upload an image to the service and is accessible via a web browser; the other endpoint, /classify, accepts image files programmatically.

DOWNLOAD PRE-TRAINED MODEL WEIGHTS

Download the ResNet50 pre-trained weights file into the resnet50service directory.

wget https://github.com/fchollet/deep-learning-models/releases/download/v0.2/resnet50_weights_tf_dim_ordering_tf_kernels.h5

You should now have three files in your resnet50service directory;

1) Dockerfile

2) resnet50_service.py

3) resnet50_weights_tf_dim_ordering_tf_kernels.h5

You are now ready to create the container that runs the image labeling service.

CREATE AND RUN CONTAINER IMAGE

docker build -t resnet50labelservice:v1 . will build the image.

docker run -it -p 8080:8080 resnet50labelservice:v1 will run the image labeling service inside the container.

LABEL AN IMAGE FILE

Open your web browser and type localhost:8080 in the address bar. You will see a simple form to which you can upload an image. Choose an image file you would like to label and submit it to the form.

The following Python snippet demonstrates how you can submit images to the /classify endpoint programmatically.

import requests

labeling_service_url = 'http://localhost:8080/classify'
# Replace below with an image file of your choice
img_file = {'img_file': open('maxresdefault.jpg', 'rb')}
resp = requests.post(labeling_service_url, files=img_file)

print(resp)
print(resp.text)

You have successfully created a containerized image labeling service that uses a deep neural network to predict image labels. At this point, you can deploy this container on an internal server or to a cloud service of your choice. The rest of this post describes how you can deploy this containerized application to Google Cloud Platform’s Cloudrun service. Steps described below will work for any containerized application.

DEPLOY TO CLOUDRUN

Cloud Run is a managed compute platform that automatically scales your stateless containers.

Do the following before you can start the process of building and deploying your container.

  • Install Google Cloud SDK.
  • Create a project in Google Cloud console.
  • Enable Cloud Run API and Cloud Build API for the newly created project.

You are now ready to follow these instructions to build and deploy your container to Google Cloud Run. The instructions first help you build your container and submit it to Google Cloud’s container registry, after which you run the container to create a service.

Run the following commands while you are in the resnet50service directory.

The gcloud builds submit command below will build your container image and submit it to Google Cloud’s container registry. Replace mltest-202903 in the commands below with your own project’s ID.

gcloud builds submit --tag gcr.io/mltest-202903/resnet50classify

Now that you have built and submitted your container image, you are ready to deploy the container as a service.

The gcloud beta run deploy command below will create a revision of your resnet50classify service and deploy it. The --memory 1Gi parameter is necessary; without it the deployment fails because the container requires more than the default memory allocation.

Once you invoke the command below, you will be prompted to choose a region (select us-central1) and service name (leave default). For testing purposes you can choose to allow unauthenticated invocations but remember to delete this service after you are done testing.

After the command succeeds you will be given a URL which you can paste in your browser to load the image labeling submit page. Give it 3 to 4 seconds to load.

gcloud beta run deploy --image gcr.io/mltest-202903/resnet50classify --memory 1Gi --platform managed

After successfully deploying, I received the URL https://resnet50classify-ntnpecitvq-uc.a.run.app.

REFERENCES

Very helpful tutorials here and here on file upload in cherrypy.

Using a tensorflow model in a web framework can cause inference to happen in a different thread than the one in which the model was loaded, leading to ValueError: Tensor .... is not an element of this graph. I faced the same issue and used the solution provided here.

See here to further customize the gcloud beta run deploy command.

Learn SQL in a browser with PostgreSQL and pgweb

PostgreSQL is a very versatile database. If you want to learn SQL, a quick way to start is to 1) grab some data you want to analyze, 2) insert it into a PostgreSQL table and 3) use a SQL client such as pgweb to start analyzing the data. You can use any SQL client of your choice, but pgweb is easy to use and browser-based, which makes it a very convenient choice.

As with many of my posts, I’ll use Docker to run PostgreSQL and pgweb.

START A POSTGRESQL CONTAINER

docker run -d -p 5432:5432 --name postgres_db -e POSTGRES_PASSWORD=postgres postgres

START A PGWEB CONTAINER

docker run -d -p 8081:8081 --link postgres_db:postgres_db -e DATABASE_URL="postgres://postgres:postgres@postgres_db:5432/postgres?sslmode=disable" sosedoff/pgweb --readonly

The --link flag here allows the pgweb container to access the PostgreSQL database running inside the other container. Recall that the --name flag set the name of the PostgreSQL container to postgres_db. That name is now being used to point the pgweb container to where PostgreSQL is running, as is evident in the DATABASE_URL environment variable, which provides the connection details pgweb needs to connect to PostgreSQL.

Go over to your browser and type localhost:8081 to access pgweb. There isn’t any data available yet, which will be fixed below.

INSERT DATA INTO POSTGRESQL

In the snippet below, I use pandas to grab the famous iris dataset. I then define a connection to the PostgreSQL container using sqlalchemy. Since I’m doing this on the host system, I can use localhost to connect to the container; the published port is available on localhost on the host system.

Finally, I use the to_sql function to write the dataset to the database.

import pandas as pd
from sqlalchemy import create_engine

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

engine = create_engine('postgresql://postgres:postgres@localhost:5432/postgres')

iris.to_sql("iris", engine, if_exists='replace', index=False, chunksize=1000)

Now when I go over to localhost:8081 in my browser I see iris in the list of tables.

NOTE

PostgreSQL: Persisting your PostgreSQL data beyond the life of the container requires you to set the appropriate -v flags. Besides data persistence, there are other factors you may have to deal with when running PostgreSQL inside a container. Please find more details here for some great suggestions on this topic.

pgweb: You may want to limit access to the SQL client, run pgweb on a different port, allow read-only access to the database and so on. A large number of options can be found here. To enable these options you may need to create a custom Dockerfile for pgweb and enable the options of your choice. You can reuse the pgweb Dockerfile found here.

Learning Docker by building an R environment

Docker is a technology that makes it very easy to try a piece of software without running into the installation problems you might hit if you installed it natively on your system. Docker gives you an environment that is kept separate from the rest of your computer, which you can use as a playground to try different technologies.

In the post below I describe creating an R environment using RStudio Server with Docker.

If you don’t have Docker installed, get Docker Desktop here. Once it is installed, the instructions below describe setting up RStudio Server.

GET RSTUDIO DOCKER IMAGE
In a terminal, invoke the following command;

docker pull rocker/rstudio

docker images will list the images you have locally. With the rstudio image downloaded, you are ready to start RStudio Server.

docker run -d -p 8787:8787 -v $(pwd):/home/rstudio -e PASSWORD='5tr0nG&_passW0rD' rocker/rstudio

Flags passed to the docker run command

The docker run command creates a container from an image. The flags passed to the command are described below;

-d starts the container in detached mode. Setting this returns the terminal prompt to you after invoking the command.

-p is to map ports between the container and the host machine.
RStudio Server makes RStudio available as a web service, and a web service runs on a specific port. When running a web service inside a Docker container, you need to map the port inside the container to a port on the host system so that the service is reachable from the host. As a result of mapping ports, you will be able to access RStudio via a web browser on your system.

-v is used to mount a folder on your host to a folder inside the container. With -v $(pwd):/home/rstudio you are mapping the current folder $(pwd) to the /home/rstudio folder inside the container.

Mounting folders allows you to store critical files, code, data on your host system while using a container only for processing.

A word of caution here. Your actions inside the container can easily delete files in the mounted folder on the host system. Please exercise caution when deleting files inside the container.

-e allows you to set environment variables in the container. With -e PASSWORD='5tr0nG&_passW0rD' we are creating an environment variable PASSWORD inside the container and giving it 5tr0nG&_passW0rD as the value (the quotes keep the shell from interpreting the & character). The PASSWORD environment variable is necessary because RStudio Server expects it and will error out if it is not set.

Finally, in the docker run command, after the flags have been set, you specify the image name, in this case rocker/rstudio.

You now have RStudio Server running inside a container and available to you on the host. Open up a browser and type localhost:8787 in the address bar. Login with rstudio as the username and the password you set in the environment variable.

CUSTOM DOCKER IMAGE

What we did earlier was to download a pre-built image. These images are contributed by people who have written a Dockerfile (the recipe for building up the image), built the actual image from the Dockerfile and then submitted it to Docker Hub for distribution.

As an example, consider the tidyverse Docker image. The Dockerfile can be found here and is also shown below.

FROM rocker/rstudio:3.6.2

RUN apt-get update -qq && apt-get -y --no-install-recommends install \
    libxml2-dev \
    libcairo2-dev \
    libsqlite-dev \
    libmariadbd-dev \
    libmariadbclient-dev \
    libpq-dev \
    libssh2-1-dev \
    unixodbc-dev \
    libsasl2-dev \
    && install2.r --error \
    --deps TRUE \
    tidyverse \
    dplyr \
    devtools \
    formatR \
    remotes \
    selectr \
    caTools \
    BiocManager

Copy the contents of the tidyverse Dockerfile into a file locally and name it Dockerfile. Don’t add a .txt or other extensions to the file. Edit your local Dockerfile and add any additional R libraries you want installed in the image. You can append R library names to the existing list of R libraries already being installed in the Dockerfile.

docker build -t tidyverse-custom:first . will build your custom image from the Dockerfile you just created.

In the docker build command, -t sets the name of the image in name:tag format. For example, tidyverse-custom is the name and first is the tag. The docker build command will now build the image. This can take a while, so be prepared to wait.

Once your image is ready, you can invoke docker run on your image with the same flags described above to stand up a container running RStudio Server with R libraries of your choice.

docker run -d -p 8799:8787 -v $(pwd):/home/rstudio -e PASSWORD='5tr0nG&_passW0rD' tidyverse-custom:first

Remember to choose another host port if 8787 on your host is still in use. I’ve used 8799 in the command above.

docker ps lists your running containers. Choose the container id you wish to stop and provide it to docker stop to shut down a running container.

Hello!

Thank you for stopping by. My name is Harsh Singhal and I like to write on cybersecurity, data science, machine learning, and analytics. A summary of my professional experience can be found on LinkedIn.

Please read the blog disclaimer here.