Tutorial – Creating a webchat with web.py

BitBucket Repository

I have previously written a tutorial on how to do long polling with the webpy framework. This is an updated version, with a code repository that you can use to include a webchat subapp in your webpy applications.

Tools required

Frontend implementation – Layouts

The main page consists of one main area where the messages are displayed, and an area at the bottom with a text input where messages are submitted.

All the JavaScript code goes into webchat.js, which is described in more detail later.

layouts/index.html (the directory passed to web.template.render later on):

$def with (msgs, idx)
$# idx is the id of the newest message; Index.GET passes it in as the second argument
<script src="static/js/jquery-1.11.1.min.js"></script>
<script src="static/js/webchat.js"></script>
<div id="msgs">
<h1>Messages</h1>
$for msg in msgs:
    <div class="msg">$msg</div>
</div>
<div id="entry">
    <input id="message" type="text" placeholder="Message here ..." />
    <input type="submit" value="submit" />
</div>

The styling is not so important for this project, but you can view the CSS style in the code repository.

Storing the messages

We will be using SQLite to store our messages. Open up your console, run sqlite3 database.db, and create the table using the following schema:

CREATE TABLE msgs(
    msg_id INTEGER PRIMARY KEY,
    msg_content TEXT NOT NULL
);
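
If you would rather create the table from Python than from the console, a minimal sketch using the standard sqlite3 module could look like this. The init_db helper, the IF NOT EXISTS guard and the in-memory database are assumptions made for illustration; the tutorial itself uses the file database.db.

```python
import sqlite3

# Same schema as above; IF NOT EXISTS is added so the script can be re-run safely
SCHEMA = """
CREATE TABLE IF NOT EXISTS msgs(
    msg_id INTEGER PRIMARY KEY,
    msg_content TEXT NOT NULL
);
"""

def init_db(path):
    """Open (or create) the database at `path` and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn

# ':memory:' keeps this demo self-contained; pass 'database.db' for the real file
conn = init_db(':memory:')
conn.execute("INSERT INTO msgs (msg_content) VALUES (?)", ('hello',))
rows = conn.execute("SELECT msg_id, msg_content FROM msgs").fetchall()
```

Because msg_id is an INTEGER PRIMARY KEY, SQLite assigns it automatically, which is what lets the chat use it as a running message index.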

Simulating real-time events – Client side

To simplify our lives, we will be using jQuery for the AJAX calls. Every time a call returns, the message wrapper (#msgs) is updated and another call is made.

Another function is needed to send the message to the server, which stores it in the database.

These functions will be included inside /static/js/webchat.js:

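function sendMsg(){
    var input = $('#message')[0];
    var msg = input.value;

    if(msg == ''){
        alert('Message cannot be blank');
    }else{
        input.value = '';
        // Encode the message so characters like & and # survive the query string
        $.ajax({url: '/send?msg=' + encodeURIComponent(msg)});
    }
}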

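function longPoll(idx){
    $.get('/get?idx=' + idx).done(function(data){
        // The server replies with a JSON string; parse it rather than eval it
        if(typeof data === 'string'){
            data = JSON.parse(data);
        }
        var msgs = data['msgs'];
        if(msgs){
            for(var i = 0; i < msgs.length; i++){
                // .text() escapes the message so it cannot inject markup
                $('<div class="msg"></div>').text(msgs[i]).appendTo('#msgs');
            }
        }
        // Poll again, starting from the last message id we have seen
        longPoll(data['idx']);
    });
}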

Simulating real-time events – Server side

Framework for the webpy application

This is the basic framework for webchat.py:

import web, time, json

urls = [
    '/', 'Index',
    '/send', 'SendMsg',
    '/get', 'GetMsg'
]
render = web.template.render('layouts')
app = web.application(urls, globals())
db = web.database(dbn='sqlite', db='database.db')

class Index:
    def GET(self):
        return

class SendMsg:
    def GET(self):
        return

class GetMsg:
    def GET(self):
        return

if __name__ == '__main__':
    app.run()

Serving the existing messages

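Alter the Index class to read the messages from the database and render them inside the layout:

class Index:
    def GET(self):
        # Order by id so the last row really is the newest message
        msgs = list(db.select('msgs', order='msg_id'))

        content = [m['msg_content'] for m in msgs]
        # Start from index 0 while the table is still empty
        last_id = msgs[-1]['msg_id'] if msgs else 0
        return render.index(content, last_id)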

Add new messages to the database

Insert the content received on /send into the database:

class SendMsg:
    def GET(self):
        i = web.input()
        if i.get('msg'):
            db.insert('msgs', msg_content=i.get('msg'))
            return ''
        else:
            # HTTP errors are raised, not returned, in webpy
            raise web.notfound()

Getting new messages – Long Polling

This is the important bit: the server side code for the long polling process.

class GetMsg:
    def GET(self):
        i = web.input().get('idx')
        if not i or not i.isdigit(): i = '0'

        max_iter = 20; iter = 0
        msgs = []
        # Keep the request open until new messages arrive or we run out of tries
        while not len(msgs) and iter < max_iter:
            res = db.select('msgs', where='msg_id>'+i)
            msgs = [r for r in res]
            iter += 1
            if not msgs:
                # Pause between checks; one second is an assumed interval
                time.sleep(1)

        if len(msgs): i = msgs[-1]['msg_id']

        return json.dumps({
            'msgs': [m['msg_content'] for m in msgs],
            'idx': i
        })

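A bit of explanation here. The while loop and the time.sleep call are what make the whole idea work. The server keeps the request open until it detects a change in the database, or until the loop runs out of iterations, in which case it returns an empty list and the client simply polls again.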
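
To see the loop in isolation, here is a small sketch of the same idea with the webpy and database parts stripped out. The names wait_for_messages and fetch_after are hypothetical, and rows are assumed to be (msg_id, msg_content) tuples; it is meant as an illustration of the pattern, not as the code from the repository.

```python
import time

def wait_for_messages(fetch_after, last_id, max_iter=20, delay=1.0, sleep=time.sleep):
    """Call fetch_after(last_id) until it returns rows or max_iter checks have passed.

    fetch_after is any callable returning a list of (msg_id, msg_content)
    tuples with an id greater than last_id, oldest first.
    """
    for _ in range(max_iter):
        rows = fetch_after(last_id)
        if rows:
            # New messages: return them together with the newest id seen
            return {'msgs': [content for _id, content in rows], 'idx': rows[-1][0]}
        # Nothing new yet: hold the request open a little longer
        sleep(delay)
    # Timed out: report nothing new and keep the caller's index unchanged
    return {'msgs': [], 'idx': last_id}
```

The sleep parameter is only there so the function can be exercised without actually waiting, for example by passing sleep=lambda seconds: None.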

Running the server – Gevent WSGIServer

If you are running this as a subapp, this is all you need to know. If you are running it as a standalone application, do read on.

Webpy’s default web server is not that great for production. Instead, we will be using the Gevent WSGIServer to run the application.

from gevent import monkey, pywsgi
monkey.patch_all()

''' Whatever code you have '''

if __name__ == '__main__':
    print('WSGIServer on 8080')
    application = app.wsgifunc()
    pywsgi.WSGIServer(('', 8080), application).serve_forever()

Fixing the static directory problem

Unlike webpy's built-in server, the WSGIServer does not automatically serve static files. Fortunately, there is a fix for this:

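urls = [
    '/static/(.*)', 'Static',
    '/', 'Index',
    '/send', 'SendMsg',
    '/get', 'GetMsg'
]

class Static:
    def GET(self, file):
        # Refuse paths that try to climb out of the static directory
        if '..' in file:
            raise web.notfound()
        try:
            # Binary mode keeps images intact; the with block closes the file
            with open('static/'+file, 'rb') as f:
                return f.read()
        except IOError:
            raise web.notfound()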

Going on from here

If you get lost in any of the steps above, do visit the code repository. It might be helpful to have a working sample that you can follow. If you run into any trouble, drop me an email or leave a comment, and I will help to the best of my abilities.

BitBucket Repository

Explanation Bag of Words (BoW) – Natural Language Processing

Introduction

Bag of Words (BoW) is a model used in natural language processing. One aim of BoW is to categorize documents. The idea is to analyse and classify different “bags of words” (the corpus). Then, by matching against the different categories, we identify which “bag” a certain block of text (the test data) comes from.

Putting into context

One excellent way to explain this is to put the model into context. One classic use of BoW is spam filtering. Through the use of the BoW model, the system is trained to differentiate between spam and ham (actual messages). To extend the metaphor, we are trying to guess which bag the document comes from: the “bag of spam” or the “bag of ham”.

Note: I will not be explaining the logic behind how the spam filter works (though I might do it in a different post). I am just giving the example so you can understand the rationale of categorizing different text.

How BoW works

Forming the vector

Take for example 2 text samples: The quick brown fox jumps over the lazy dog and Never jump over the lazy dog quickly.

The words in the corpus (the text samples) then form a dictionary that maps each word to an index:

{
    "brown": 0,
    "dog": 1,
    "fox": 2,
    "jump": 3,
    "jumps": 4,
    "lazy": 5,
    "never": 6,
    "over": 7,
    "quick": 8,
    "quickly": 9,
    "the": 10,
}

Vectors are then formed to represent the count of each word. In this case, each text sample (i.e. each sentence) generates an 11-element vector, one element per dictionary entry, like so:

[1,1,1,0,1,1,0,1,1,0,2]
[0,1,0,1,0,1,1,1,0,1,1]

Each element represents the number of occurrences of the corresponding word in the text sample. So, in the first sentence, there is 1 count for “brown”, 1 count for “dog”, 1 count for “fox” and so on (represented by the first array), whereas the second vector shows that there is 0 count for “brown”, 1 count for “dog” and 0 count for “fox”, and so forth.
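
For those who prefer code, the steps above can be sketched in a few lines of Python. The function names (tokenize, build_vocabulary, vectorize) are my own, and the tokenizer is deliberately naive: it lowercases the text and keeps runs of letters, which is enough for the two sample sentences.

```python
import re

def tokenize(text):
    # Lowercase the text and keep only runs of letters
    return re.findall(r'[a-z]+', text.lower())

def build_vocabulary(docs):
    # Sort the unique words so each one gets a stable index
    words = sorted({w for doc in docs for w in tokenize(doc)})
    return {w: i for i, w in enumerate(words)}

def vectorize(doc, vocab):
    # Count how often each dictionary word appears in the document
    vec = [0] * len(vocab)
    for w in tokenize(doc):
        vec[vocab[w]] += 1
    return vec

docs = [
    "The quick brown fox jumps over the lazy dog",
    "Never jump over the lazy dog quickly",
]
vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]
```

Here vocab reproduces the dictionary shown earlier and vectors holds the two count vectors.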

Weighting the terms: tf-idf

As in most languages, some words tend to appear more often than others. Words such as “is”, “the” and “a” are very common in English. If we consider only their raw frequency, we might not be able to effectively differentiate between different classes of documents.

A common fix for this is a statistical method known as tf-idf, which makes the data reflect the content of each text sample better. Tf-idf, short for term frequency-inverse document frequency, takes into account 2 values: term frequency (tf) and inverse document frequency (idf).

There are a few different ways to determine these values. One common way to determine the term frequency is to take the raw frequency of a term divided by the maximum frequency of any term in the document, scaled so that a term that appears in the document scores between 0.5 and 1, like so:

0.5 + 0.5 * freq(term in document)/max(freq(all word in document))

One common way to determine the inverse document frequency is to take the log of the inverse of the proportion of documents containing the term, like so:

log( document_count/len(documents containing term) )

And by multiplying both values, we get the magic value, term frequency-inverse document frequency (tf-idf), which reduces the value of common words that are used across different documents.

An additional step after obtaining the tf-idf is to normalize the vector, which makes it less troublesome to apply different operators to.
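
As a rough sketch, the two formulas and the normalization step could be put together like this. It works on the count vectors from the example above. Two choices here are my own assumptions: a word that does not appear in a document gets a weight of 0 instead of the 0.5 the tf formula would give it, and the normalization used is the Euclidean (L2) norm.

```python
import math

def tf(count, max_count):
    # 0.5 + 0.5 * freq(term in document) / max(freq(all words in document))
    return 0.5 + 0.5 * count / max_count

def idf(term_index, vectors):
    # log(document_count / number of documents containing the term)
    containing = sum(1 for v in vectors if v[term_index] > 0)
    return math.log(float(len(vectors)) / containing)

def tfidf(vectors):
    result = []
    for v in vectors:
        max_count = max(v)
        # Assumption: words absent from the document are weighted 0
        weights = [tf(c, max_count) * idf(i, vectors) if c else 0.0
                   for i, c in enumerate(v)]
        # Normalize to unit length (L2 norm), guarding against all-zero vectors
        norm = math.sqrt(sum(w * w for w in weights))
        result.append([w / norm if norm else 0.0 for w in weights])
    return result

counts = [
    [1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 2],
    [0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1],
]
weighted = tfidf(counts)
```

Words that appear in both sentences, such as “the” and “dog”, have an idf of log(2/2) = 0 and drop out entirely, which is exactly the effect we are after.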

Taking it further: Feature hashing / Hashing trick

The basic concept explained here may work for small samples of text where the dictionary is rather small. But in an actual training process, where the texts contain tens of thousands of unique words, we need some way to represent the documents more efficiently.

By hashing the terms (i.e. the individual words), we obtain an index which corresponds to an element in the generated vector. So instead of having to store the words in a dictionary and keep a vector with tens of thousands of elements for each document, we have a vector of size N instead (N determined by whoever makes the decision).

To account for hash collisions, an additional hash function is used as a sign function (it returns 1 or 0) that decides whether a term is added to or subtracted from its element. This helps ensure that when different entries collide, they tend to cancel each other out, giving us an expected value of 0 for each element (more on that here). With that, the final vectors of the documents are used to classify different types of documents (e.g. spam vs ham).

For those who see better through code, this is what the function might look like:

function hashing_vectorizer(features: array of strings, N: integer):
    x = new Vector[N]
    for f in features:
        h = hash(f)
        ### sign operator
        if hash2(f):
            x[h%N] += 1
        else:
            x[h%N] -= 1
    return x
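
The pseudocode above translates almost directly into Python. This is only a sketch under a few assumptions of mine: both hash functions are derived from MD5 with different prefixes (Python's built-in hash() for strings changes between runs, so it is not suitable here), and _digest is a hypothetical helper name.

```python
import hashlib

def _digest(term, salt):
    # Stable across runs, unlike Python's built-in hash() for strings
    return int(hashlib.md5((salt + term).encode('utf-8')).hexdigest(), 16)

def hashing_vectorizer(features, n):
    x = [0] * n
    for f in features:
        # First hash picks the element, second hash picks the sign
        index = _digest(f, 'index:') % n
        sign = 1 if _digest(f, 'sign:') % 2 else -1
        x[index] += sign
    return x

words = "the quick brown fox jumps over the lazy dog".split()
vec = hashing_vectorizer(words, 8)
```

Whatever the size of the vocabulary, vec always has exactly 8 elements, and the same word always lands on the same element with the same sign.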

Conclusion

After the BoW is completed, what we obtain is a vector for each individual document. These vectors are then passed through different machine learning algorithms to determine the features that separate the different documents.

That is where the actual “machine learning” comes in. BoW is basically just a tool to convert a text document into a vector that describes its features and content.

Note: Do drop me a note if there are any unclear portions. Trust me, I will get better at this.

Readings:

Explanation Posts – “All you need to know about” posts

Soon I will be starting a series of posts which provide simple and concise explanations for the various concepts that I have found pretty hard to grasp. Many of these ideas are hard to understand, not because they are difficult, but because they consist of different portions of theory that are often explained separately, making it hard to understand the whole concept in one sitting.

In these explanation posts, I will be explaining some of the many different concepts that I have learnt and understood. I will be explaining the concepts as simply as possible. Hopefully, having a structured approach to these concepts would help form a more concrete understanding.

Tutorial – Using sessions in web.py

Intro

Sessions in web.py are like server-side cookies. Cookies are objects used to store simple information, either for identification purposes or to keep track of a user’s preferences. (If you already knew that, give yourself a cookie -> yep, that brown baked confection.) Sessions are used to identify different users on the website. They make use of client-side cookies and the IP address for identification, among other things.

When you try out other tutorials out there, you might find yourself stuck at some parts, getting random errors that make little sense to you. I have been through that process and I am here to share how it is actually done.

For the tutorial, I will be making a simple counter app that makes use of sessions to track the number of times a page is visited. There is an explained version for those who are not too used to the webpy framework and could use some help with the code.

app.py (short and sweet version):

import web

urls = [
    '/', 'Index',
    '/k', 'Kill'
]
app = web.application(urls, globals())

# Store the session in web.config so a module reload reuses it
if not web.config.get('session'):
    session = web.session.Session(app,
        web.session.DiskStore('./sessions'),
        initializer={'count': 0}
    )
    web.config.session = session
else:
    session = web.config.session

class Index:
    def GET(self):
        session.count += 1
        # Handlers should return a string, not an int
        return str(session.count)

class Kill:
    def GET(self):
        # End the current session, then go back to the counter
        session.kill()
        raise web.seeother('/')

if __name__ == '__main__':
    app.run()

app.py (explained):

Readings

Web.py Application – User Authentication system

EDIT: Visit the tutorial here

Github Repository

After working on so many web designs, I have decided to hone my server side development skills for a change.

One of the most common web application systems is a user authentication system. Many applications, regardless of their purpose, require one.

Using webpy, I developed a simple user authentication system, together with a few error handling pages and functions, which hopefully, can be altered to fit my future projects.

This project, though simple, took me more time than I had planned. I spent quite some time looking into how webpy sessions work, as there are a few troublesome portions to using sessions with webpy (tutorial here). Still, I managed to get everything working fine, and even completed a simple design for the interface.

Bitbucket Repository

Design Mockup – InsuranceSuite from 99designs

This design that I have chosen to implement is a content-based design. This differs significantly from other designs I have implemented, which focused more on graphics and aesthetics.

For the structure, I used a new CSS3 method for the layout. The flexbox layout greatly simplifies the structure of the design, reducing the dependence on hand-written CSS positioning for the content layout. This could be very helpful, especially when creating mobile web applications.

That said, the flexbox structure certainly helps in content-based designs, but for more graphical designs it wouldn’t really make much difference. Therefore, future usage will depend on the types of designs that I choose to implement.

The content of the mockup is quite scattered. I tried to fit as many design elements into the mockup as possible. Hence, I had to resort to using hidden links to activate different views (the sidebar links toggle the form overlay), mashing up everything to give a complete feel of the design.

Original design:
Insurance Suite by JonSerenity