Pipe: Infix syntax for Python | {Dev Tricks}
april 2011 by rygwdn
Pipe is a Python module enabling infix syntax in Python.
For those asking “Why ?” let’s take an example :
Compare the readability of the classical prefix syntax :
sum(select(where(take_while(fib(), lambda x: x < 1000000) lambda x: x % 2), lambda x: x * x))
And the infix syntax :
fib() | take_while(lambda x: x < 1000000) \
| where(lambda x: x % 2) \
| select(lambda x: x * x) \
| sum()
Isn’t the infix syntax more readable ?
The base class of Pipe is kept simple (7 lines of python) and is usable as a decorator permitting you to create new ‘pipeable’ functions easily. The module provides ~30 prepared pipes functions like ‘where’, ‘group_by’, ‘sort’, ‘take_while’ … A pipeable function takes an iterable (tuple, list, generator) and yields to be itself an iterator, so pipeable function can be piped together.
Let me introduce the basic usage of the Pipe module, then I’ll write some bits on how to build new ones :
To start, get it from PyPI http://pypi.python.org/pypi/pipe/1.3 and install it, open a REPL, import pipe, and play :
Python 2.6.6 (r266:84292, Dec 26 2010, 22:31:48)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from pipe import *
>>> [1, 2, 3, 4, 5] | add
15
>>> [5, 4, 3, 2, 1] | sort
[1, 2, 3, 4, 5]
Until here it’s easy, to know more about available pipes, just read the help(pipe) in the REPL, all are explained with an example as a doctest
Now as we know that pipeable functions use iterables, we can try to pipe together two or more pipeables :
>>> [1, 2, 3, 4, 5] | where(lambda x: x % 2) | concat
'1, 3, 5'
>>> [1, 2, 3, 4, 5] | where(lambda x: x % 2) | tail(2) | concat
'3, 5'
>>> [1, 2, 3, 4, 5] | where(lambda x: x % 2) | tail(2) | select(lambda x: x * x) | concat
'9, 25'
>>> [1, 2, 3, 4, 5] | where(lambda x: x % 2) | tail(2) | select(lambda x: x * x) | add
34
Now, a bit about lazyness, as Pipe use iterables, the evaluation of a whole Pipe is lazy, so we can play with infinite generators like this one :
>>> def fib():
... x = 1
... yield 1
... y = 1
... yield 1
... while True:
... x = x + y
... yield x
... y = x + y
... yield y
Now we can do every kind of stuff into the fibonacci sequence, like solving the 2nd problem of http://projecteuler.net in a readable one liner :
Find the sum of all the even-valued terms in Fibonacci which do not exceed four million.
>>> euler2 = fib() | where(lambda x: x % 2 == 0) | take_while(lambda x: x < 4000000) | add
>>> assert euler2 == 4613732
Isn it pretty ?
Let now see how to create new pipeable functions using the @pipe decorator :
You want to create a function that yields the first x elements from its input
You want its usage to be (1, 2, 3, 4, 5) | take(2) to take the fist 2 elements.
I know that you are thinking about a basic implementation like :
def take(iterable, qte):
for item in iterable:
if qte > 0:
qte -= 1
yield item
else:
return
Right ? You take an iterable, a qantity, and while the quantity is not reached, you just have to yield ?
OK, just add @pipe to you take function and it’s pipeable :-)
code
library
programming
python
For those asking “Why ?” let’s take an example :
Compare the readability of the classical prefix syntax :
sum(select(where(take_while(fib(), lambda x: x < 1000000) lambda x: x % 2), lambda x: x * x))
And the infix syntax :
fib() | take_while(lambda x: x < 1000000) \
| where(lambda x: x % 2) \
| select(lambda x: x * x) \
| sum()
Isn’t the infix syntax more readable ?
The base class of Pipe is kept simple (7 lines of python) and is usable as a decorator permitting you to create new ‘pipeable’ functions easily. The module provides ~30 prepared pipes functions like ‘where’, ‘group_by’, ‘sort’, ‘take_while’ … A pipeable function takes an iterable (tuple, list, generator) and yields to be itself an iterator, so pipeable function can be piped together.
Let me introduce the basic usage of the Pipe module, then I’ll write some bits on how to build new ones :
To start, get it from PyPI http://pypi.python.org/pypi/pipe/1.3 and install it, open a REPL, import pipe, and play :
Python 2.6.6 (r266:84292, Dec 26 2010, 22:31:48)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from pipe import *
>>> [1, 2, 3, 4, 5] | add
15
>>> [5, 4, 3, 2, 1] | sort
[1, 2, 3, 4, 5]
Until here it’s easy, to know more about available pipes, just read the help(pipe) in the REPL, all are explained with an example as a doctest
Now as we know that pipeable functions use iterables, we can try to pipe together two or more pipeables :
>>> [1, 2, 3, 4, 5] | where(lambda x: x % 2) | concat
'1, 3, 5'
>>> [1, 2, 3, 4, 5] | where(lambda x: x % 2) | tail(2) | concat
'3, 5'
>>> [1, 2, 3, 4, 5] | where(lambda x: x % 2) | tail(2) | select(lambda x: x * x) | concat
'9, 25'
>>> [1, 2, 3, 4, 5] | where(lambda x: x % 2) | tail(2) | select(lambda x: x * x) | add
34
Now, a bit about lazyness, as Pipe use iterables, the evaluation of a whole Pipe is lazy, so we can play with infinite generators like this one :
>>> def fib():
... x = 1
... yield 1
... y = 1
... yield 1
... while True:
... x = x + y
... yield x
... y = x + y
... yield y
Now we can do every kind of stuff into the fibonacci sequence, like solving the 2nd problem of http://projecteuler.net in a readable one liner :
Find the sum of all the even-valued terms in Fibonacci which do not exceed four million.
>>> euler2 = fib() | where(lambda x: x % 2 == 0) | take_while(lambda x: x < 4000000) | add
>>> assert euler2 == 4613732
Isn it pretty ?
Let now see how to create new pipeable functions using the @pipe decorator :
You want to create a function that yields the first x elements from its input
You want its usage to be (1, 2, 3, 4, 5) | take(2) to take the fist 2 elements.
I know that you are thinking about a basic implementation like :
def take(iterable, qte):
for item in iterable:
if qte > 0:
qte -= 1
yield item
else:
return
Right ? You take an iterable, a qantity, and while the quantity is not reached, you just have to yield ?
OK, just add @pipe to you take function and it’s pipeable :-)
april 2011 by rygwdn
Interesting Things, Largely Python and Twisted Related: Twisted Conch in 60 Seconds: Introduction to an SSH server
march 2011 by rygwdn
pGreetings once more, readers.nbsp; A year ago I introduced you to a href=http://twistedmatrix.com/documents/current/web/howto/web-in-60/index.htmlthe simplicity of web programming with Twisted/a.nbsp; Since then, I've been eager to try to duplicate the success of the series for another topic.nbsp; Welcome to Twisted strongConch/strong in 60 Seconds./ppa href=http://twistedmatrix.com/trac/wiki/TwistedConchTwisted Conch/a implements the SSH protocol for both clients and servers.nbsp; It also implements client and server applications.nbsp; It is also a library for writing custom SSH client and server software.nbsp; This series will focus on using Twisted Conch as an SSH library, and the first articles will cover writing custom SSH servers, with clients covered later./ppIn this first installment, the goal is to set up an SSH server which will accept SSH client connections butbr /which cannot actually authenticate any users (so no one will be able to log in)./ppBefore getting into any code, the first thing we need for an SSH server is a server key. nbsp;This can be generated using ssh-keygen from OpenSSH or with Conch's own key generation tool, ckeygen. nbsp;Since ample examples of ssh-keygen can be found elsewhere, here's an example of generating a key with ckeygen:/pblockquotepspanexarkun@boson:/tmp$ strongckeygen -t rsa -f id_rsa/strongbr /Generating public/private rsa key pair.br /Enter passphrase (empty for no passphrase):br /Enter same passphrase again:br /Your identification has been saved in id_rsabr /Your public key has been saved in id_rsa.pubbr /The key fingerprint is:br /d5:1d:ba:47:a4:75:1a:fe:d4:ba:46:2e:44:67:0d:8bbr /exarkun@boson:/tmp$ ls -l id_rsa id_rsa.pubbr /-rw------- 1 1000 1000 886 2011-03-23 22:59 strongid_rsa/strongbr /-rwxr-xr-x 1 1000 1000 226 2011-03-23 22:59 strongid_rsa.pub/strongbr /exarkun@boson:/tmp$nbsp;/span/p/blockquotedivHere's where the actual Conch code begins. nbsp;First, some imports:/divpcode/code/pprefrom twisted.conch.ssh.factory import SSHFactory/prep/pbr /spana href=http://twistedmatrix.com/documents/10.2.0/api/twisted.conch.ssh.factory.SSHFactory.htmlSSHFactory/a/spanspannbsp;is an /spanspana href=http://twistedmatrix.com/documents/10.2.0/api/twisted.internet.interfaces.IProtocolFactory.htmlIProtocolFactory/a/spanspannbsp;which is used to create protocol instances to handle new connections that arrive at a listening port. nbsp;/spanspanSSHFactory/spanspannbsp;in particular creates protocol instances that can carry on the server side of an SSH connection. nbsp;We need an instance to hook up to a listening port:/spanbr /code/codeprefactory = SSHFactory()/prep/pbr /Nothing very exciting going on there. nbsp;spanSSHFactory/spannbsp;doesn't take any initializer arguments. nbsp;It does have some useful attributes you can set, though. nbsp;Two important attributes are spanpublicKeys/span and spanprivateKeys/spanspan. nbsp;These define what key the server uses to identify itself to clients. nbsp;A server can have multiple keys, so these attributes are bound to dictionaries. nbsp;In this example we'll just give the server one key though. nbsp;For this, we need to have a couplenbsp;/spanspana href=http://twistedmatrix.com/documents/10.2.0/api/twisted.conch.ssh.keys.Key.htmlKey/a/spanspannbsp;objects:/spanbr /code/codeprefrom twisted.conch.ssh.keys import Keybr /br /with open('id_rsa') as privateBlobFile:br / privateBlob = privateBlobFile.read()br / privateKey = Key.fromString(data=privateBlob)br /br /with open('id_rsa.pub') as publicBlobFile:br / publicBlob = publicBlobFile.read()br / publicKey = Key.fromString(data=publicBlob)br //prep/pdivspanThe /spanspanKey.fromString/spanspannbsp;method can parse several formats, including the standard format keys are kept in by OpenSSH (which is also the format ckeygen emits)./span/divdivNow we can set up the factory to know about these keys:/divdivcode/code/divprefactory.privateKeys = {'ssh-rsa': privateKey}br /factory.publicKeys = {'ssh-rsa': publicKey}br //prep/pdivClients connecting to this server will get to see this public key and then use it to challenge the server to prove it also has the private key by signing some random data correctly./divdivspanOne more attribute needs to be set on the factory. nbsp;/spanspanprivateKeys/spanspan and /spanspanpublicKeys/spanspannbsp;let clients identify the server. nbsp;Now the server needs some way to identify clients. nbsp;This is done using a /spanspana href=http://twistedmatrix.com/documents/10.2.0/api/twisted.cred.portal.Portal.htmlPortal/a/spanspannbsp;from /spanspantwisted.cred/spanspan. nbsp;However, I said this server wouldn't be able to authenticate users, so this example will only set a placeholder value here which will fail all authentication attempts:/span/divprebr /codefrom twisted.cred.portal import Portalbr /factory.portal = Portal(None)/codebr //prepLater we'll revisit this attribute so that users can actually log in to the server./ppThe example is almost done now. nbsp;The only thing left to do is listen on a port with this factory and run the reactor. nbsp;For this we'll need to import the reactor. nbsp;We have several options for how to listen on a port, but for this example I'll use the classic spanreactor.listenTCP/spanspan:/spanbr /code/code/pprefrom twisted.internet import reactorbr /reactor.listenTCP(2022, factory)/prep/pbr /spanThis listens on TCP port 2022. nbsp;Whenever a connection arrives, /spanspanfactory/spanspannbsp;will be used to create a new protocol instance to handle it. nbsp;This factory is going to create a href=http://twistedmatrix.com/documents/10.2.0/api/twisted.conch.ssh.transport.SSHServerTransport.htmlSSHServerTransport/a instances, but don't worry about that for now./spanpspanSince the reactor is the mainloop that drives any Twisted-based application, nothing actually happens until we run it:/spanbr /code/code/pprereactor.run()/prep/pbr /With that, you now have an SSH server. nbsp;It's functional enough to prove its identity to clients that connect to it and even accept authentication attempts from them, but it will always reject their attempts. nbsp;So it's not the strongmost/strongnbsp;useful SSH server (but add a little logging and maybe you have a simple SSH honey pot!), but if you tune in next time, I'll demonstrate how it can be extended to support some more useful authentication options. nbsp;Here's a full code listing for this example:br /code/codeprefrom twisted.cred.portal import Portalbr /from twisted.conch.ssh.factory import SSHFactorybr /from twisted.internet import reactorbr /from twisted.conch.ssh.keys import Keybr /br /with open('id_rsa') as privateBlobFile:br / privateBlob = privateBlobFile.read()br / privateKey = Key.fromString(data=privateBlob)br /br /with open('id_rsa.pub') as publicBlobFile:br / publicBlob = publicBlobFile.read()br / publicKey = Key.fromString(data=publicBlob)br /br /factory = SSHFactory()br /factory.privateKeys = {'ssh-rsa': privateKey}br /factory.publicKeys = {'ssh-rsa': publicKey}br /factory.portal = Portal(None)br /br /reactor.listenTCP(2022, factory)br /reactor.run()br //prep/pdiv class=blogger-post-footerimg width=1 height=1 src=https://blogger.googleusercontent.com/tracker/4350363846291077818-3235296451388020829?l=as.ynchrono.us alt= //div
python
twisted
ssh
march 2011 by rygwdn
dcramer/django-devserver - GitHub
march 2011 by rygwdn
A drop in replacement for Django's built-in runserver command. Features include:
An extendable interface for handling things such as real-time logging.
Integration with the werkzeug interactive debugger.
An improved runserver allowing you to process requests simultaneously.
devserver screenshot
debugging
development
django
localhost
python
An extendable interface for handling things such as real-time logging.
Integration with the werkzeug interactive debugger.
An improved runserver allowing you to process requests simultaneously.
devserver screenshot
march 2011 by rygwdn
Kenneth Reitz – Requests: Python HTTP Module (That Doesn’t Suck)
february 2011 by rygwdn
I’m happy to announce the release of my latest Python module: Requests.
One of the most frustrating experiences I have to face, as a Python developer, is the pain of making HTTP requests. The available APIs are insane. I have to look up everything that I want to do and constantly refer to the documentation. I skip writing fun tools from time to time, simply because I don’t want to go through the pain of making a simple PUT request with Basic Authentication.
Things shouldn’t be this way. Not in Python.
Enter Requests
import requests
Let’s check out the awesome new Convore API.
>>> r = requests.get('https://convore.com/api/account/verify.json')
>>> r.status_code
401
Oops, we need to be authenticated. Let’s add our credentials and try again.
>>> conv_auth = requests.AuthObject('requesttest', 'requesttest')
>>> r = requests.get('https://convore.com/api/account/verify.json', auth=conv_auth)
>>> r.status_code
200
It worked! Let’s poke around.
>>> r.headers['content-type']
'application/json'
>>> r.content
'{"username": "requeststest", "url": "/users/requeststest/", "id": "9408", "img": "censored-long-url"}'
Requests is capable of much more than this, but the API stays this simple throughout.
Features
Requests strives to be as simple yet powerful as possible. It’s designed to fit the 90% use case. You might not be able to add proxies, client caching, and raw socket access, but you will have everything you need most of the time:
Extremely simple GET, HEAD, POST, PUT, DELETE requests
Receive simple response header dictionary, content, and status code
Module-level HTTP Auth registry
Eventlet/Greenlet support
The API is quite simple:
5 main functions: get(), head(), post(), put(), and delete()
Attach params/data
Add HTTP Basic Authentication with a parameter
Add a dictionary of HTTP headers with a parameter
Add a CookieJar with a parameter
Attach File uploads with a parameter
Installation
To install Requests, simply:
$ pip install requests
Or, if you must:
$ easy_install requests
But, you really shouldn’t do that.
Roadmap
Requests currently supports CPython 2.5, 2.6, 2.7, and PyPy-c v1.4.
In the future, Python 3.x will be supported. There are no intentions to support Python 2.4 and older.
If you’d like to help or have a feature suggestion, fork me!
[Source on GitHub]
[Project on PyPi]
python
One of the most frustrating experiences I have to face, as a Python developer, is the pain of making HTTP requests. The available APIs are insane. I have to look up everything that I want to do and constantly refer to the documentation. I skip writing fun tools from time to time, simply because I don’t want to go through the pain of making a simple PUT request with Basic Authentication.
Things shouldn’t be this way. Not in Python.
Enter Requests
import requests
Let’s check out the awesome new Convore API.
>>> r = requests.get('https://convore.com/api/account/verify.json')
>>> r.status_code
401
Oops, we need to be authenticated. Let’s add our credentials and try again.
>>> conv_auth = requests.AuthObject('requesttest', 'requesttest')
>>> r = requests.get('https://convore.com/api/account/verify.json', auth=conv_auth)
>>> r.status_code
200
It worked! Let’s poke around.
>>> r.headers['content-type']
'application/json'
>>> r.content
'{"username": "requeststest", "url": "/users/requeststest/", "id": "9408", "img": "censored-long-url"}'
Requests is capable of much more than this, but the API stays this simple throughout.
Features
Requests strives to be as simple yet powerful as possible. It’s designed to fit the 90% use case. You might not be able to add proxies, client caching, and raw socket access, but you will have everything you need most of the time:
Extremely simple GET, HEAD, POST, PUT, DELETE requests
Receive simple response header dictionary, content, and status code
Module-level HTTP Auth registry
Eventlet/Greenlet support
The API is quite simple:
5 main functions: get(), head(), post(), put(), and delete()
Attach params/data
Add HTTP Basic Authentication with a parameter
Add a dictionary of HTTP headers with a parameter
Add a CookieJar with a parameter
Attach File uploads with a parameter
Installation
To install Requests, simply:
$ pip install requests
Or, if you must:
$ easy_install requests
But, you really shouldn’t do that.
Roadmap
Requests currently supports CPython 2.5, 2.6, 2.7, and PyPy-c v1.4.
In the future, Python 3.x will be supported. There are no intentions to support Python 2.4 and older.
If you’d like to help or have a feature suggestion, fork me!
[Source on GitHub]
[Project on PyPi]
february 2011 by rygwdn
Scaling Python Servers with Worker Processes and Socket Duplication | metachris.org
february 2011 by rygwdn
Developing servers that scale is usually quite tricky, even more so with Python and the absence of worker threads which can run on multiple cpu cores [1]. A possible solution are worker processes that duplicate the client’s socket, a technique that allows the workers to processes requests and send responses directly to the client socket. This approach is particularly useful for long lasting connections with more than one request per session.
You can skip the introduction to jump directly to the Python section. [Update: With this approach I was not able to cleanly close all sockets. Be sure to check with lsof.]
Basics: TCP Servers
A tcp server binds to a specific local port and starts listening for incoming connections. When a client connects to this port, the server accepts the connection and waits for incoming data. Data usually arrives in chunks, and the server tries to find the end of a request by looking for delimiter characters or by a specified length (which might be indicated in the first few bytes as in protocol buffers).
Often one request contains a number of values, which may have to be deserialized with a protocol such as json, protocol buffers, avro, thrift or any of the other established or self-invented serialization protocols. After deserializing the incoming bytestream, it can be processed.
Finally the server may or may not respond to the client’s request, and/or close to socket connection at any time.
c10k
Now, what happens if 10,000 clients try to connect to your server concurrently and start sending requests and, in particular, how will it impact the response time? The C10K Problem is a classic resource which discusses this exact situation and provides a few general approaches, which boil down to:
One thread, one client
One thread, many clients
Build the server code into the kernel
We usually don’t want to do the latter, and it’s a general advise to avoid running thousands of concurrent threads. Therefore the question becomes how to handle many clients within one thread.
Asynchronous Servers
For one thread to manage multiple clients, the sockets have to be handled in an asynchronous fashion. That is, the server provides the operating system with a list of file descriptors, and receives a notification as soon as any socket has an event (read, write, error) ready to process. Operating system calls that provide this functionality include select, poll, epoll and kqueue. This approach makes it possible to develop scalable single-threaded servers, such as Facebook’s tornado webserver (free software, written in Python).
The critical point is processing time per request. If one request takes 1 ms to process and send a response, a single threaded server will have a limit at 1,000 requests per second.
Distributing the Load
There are various approaches for distributing incoming connections in order to reach a higher number of concurrently processed requests.
Load balancers distribute incoming requests across servers, usually proxying all traffic in both directions.
Reverse proxies allow you to run multiple server process on different local ports, and distribute incoming connections across them. This works very well in particular for short lived connections such as HTTP requests. Well known reverse proxies include Nginx, Varnish and Squid. (Wikipedia).
Socket accept preforking is a technique that allows multiple processes to listen on the same port and, in turn, pick up new incoming connections and handle the client sockets independently. This works by having one process open the server socket and then forking itself, which copies all existing file descriptors for the children to use.
man fork: “… The child inherits copies of the parent’s set of open file descriptors.”
Socket accept preforking works very well; popular projects that successfully employ this approach include GUnicorn, Tornado and Apache.
Worker threads process requests in the background while the socket thread can get back to waiting for events from the operating system. They usually listen on a queue from the main thread and receive the client socket’s file descriptor or also the incoming bytestream, process it and send the response back to the client. Developers need to be very careful with locking issues when accessing shared state/variables.
Naturally, worker threads are one of the best explored ways of distributing the work and having the main thread concentrate on the socket events.
Python
Python restricts multiple threads to one Python interpreter instance, thereby forcing multiple threads to share a single cpu. Context switches between the threads take place at latest after 100 Python instructions. Inside Python the controlling mechanism is called the Global Interpreter Lock (GIL) [1].
In a server that means as soon as one worker thread uses the cpu, it steals processing time from the socket handling thread, which makes a traditional worker thread architecture unfeasible. But there is an alternative: worker processes.
[Update: With this approach I was not able to cleanly close all open sockets. Be sure to check with lsof.]
Duplicating Sockets across Processes
Given the ability to pickle a socket, worker processes can do the same thing as worker threads: receive a client’s socket file descriptor and possibly the bytestream, process it, and send a response without requiring a callback to the main socket handler process.
Sockets can be pickled and reconstructed with an almost hidden, undocumented feature of Python’s multiprocessing module: multiprocessing.reduction.
multiprocessing.reduction
I discovered this module recently through an inline comment in one of the examples in the multiprocessing docs, and wanted to find out more:
$ python
Python 2.6.6 (r266:84292, Sep 15 2010, 16:22:56)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import multiprocessing.reduction
>>> help(multiprocessing.reduction)
Help on module multiprocessing.reduction in multiprocessing:
NAME
multiprocessing.reduction
FILE
/usr/lib/python2.6/multiprocessing/reduction.py
MODULE DOCS
http://docs.python.org/library/multiprocessing.reduction
DESCRIPTION
# Module to allow connection and socket objects to be transferred
# between processes
#
# multiprocessing/reduction.py
#
# Copyright (c) 2006-2008, R Oudkerk --- see COPYING.txt
#
Not much information, but at least we know where to look: multiprocessing/reduction.py
And there it is, starting at line 122:
#
# Functions to be used for pickling/unpickling objects with handles
#
def reduce_handle(handle):
if Popen.thread_is_spawning():
return (None, Popen.duplicate_for_child(handle), True)
dup_handle = duplicate(handle)
_cache.add(dup_handle)
sub_debug('reducing handle %d', handle)
return (_get_listener().address, dup_handle, False)
def rebuild_handle(pickled_data):
address, handle, inherited = pickled_data
if inherited:
return handle
sub_debug('rebuilding handle %d', handle)
conn = Client(address, authkey=current_process().authkey)
conn.send((handle, os.getpid()))
new_handle = recv_handle(conn)
conn.close()
return new_handle
Example Code
Putting it all together, this is how to use the above functions to share sockets with another process in Python:
# Main process
from multiprocessing.reduction import reduce_handle
h = reduce_handle(client_socket.fileno())
pipe_to_worker.send(h)
# Worker process
from multiprocessing.reduction import rebuild_handle
h = pipe.recv()
fd = rebuild_handle(h)
client_socket = socket.fromfd(fd, socket.AF_INET, socket.SOCK_STREAM)
client_socket.send("hello from the worker process\r\n")
References
Python, Threads and the Global Interpreter Lock (GIL)
The C10K Problem
Global Interpreter Lock
Notes
Thanks to Brian Jones for reading drafts of this post.
Recommended Reading
concurrency
python
You can skip the introduction to jump directly to the Python section. [Update: With this approach I was not able to cleanly close all sockets. Be sure to check with lsof.]
Basics: TCP Servers
A tcp server binds to a specific local port and starts listening for incoming connections. When a client connects to this port, the server accepts the connection and waits for incoming data. Data usually arrives in chunks, and the server tries to find the end of a request by looking for delimiter characters or by a specified length (which might be indicated in the first few bytes as in protocol buffers).
Often one request contains a number of values, which may have to be deserialized with a protocol such as json, protocol buffers, avro, thrift or any of the other established or self-invented serialization protocols. After deserializing the incoming bytestream, it can be processed.
Finally the server may or may not respond to the client’s request, and/or close to socket connection at any time.
c10k
Now, what happens if 10,000 clients try to connect to your server concurrently and start sending requests and, in particular, how will it impact the response time? The C10K Problem is a classic resource which discusses this exact situation and provides a few general approaches, which boil down to:
One thread, one client
One thread, many clients
Build the server code into the kernel
We usually don’t want to do the latter, and it’s a general advise to avoid running thousands of concurrent threads. Therefore the question becomes how to handle many clients within one thread.
Asynchronous Servers
For one thread to manage multiple clients, the sockets have to be handled in an asynchronous fashion. That is, the server provides the operating system with a list of file descriptors, and receives a notification as soon as any socket has an event (read, write, error) ready to process. Operating system calls that provide this functionality include select, poll, epoll and kqueue. This approach makes it possible to develop scalable single-threaded servers, such as Facebook’s tornado webserver (free software, written in Python).
The critical point is processing time per request. If one request takes 1 ms to process and send a response, a single threaded server will have a limit at 1,000 requests per second.
Distributing the Load
There are various approaches for distributing incoming connections in order to reach a higher number of concurrently processed requests.
Load balancers distribute incoming requests across servers, usually proxying all traffic in both directions.
Reverse proxies allow you to run multiple server process on different local ports, and distribute incoming connections across them. This works very well in particular for short lived connections such as HTTP requests. Well known reverse proxies include Nginx, Varnish and Squid. (Wikipedia).
Socket accept preforking is a technique that allows multiple processes to listen on the same port and, in turn, pick up new incoming connections and handle the client sockets independently. This works by having one process open the server socket and then forking itself, which copies all existing file descriptors for the children to use.
man fork: “… The child inherits copies of the parent’s set of open file descriptors.”
Socket accept preforking works very well; popular projects that successfully employ this approach include GUnicorn, Tornado and Apache.
Worker threads process requests in the background while the socket thread can get back to waiting for events from the operating system. They usually listen on a queue from the main thread and receive the client socket’s file descriptor or also the incoming bytestream, process it and send the response back to the client. Developers need to be very careful with locking issues when accessing shared state/variables.
Naturally, worker threads are one of the best explored ways of distributing the work and having the main thread concentrate on the socket events.
Python
Python restricts multiple threads to one Python interpreter instance, thereby forcing multiple threads to share a single cpu. Context switches between the threads take place at latest after 100 Python instructions. Inside Python the controlling mechanism is called the Global Interpreter Lock (GIL) [1].
In a server that means as soon as one worker thread uses the cpu, it steals processing time from the socket handling thread, which makes a traditional worker thread architecture unfeasible. But there is an alternative: worker processes.
[Update: With this approach I was not able to cleanly close all open sockets. Be sure to check with lsof.]
Duplicating Sockets across Processes
Given the ability to pickle a socket, worker processes can do the same thing as worker threads: receive a client’s socket file descriptor and possibly the bytestream, process it, and send a response without requiring a callback to the main socket handler process.
Sockets can be pickled and reconstructed with an almost hidden, undocumented feature of Python’s multiprocessing module: multiprocessing.reduction.
multiprocessing.reduction
I discovered this module recently through an inline comment in one of the examples in the multiprocessing docs, and wanted to find out more:
$ python
Python 2.6.6 (r266:84292, Sep 15 2010, 16:22:56)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import multiprocessing.reduction
>>> help(multiprocessing.reduction)
Help on module multiprocessing.reduction in multiprocessing:
NAME
multiprocessing.reduction
FILE
/usr/lib/python2.6/multiprocessing/reduction.py
MODULE DOCS
http://docs.python.org/library/multiprocessing.reduction
DESCRIPTION
# Module to allow connection and socket objects to be transferred
# between processes
#
# multiprocessing/reduction.py
#
# Copyright (c) 2006-2008, R Oudkerk --- see COPYING.txt
#
Not much information, but at least we know where to look: multiprocessing/reduction.py
And there it is, starting at line 122:
#
# Functions to be used for pickling/unpickling objects with handles
#
def reduce_handle(handle):
if Popen.thread_is_spawning():
return (None, Popen.duplicate_for_child(handle), True)
dup_handle = duplicate(handle)
_cache.add(dup_handle)
sub_debug('reducing handle %d', handle)
return (_get_listener().address, dup_handle, False)
def rebuild_handle(pickled_data):
address, handle, inherited = pickled_data
if inherited:
return handle
sub_debug('rebuilding handle %d', handle)
conn = Client(address, authkey=current_process().authkey)
conn.send((handle, os.getpid()))
new_handle = recv_handle(conn)
conn.close()
return new_handle
Example Code
Putting it all together, this is how to use the above functions to share sockets with another process in Python:
# Main process
from multiprocessing.reduction import reduce_handle
h = reduce_handle(client_socket.fileno())
pipe_to_worker.send(h)
# Worker process
from multiprocessing.reduction import rebuild_handle
h = pipe.recv()
fd = rebuild_handle(h)
client_socket = socket.fromfd(fd, socket.AF_INET, socket.SOCK_STREAM)
client_socket.send("hello from the worker process\r\n")
References
Python, Threads and the Global Interpreter Lock (GIL)
The C10K Problem
Global Interpreter Lock
Notes
Thanks to Brian Jones for reading drafts of this post.
Recommended Reading
february 2011 by rygwdn
protobuf - Project Hosting on Google Code
january 2011 by rygwdn
protocol buffer stuff
data
google
opensource
programming
python
january 2011 by rygwdn
Python Package Index : pdbpp 0.6
january 2011 by rygwdn
This module is an extension of the pdb module of the standard library. It is meant to be fully compatible with its predecessor, yet it introduces a number of new features to make your debugging experience as nice as possible.
debugging
pdb
python
january 2011 by rygwdn
Python Keyring Lib
january 2011 by rygwdn
The Python keyring lib provides a easy way to access the system keyring service from python. It can be used in any application that needs safe password storage. It supports OSX, KDE, Gnome and Windows's native password storing services. Besides this, it is shipped with kinds of Python implemented keyring for the left environments.
encryption
python
security
january 2011 by rygwdn
Practical guidelines for beautiful Python code | TurnKey Linux Blog
january 2011 by rygwdn
Every now and then Liraz and I find ourselves chatting about how much we love Python, but more so the lessons we have learned from coding, and how to apply them to create beautiful Python code. I've tried to refine some of the "lessons" into practical guidelines that I apply religiously to all new code I write, and the refactoring of old code I written.
When reading other peoples code it sometimes ties my mind into knots, and on occasions I want to pull my hair out in frustration and disgust. That's not to say I'm perfect, but hopefully these guidelines will benefit others (and indirectly help reduce my hair loss).
I couldn't possibly include everything I wanted to in one post, so this will be the first, and more will follow...
#1 - OO structure == Well defined mental concepts
Object Orientated structure should always map to well defined mental concepts in the problem domain. If you don't have a well defined mental model of the problem domain, start with that. Class Responsibility Collaboration (CRC) cards are really useful in this.
Basically you need a sketch of the architecture. What parts does your system have, what are their names, what does each part do? What parts is that part made out of? How do the parts interact with each other?
You can save quite a bit of effort if you come up with a good architecture up-front, but sometimes it may be easier to start without and figure it out a bit later after you have a bit more knowledge about the problem you are solving.
The rule is that the sooner you refine your architecture, the better. It is easy to dig yourself further and further into a complexity hole that makes restructuring very difficult later on. So do that as early as possible.
Refining the architecture is part of the "refactor mercilessly" rule.
#2 - Leverage built-in Python types
It is often a good idea to build on top of built-in Python types or at least emulate them.
The big advantage in building on top of Python's conventions is that you get to re-use Python's abstractions instead of reinventing your own. Python is famous for its minimal elegance, and the language designers take great care making small beautiful data structures. Making your code structures more Pythonic is a pretty good way to leverage the built-in elegance of the language while saving yourself quite a bit of work (inventing good abstractions is hard!).
If you've never inherited from a built-in data type, experiment with a small test case in a throw away script. This way you don't mangle your current project and you can be as experimental as you like.
Tip: Construction can be a bit trickier than a normal object if your initialization interface is different from the built-in type. In that case you'll need to override the __new__ magic method as well to call the __new__ constructor of the base class with different arguments than your __new__ is called.
For example:
class IntField(int):
def __new__(cls, val, name):
return int.__new__(cls, val)
def __init__(self, val, name):
self.name = name
self.val = val
int.__init__(self, val)
As for emulating built in types, Python provides magical method names for emulating any built-in behavior. See special method names for reference.
#3 - Use the class namespace, Luke!
Use the class namespace when defining class-level attributes and the instance namespace (which inherits from the class namespace on initialization) when defining instance level attributes.
For example, constants are always class level attributes because they are shared by all instances. On a practical level there are two reasons to do this:
Enable inheritance - you can't override instance level attributes, only class level attributes.
Readability - code is communication. Setting an attribute at the class level or instance level is making a statement about the nature of that attribute, which makes the code easier to read and understand.
#4 - staticmethod vs. classmethod vs. regular method
Static methods are simpler and more readable than class methods because they don't have access to the class attribute, so it's easier to determine its input (i.e., arguments), and output (i.e., the return value).
Class methods are simpler than regular methods because the convention is that you don't usually manipulate attributes in the class namespace the way you would manipulate attributes in the instance namespace. So when you call a class method you know its not going to be sneaking around behind you back reading or writing to any instance level attributes set by other methods. The input for a class method is therefore the arguments it receives + class level constants. Its output is the return value.
A regular method is the most complex and easy to abuse type of method. Its harder for the programmer to see what the inputs for the method are, and what its output is.
Lets consider the following case: imagine a class in which all private methods are called without arguments and do not return output values. Instead, the private methods freely read and write to any attributes in self.
The problem with such a pattern, is that all methods have now become entangled in a spaghetti like dependency structure. The instance namespace in effect behaves like a global namespace and it becomes very difficult to understand methods in terms of input and output. Furthermore, the boundaries between methods can easily become fuzzy and unclear.
You must never ever do this (I have in the passed, and I'm ashamed). If your code exhibits this pattern, go fix it, now!
Private methods have input arguments and return values for a reason. The instance namespace must only be used for instance level attributes which need to persist after a public method has been called.
Until it becomes second nature, these rules should help:
If a method doesn't need access to instance level attributes then it should be a class method, not a regular method.
If a method doesn't need access to class level attributes then it should be a static method, not a class method.
The Python Paradox
Paul Graham wrote an interesting essay entitled The Python Paradox. It's a quick read - take a look.
code
programming
python
style
tips
When reading other peoples code it sometimes ties my mind into knots, and on occasions I want to pull my hair out in frustration and disgust. That's not to say I'm perfect, but hopefully these guidelines will benefit others (and indirectly help reduce my hair loss).
I couldn't possibly include everything I wanted to in one post, so this will be the first, and more will follow...
#1 - OO structure == Well defined mental concepts
Object Orientated structure should always map to well defined mental concepts in the problem domain. If you don't have a well defined mental model of the problem domain, start with that. Class Responsibility Collaboration (CRC) cards are really useful in this.
Basically you need a sketch of the architecture. What parts does your system have, what are their names, what does each part do? What parts is that part made out of? How do the parts interact with each other?
You can save quite a bit of effort if you come up with a good architecture up-front, but sometimes it may be easier to start without and figure it out a bit later after you have a bit more knowledge about the problem you are solving.
The rule is that the sooner you refine your architecture, the better. It is easy to dig yourself further and further into a complexity hole that makes restructuring very difficult later on. So do that as early as possible.
Refining the architecture is part of the "refactor mercilessly" rule.
#2 - Leverage built-in Python types
It is often a good idea to build on top of built-in Python types or at least emulate them.
The big advantage in building on top of Python's conventions is that you get to re-use Python's abstractions instead of reinventing your own. Python is famous for its minimal elegance, and the language designers take great care making small beautiful data structures. Making your code structures more Pythonic is a pretty good way to leverage the built-in elegance of the language while saving yourself quite a bit of work (inventing good abstractions is hard!).
If you've never inherited from a built-in data type, experiment with a small test case in a throw away script. This way you don't mangle your current project and you can be as experimental as you like.
Tip: Construction can be a bit trickier than a normal object if your initialization interface is different from the built-in type. In that case you'll need to override the __new__ magic method as well to call the __new__ constructor of the base class with different arguments than your __new__ is called.
For example:
class IntField(int):
def __new__(cls, val, name):
return int.__new__(cls, val)
def __init__(self, val, name):
self.name = name
self.val = val
int.__init__(self, val)
As for emulating built in types, Python provides magical method names for emulating any built-in behavior. See special method names for reference.
#3 - Use the class namespace, Luke!
Use the class namespace when defining class-level attributes and the instance namespace (which inherits from the class namespace on initialization) when defining instance level attributes.
For example, constants are always class level attributes because they are shared by all instances. On a practical level there are two reasons to do this:
Enable inheritance - you can't override instance level attributes, only class level attributes.
Readability - code is communication. Setting an attribute at the class level or instance level is making a statement about the nature of that attribute, which makes the code easier to read and understand.
#4 - staticmethod vs. classmethod vs. regular method
Static methods are simpler and more readable than class methods because they don't have access to the class attribute, so it's easier to determine its input (i.e., arguments), and output (i.e., the return value).
Class methods are simpler than regular methods because the convention is that you don't usually manipulate attributes in the class namespace the way you would manipulate attributes in the instance namespace. So when you call a class method you know its not going to be sneaking around behind you back reading or writing to any instance level attributes set by other methods. The input for a class method is therefore the arguments it receives + class level constants. Its output is the return value.
A regular method is the most complex and easy to abuse type of method. Its harder for the programmer to see what the inputs for the method are, and what its output is.
Lets consider the following case: imagine a class in which all private methods are called without arguments and do not return output values. Instead, the private methods freely read and write to any attributes in self.
The problem with such a pattern, is that all methods have now become entangled in a spaghetti like dependency structure. The instance namespace in effect behaves like a global namespace and it becomes very difficult to understand methods in terms of input and output. Furthermore, the boundaries between methods can easily become fuzzy and unclear.
You must never ever do this (I have in the passed, and I'm ashamed). If your code exhibits this pattern, go fix it, now!
Private methods have input arguments and return values for a reason. The instance namespace must only be used for instance level attributes which need to persist after a public method has been called.
Until it becomes second nature, these rules should help:
If a method doesn't need access to instance level attributes then it should be a class method, not a regular method.
If a method doesn't need access to class level attributes then it should be a static method, not a class method.
The Python Paradox
Paul Graham wrote an interesting essay entitled The Python Paradox. It's a quick read - take a look.
january 2011 by rygwdn
Subgraph enumeration
january 2011 by rygwdn
This is part 1 of a series on subgraph enumeration.
Simple subgraph enumeration algorithm
Faster subgraph enumeration
The final code is available for download. However, bear in
mind that the code in part 2 is about 3.5 times faster.
Graph enumeration (not the point of this article)
How many molecules exist with N atoms? That has no easy answer. A
related graph theory question is "how many simple graphs exist with N
nodes" but those include chemically impossible solutions. This is a
well-studied field. If you want to go down that path, some pointers
are A000602 from the Online
Encyclopedia of Integer Sequences ("Number of n-node unrooted quartic
trees; number of n-carbon alkanes C(n)H(2n+2) ignoring
stereoisomers"), Brendan McKay's with with graph enumeration,
including canonicalization with nauty. (Downstream
implementations include N.I.C.E.
in Sage with specializations in
the canonicalization algorithm in InChI and in Indigo. A paper by
Hartke
and Radcliffe also explains the McKay algorithm.) For something
more chemistry specific, see Mick Kappler's GenSmi
project "to construct all possible compounds containing carbon,
nitrogen, and oxygen with a molecular weight between 150 and 250" and
Orion Jankowski's recent example code of simple
enumeration using Haskell.
It's a very interesting topic, but as far as I can tell it's mostly
theoretical. People conjecture about how enumeration might suggest new
drug-like candidates, but I've not heard of any success along those
lines. If you know of anything, please let me know.
Subgraph enumeration
Instead, I'm going to look at subgraph enumeration. That is, given a
structure, I'll enumerate all its unique subgraphs up to a given size,
and I'll count the number of times each unique subgraph exists.
My reason for doing this comes from my work with fingerprints. The
MACCS keys and the CACTVS substructure keys were made with human
insight into which patterns make good filters. I'm curious if I can
use clustering, a genetic algorithm, or some other means to identify a
set of substructures which are more effective at filtering. I have
access to the target data (eg, PubChem), and I suspect that knowledge
will be helpful.
Subgraph enumeration is not a new subject either, but I haven't come
across much work using it, at least not in chemistry. There are a
couple of papers by Gerta Rücker and Christoph Rücker from 2001:
On
Finding Nonisomorphic Connected Subgraphs and Distinct Molecular
Substructures
A computer program is introduced that first generates
all connected subgraphs and then uses a combination
of well-discriminating graph invariants to eliminate
duplicates.
and
Substructure,
Subgraph, and Walk Counts as Measures of the Complexity of Graphs
and Molecules
Construction of all substructures and subgraphs is
computationally demanding; therefore, two alternatives
are pointed out for the treatment of large sets of
compounds: (i) Often it will suffice to consider
counts of substructures/subgraphs up to a certain
number of edges only, information which is provided
by the program much more rapidly.
but I didn't actually read those papers. (Hmm, should walk up the hill
to the library.)
Initial subgraph enumeration algorithm
Graph algorithms can be tricky so I'm going to start with the simplest
algorithm I can think of. This is deliberately not a fast one - that
will be in my next essay. I'll develop this algorithm first then use
it to cross-check the faster but more complicated one.
I'll start with a collection of "seeds". Each seed is a graph, and
I'll grow the seeds using a bond to make a new seed, which I can use
for future growth.
seeds = an empty collection
For each atom in the molecule:
Mke a new seed with that atom and no bonds
Add the seed to the collection of seeds
While there are seeds:
Select and remove a seed from the collection
For each bond in the molecule:
If the bond is in the seed then:
Skip this bond
Else if the neither atom of the bond is in the seed then:
Skip this bond
Else if (one of the bond atoms is not in the seed and
the seed is at maximum size) then:
Skip this bond
Otherwise:
Create a new seed which is the old seed plus this new bond
Add the new seed to the collection
I think you can see that this will work. There will be duplicates
because if I start from "C#N" with the seeds "C" and "N" then I'll
grow the "C" to "C#N" and I'll grow the "N" to "N#C", but this is a
good first pass.
I know the OpenEye tools the best so I'll use OEChem to implement the
algorithm. There's nothing toolkit-specific about what I'm doing.
from openeye.oechem import *
class Seed(object):
def __init__(self, atoms, bonds):
self.atoms = atoms
self.bonds = bonds
# Add a new bond to the seed when both end atoms are already in the seed
def new_seed_with_bond(seed, bond):
#print "Add bond", bond.GetIdx()
assert bond not in seed.bonds
atoms = seed.atoms
bonds = seed.bonds.copy()
bonds.add(bond)
return Seed(atoms, bonds)
# Add a new atom to the seed, and the bond which connects it to the seed
def new_seed_with_atom_and_bond(seed, atom, bond):
#print "Add atom and bond", atom.GetIdx(), bond.GetIdx()
assert atom not in seed.atoms
assert bond not in seed.bonds
atoms = seed.atoms.copy()
bonds = seed.bonds.copy()
atoms.add(atom)
bonds.add(bond)
return Seed(atoms, bonds)
def print_seed(seed):
for atom in seed.atoms:
print atom.GetIdx(), OEGetAtomicSymbol(atom.GetAtomicNum())
for bond in seed.bonds:
if bond.IsAromatic():
bondtype = "~"
else:
bondtype = "X-=#$"[bond.GetOrder()]
print bond.GetBgnIdx(), bond.GetEndIdx(), bondtype, bond.GetIdx()
seed_count = 0
def report_seed(seed):
global seed_count
seed_count += 1
print "=== New seed (%d) ===" % seed_count
print_seed(seed)
print "==== end ==="
mol = OEGraphMol()
## Various test cases
#OEParseSmiles(mol, "SC=N")
#OEParseSmiles(mol, "SC=N.[Cl-]")
OEParseSmiles(mol, "S1C=N1")
#OEParseSmiles(mol, "c1ccccc1O")
# Maximum number of atoms to allow
k = 10
assert k >= 1, "need a bit more work to handle that case"
# Start with an empty collection
seeds = []
# Set up the initial set of seeds
for atom in mol.GetAtoms():
# Make a new seed with that atom and no bonds
seed = Seed(set([atom]), set())
report_seed(seed)
# Add the seed to the collection of seeds
seeds.append(seed)
# While there are seeds
while seeds:
# Select and remove a seed from the collection
seed = seeds.pop()
# For each bond in the molecule
for bond in mol.GetBonds():
# If the bond is in the seed then skip this bond
if bond in seed.bonds:
continue
new_atom = None
# If neither atom of the bond is in the seed then skip this bond
if bond.GetBgn() not in seed.atoms:
if bond.GetEnd() not in seed.atoms:
continue
new_atom = bond.GetBgn()
else:
if bond.GetEnd() not in seed.atoms:
new_atom = bond.GetEnd()
# If one of the bond atoms is not in the seed
# and the seed is at maximumize, then skip this bond
if new_atom is not None:
if len(seed.atoms) == k:
continue
# Create the new seed with this atom and bond
new_seed = new_seed_with_atom_and_bond(seed, new_atom, bond)
else:
# Create a new seed where the atoms are already in the
# seed but the bond is new
new_seed = new_seed_with_bond(seed, bond)
# Add the new seed to the collection
report_seed(new_seed)
seeds.append(new_seed)
I did purely manual testing to see if the algorithm worked. You can
see traces of my test cases in the different OEParseSmiles calls. I
did have bugs in my original code, like a report_seed(seed) instead of
report_seed(new_seed). My experience with these sorts of algorithms is
that the manual testing I did is good enough to catch the likely
errors. At this point I should turn the manual tests into real tests,
but I'm going to hold off on that until I have a better way to
represent the results.
The uncommented test case is "S1C=N1". I chose that because each atom
has a different symbol, one of the bonds is different than the others,
and because getting the cycle right is important. When I run the above
code I get 33 different subgraphs:
=== New seed (1) ===
0 S
==== end ===
=== New seed (2) ===
1 C
==== end ===
=== New seed (3) ===
2 N
==== end ===
=== New seed (4) ===
2 N
0 S
0 2 - 0
==== end ===
=== New seed (5) ===
2 N
1 C
1 2 = 2
==== end ===
=== New seed (6) ===
2 N
1 C
0 S
1 2 = 2
0 2 - 0
==== end ===
=== New seed (7) ===
2 N
1 C
0 S
1 2 = 2
0 1 - 1
==== end ===
=== New seed (8) ===
2 N
1 C
0 S
1 2 = 2
0 1 - 1
0 2 - 0
==== end ===
=== New seed (9) ===
2 N
1 C
0 S
1 2 = 2
0 2 - 0
0 1 - 1
==== end ===
=== New seed (10) ===
2 N
0 S
1 C
0 2 - 0
0 1 - 1
==== end ===
=== New seed (11) ===
2 N
0 S
1 C
0 2 - 0
1 2 = 2
==== end ===
=== New seed (12) ===
2 N
0 S
1 C
0 2 - 0
1 2 = 2
0 1 - 1
==== end ===
=== New seed (13) ===
2 N
0 S
1 C
0 2 - 0
0 1 - 1
1 2 = 2
==== end ===
=== New seed (14) ===
1 C
0 S
0 1 - 1
==== end ===
=== New seed (15) ===
1 C
2 N
1 2 = 2
==== end ===
=== New seed (16) ===
1 C
2 N
0 S
1 2 = 2
0 2 - 0
==== end ===
=== New seed (17) ===
1 C
2 N
0 S
1 2 = 2
0 1 - 1
==== end ===
=== New seed (18) ===
1 C
2 N
0 S
1 2 = 2
0 1 - 1
0 2 - 0
==== end ===
=== New seed (19) ===
1 C
2 N
0 S
1 2 = 2
0 2 - 0
0 1 - 1
==== end ===
=== New seed (20) ===
1 C
0 S
2 N
0 1 - 1
0 2 - 0
==== end ===
=== New seed (21) ===
1 C
0 S
2 N
0 1 - 1
1 2 = 2
==== end ===
=== New seed (22) ===
1 C
0 S
2 N
0 1 - 1
1 2 = 2
0 2 - 0
==== end ===
=== New seed (23) ===
1 C
0 S
2 N
0 1 - 1
0 2 - 0
1 2 = 2[…]
chemistry
graph
python
algorithms
Simple subgraph enumeration algorithm
Faster subgraph enumeration
The final code is available for download. However, bear in
mind that the code in part 2 is about 3.5 times faster.
Graph enumeration (not the point of this article)
How many molecules exist with N atoms? That has no easy answer. A
related graph theory question is "how many simple graphs exist with N
nodes" but those include chemically impossible solutions. This is a
well-studied field. If you want to go down that path, some pointers
are A000602 from the Online
Encyclopedia of Integer Sequences ("Number of n-node unrooted quartic
trees; number of n-carbon alkanes C(n)H(2n+2) ignoring
stereoisomers"), Brendan McKay's with with graph enumeration,
including canonicalization with nauty. (Downstream
implementations include N.I.C.E.
in Sage with specializations in
the canonicalization algorithm in InChI and in Indigo. A paper by
Hartke
and Radcliffe also explains the McKay algorithm.) For something
more chemistry specific, see Mick Kappler's GenSmi
project "to construct all possible compounds containing carbon,
nitrogen, and oxygen with a molecular weight between 150 and 250" and
Orion Jankowski's recent example code of simple
enumeration using Haskell.
It's a very interesting topic, but as far as I can tell it's mostly
theoretical. People conjecture about how enumeration might suggest new
drug-like candidates, but I've not heard of any success along those
lines. If you know of anything, please let me know.
Subgraph enumeration
Instead, I'm going to look at subgraph enumeration. That is, given a
structure, I'll enumerate all its unique subgraphs up to a given size,
and I'll count the number of times each unique subgraph exists.
My reason for doing this comes from my work with fingerprints. The
MACCS keys and the CACTVS substructure keys were made with human
insight into which patterns make good filters. I'm curious if I can
use clustering, a genetic algorithm, or some other means to identify a
set of substructures which are more effective at filtering. I have
access to the target data (eg, PubChem), and I suspect that knowledge
will be helpful.
Subgraph enumeration is not a new subject either, but I haven't come
across much work using it, at least not in chemistry. There are a
couple of papers by Gerta Rücker and Christoph Rücker from 2001:
On
Finding Nonisomorphic Connected Subgraphs and Distinct Molecular
Substructures
A computer program is introduced that first generates
all connected subgraphs and then uses a combination
of well-discriminating graph invariants to eliminate
duplicates.
and
Substructure,
Subgraph, and Walk Counts as Measures of the Complexity of Graphs
and Molecules
Construction of all substructures and subgraphs is
computationally demanding; therefore, two alternatives
are pointed out for the treatment of large sets of
compounds: (i) Often it will suffice to consider
counts of substructures/subgraphs up to a certain
number of edges only, information which is provided
by the program much more rapidly.
but I didn't actually read those papers. (Hmm, should walk up the hill
to the library.)
Initial subgraph enumeration algorithm
Graph algorithms can be tricky so I'm going to start with the simplest
algorithm I can think of. This is deliberately not a fast one - that
will be in my next essay. I'll develop this algorithm first then use
it to cross-check the faster but more complicated one.
I'll start with a collection of "seeds". Each seed is a graph, and
I'll grow the seeds using a bond to make a new seed, which I can use
for future growth.
seeds = an empty collection
For each atom in the molecule:
Mke a new seed with that atom and no bonds
Add the seed to the collection of seeds
While there are seeds:
Select and remove a seed from the collection
For each bond in the molecule:
If the bond is in the seed then:
Skip this bond
Else if the neither atom of the bond is in the seed then:
Skip this bond
Else if (one of the bond atoms is not in the seed and
the seed is at maximum size) then:
Skip this bond
Otherwise:
Create a new seed which is the old seed plus this new bond
Add the new seed to the collection
I think you can see that this will work. There will be duplicates
because if I start from "C#N" with the seeds "C" and "N" then I'll
grow the "C" to "C#N" and I'll grow the "N" to "N#C", but this is a
good first pass.
I know the OpenEye tools the best so I'll use OEChem to implement the
algorithm. There's nothing toolkit-specific about what I'm doing.
from openeye.oechem import *
class Seed(object):
def __init__(self, atoms, bonds):
self.atoms = atoms
self.bonds = bonds
# Add a new bond to the seed when both end atoms are already in the seed
def new_seed_with_bond(seed, bond):
#print "Add bond", bond.GetIdx()
assert bond not in seed.bonds
atoms = seed.atoms
bonds = seed.bonds.copy()
bonds.add(bond)
return Seed(atoms, bonds)
# Add a new atom to the seed, and the bond which connects it to the seed
def new_seed_with_atom_and_bond(seed, atom, bond):
#print "Add atom and bond", atom.GetIdx(), bond.GetIdx()
assert atom not in seed.atoms
assert bond not in seed.bonds
atoms = seed.atoms.copy()
bonds = seed.bonds.copy()
atoms.add(atom)
bonds.add(bond)
return Seed(atoms, bonds)
def print_seed(seed):
for atom in seed.atoms:
print atom.GetIdx(), OEGetAtomicSymbol(atom.GetAtomicNum())
for bond in seed.bonds:
if bond.IsAromatic():
bondtype = "~"
else:
bondtype = "X-=#$"[bond.GetOrder()]
print bond.GetBgnIdx(), bond.GetEndIdx(), bondtype, bond.GetIdx()
seed_count = 0
def report_seed(seed):
global seed_count
seed_count += 1
print "=== New seed (%d) ===" % seed_count
print_seed(seed)
print "==== end ==="
mol = OEGraphMol()
## Various test cases
#OEParseSmiles(mol, "SC=N")
#OEParseSmiles(mol, "SC=N.[Cl-]")
OEParseSmiles(mol, "S1C=N1")
#OEParseSmiles(mol, "c1ccccc1O")
# Maximum number of atoms to allow
k = 10
assert k >= 1, "need a bit more work to handle that case"
# Start with an empty collection
seeds = []
# Set up the initial set of seeds
for atom in mol.GetAtoms():
# Make a new seed with that atom and no bonds
seed = Seed(set([atom]), set())
report_seed(seed)
# Add the seed to the collection of seeds
seeds.append(seed)
# While there are seeds
while seeds:
# Select and remove a seed from the collection
seed = seeds.pop()
# For each bond in the molecule
for bond in mol.GetBonds():
# If the bond is in the seed then skip this bond
if bond in seed.bonds:
continue
new_atom = None
# If neither atom of the bond is in the seed then skip this bond
if bond.GetBgn() not in seed.atoms:
if bond.GetEnd() not in seed.atoms:
continue
new_atom = bond.GetBgn()
else:
if bond.GetEnd() not in seed.atoms:
new_atom = bond.GetEnd()
# If one of the bond atoms is not in the seed
# and the seed is at maximumize, then skip this bond
if new_atom is not None:
if len(seed.atoms) == k:
continue
# Create the new seed with this atom and bond
new_seed = new_seed_with_atom_and_bond(seed, new_atom, bond)
else:
# Create a new seed where the atoms are already in the
# seed but the bond is new
new_seed = new_seed_with_bond(seed, bond)
# Add the new seed to the collection
report_seed(new_seed)
seeds.append(new_seed)
I did purely manual testing to see if the algorithm worked. You can
see traces of my test cases in the different OEParseSmiles calls. I
did have bugs in my original code, like a report_seed(seed) instead of
report_seed(new_seed). My experience with these sorts of algorithms is
that the manual testing I did is good enough to catch the likely
errors. At this point I should turn the manual tests into real tests,
but I'm going to hold off on that until I have a better way to
represent the results.
The uncommented test case is "S1C=N1". I chose that because each atom
has a different symbol, one of the bonds is different than the others,
and because getting the cycle right is important. When I run the above
code I get 33 different subgraphs:
=== New seed (1) ===
0 S
==== end ===
=== New seed (2) ===
1 C
==== end ===
=== New seed (3) ===
2 N
==== end ===
=== New seed (4) ===
2 N
0 S
0 2 - 0
==== end ===
=== New seed (5) ===
2 N
1 C
1 2 = 2
==== end ===
=== New seed (6) ===
2 N
1 C
0 S
1 2 = 2
0 2 - 0
==== end ===
=== New seed (7) ===
2 N
1 C
0 S
1 2 = 2
0 1 - 1
==== end ===
=== New seed (8) ===
2 N
1 C
0 S
1 2 = 2
0 1 - 1
0 2 - 0
==== end ===
=== New seed (9) ===
2 N
1 C
0 S
1 2 = 2
0 2 - 0
0 1 - 1
==== end ===
=== New seed (10) ===
2 N
0 S
1 C
0 2 - 0
0 1 - 1
==== end ===
=== New seed (11) ===
2 N
0 S
1 C
0 2 - 0
1 2 = 2
==== end ===
=== New seed (12) ===
2 N
0 S
1 C
0 2 - 0
1 2 = 2
0 1 - 1
==== end ===
=== New seed (13) ===
2 N
0 S
1 C
0 2 - 0
0 1 - 1
1 2 = 2
==== end ===
=== New seed (14) ===
1 C
0 S
0 1 - 1
==== end ===
=== New seed (15) ===
1 C
2 N
1 2 = 2
==== end ===
=== New seed (16) ===
1 C
2 N
0 S
1 2 = 2
0 2 - 0
==== end ===
=== New seed (17) ===
1 C
2 N
0 S
1 2 = 2
0 1 - 1
==== end ===
=== New seed (18) ===
1 C
2 N
0 S
1 2 = 2
0 1 - 1
0 2 - 0
==== end ===
=== New seed (19) ===
1 C
2 N
0 S
1 2 = 2
0 2 - 0
0 1 - 1
==== end ===
=== New seed (20) ===
1 C
0 S
2 N
0 1 - 1
0 2 - 0
==== end ===
=== New seed (21) ===
1 C
0 S
2 N
0 1 - 1
1 2 = 2
==== end ===
=== New seed (22) ===
1 C
0 S
2 N
0 1 - 1
1 2 = 2
0 2 - 0
==== end ===
=== New seed (23) ===
1 C
0 S
2 N
0 1 - 1
0 2 - 0
1 2 = 2[…]
january 2011 by rygwdn
Sneak Preview of New Features in PyFilesystem 0.4
january 2011 by rygwdn
There have been some pretty exciting developments in PyFilesystem since version 0.3 was released – Ryan Kelly and myself have been hard at work, and there have been a number of excellent contributions to the code base from other developers. Version 0.4 will be released some time in January, but I'd like to give you a preview of some new features before the next version lands.
Pyfilesystem is a Python module that provides a simplified common interface to many types of filesystem.
It is now possible to open any of the supported filesystems from a URL in this format, which makes it very easy to specify a filesystem (or individual file) from the command line or a config file. Here's a quick example that opens a bunch of quite different filesystems:
from fs.opener import fsopendir
projects_fs = fsopendir('/projects')
zip_fs = fsopendir('zip:///foo/bar/baz.zip')
ftp_fs = fsopendir('ftp://ftp/mozilla.org')
sftp_fs = fsopendir('sftp://example.org')
If you used Pyfilesystem in your application you could trivially change where your files are physically located, or where you save generated files to.
You can also open a file directly without the need to explicitly open the filesystem it is contained within, with the fsopen function, e.g.:
from fs.opener import fsopen
print fsopen('zip:///foo/bar/baz.zip!dir/somefile.txt').read()
print fsopen('ftp://ftp.mozilla.org/pub/README').read()
fsopen is very similar to the builtin open method and will return a file-like object of some kind. In fact, if you pass in a system path it works as open would (although the exceptions will be an instance of fs.errors.FSError rather than IOError).
Because of this similarity with the builtin, fsopen could be used to shadow open, and instantly add the ability for an application to open files on mediums other than a system drive. This is all it takes:
from fs.opener import fsopen as open
FS Commands
Version 0.4 also adds a number of applications that mirror some of the standard command line apps, but extend their functionality with FS URLs. For example, fsls, functions just like the ls command, but works with any of the supported filesystems:
will@will-linux:~$ fsls .will@will-linux:~$ fsls zip://myzipwill@will-linux:~$ fsls ftp://ftp.mozilla.org/pubwill@will-linux:~$ fsls sftp://user:pass@securesite.com/home/will
You can also copy and move files between filesystems with fscp and fsmv, which work in a very similar manner to their cp and mv counterparts, with a few extensions such as multi-threaded copying (great for network filesystems) and an ascii progress bar. The following example copies all the .py files in my projects directory to zip file on an ftp server, and displays an progress bar to boot.
will@will-linux:~$ fscp ~/projects/*.py zip:ftp://will:password@example.org/backups/code.zip
Then there is the fscat command that writes a file in a filesystem to the terminal. The following example displays a python file in the zip file that we created with the previous command:
will@will-linux:~$ fscat zip:ftp://will:password@example.org/backups/code.zip!pythonrocks.py
The other commands fsmkdir, fsrm and fstree work as you may expect.
Serving
The fsserve command adds the ability to serve any of the supported filesystems over a number of protocols. The default is to serve http – in effect creating a webserver. The following is all it takes to serve the contents of your current working directory:
will@will-linux:~$ fsserve
Now if you point your browser at http://127.0.0.1 you will see a web-page with the contents of your current working directory (or a index.html file if present). It's not the most bullet-proof of web servers, but handy if you quickly want to serve some files. Naturally, fsserve works with any filesystem you pass to it. You could, for instance, serve the contents of a zip file without ever explicitly unzipping it, or create an ftp to http gateway by serving an ftp filesystem. The following command creates a ftp to http gateway for ftp.mozilla.org:
will@will-linux:~$ fsserve ftp://ftp.mozilla.org
You also have the option of serving a filesystem over SFTP (Secure FTP), or by RPC (Remote Procedure Call). Either of these two methods expose all the functionality of the remote filesystem, so you could run a server on one machine and create/move/copy/delete files from another machine on the network (or internet). For example, the following would serve the current working directory on localhost, port 3000:
will@will-linux:~$ fsserve -t rpc -p 3000
You can then connect to that server from another machine on the network. Assuming my local IP is 192.168.1.64 the following would display the directory structure from another machine on my network:
will@will-linux:~$ fstree rpc://192.168.1.64:3000
Mounting
Any of the filesystems can be mounted on the OS with the fsmount command, which uses FUSE on Linux or DOKAN on Windows. The advantage of this is that the filesystems exposed in Python can be used in any application, and browsed with Explorer or Nautilus. The following creates a ram drive on Linux:
will@will-linux:~$ fsmount mem:// mem
Or on Windows:
C:\> fsmount mem:// M
Get the Code
There is no documentation online for the new features as yet, but if you are a brave soul and want to experiment with any of the above commands then download the code from SVN and run python setup.py install. The command line apps all have a -h which which displays help on the various options.
Bear in mind that these commands are still somewhat experimental, and some of these commands have the capacity to delete files – so be careful. That said, I'm confident to use them for my day-to-day work.
Please see the projects page page if you want to report bugs or discuss Pyfilesystem with myself and the other developers.
python
fs
filesystem
Pyfilesystem is a Python module that provides a simplified common interface to many types of filesystem.
It is now possible to open any of the supported filesystems from a URL in this format, which makes it very easy to specify a filesystem (or individual file) from the command line or a config file. Here's a quick example that opens a bunch of quite different filesystems:
from fs.opener import fsopendir
projects_fs = fsopendir('/projects')
zip_fs = fsopendir('zip:///foo/bar/baz.zip')
ftp_fs = fsopendir('ftp://ftp/mozilla.org')
sftp_fs = fsopendir('sftp://example.org')
If you used Pyfilesystem in your application you could trivially change where your files are physically located, or where you save generated files to.
You can also open a file directly without the need to explicitly open the filesystem it is contained within, with the fsopen function, e.g.:
from fs.opener import fsopen
print fsopen('zip:///foo/bar/baz.zip!dir/somefile.txt').read()
print fsopen('ftp://ftp.mozilla.org/pub/README').read()
fsopen is very similar to the builtin open method and will return a file-like object of some kind. In fact, if you pass in a system path it works as open would (although the exceptions will be an instance of fs.errors.FSError rather than IOError).
Because of this similarity with the builtin, fsopen could be used to shadow open, and instantly add the ability for an application to open files on mediums other than a system drive. This is all it takes:
from fs.opener import fsopen as open
FS Commands
Version 0.4 also adds a number of applications that mirror some of the standard command line apps, but extend their functionality with FS URLs. For example, fsls, functions just like the ls command, but works with any of the supported filesystems:
will@will-linux:~$ fsls .will@will-linux:~$ fsls zip://myzipwill@will-linux:~$ fsls ftp://ftp.mozilla.org/pubwill@will-linux:~$ fsls sftp://user:pass@securesite.com/home/will
You can also copy and move files between filesystems with fscp and fsmv, which work in a very similar manner to their cp and mv counterparts, with a few extensions such as multi-threaded copying (great for network filesystems) and an ascii progress bar. The following example copies all the .py files in my projects directory to zip file on an ftp server, and displays an progress bar to boot.
will@will-linux:~$ fscp ~/projects/*.py zip:ftp://will:password@example.org/backups/code.zip
Then there is the fscat command that writes a file in a filesystem to the terminal. The following example displays a python file in the zip file that we created with the previous command:
will@will-linux:~$ fscat zip:ftp://will:password@example.org/backups/code.zip!pythonrocks.py
The other commands fsmkdir, fsrm and fstree work as you may expect.
Serving
The fsserve command adds the ability to serve any of the supported filesystems over a number of protocols. The default is to serve http – in effect creating a webserver. The following is all it takes to serve the contents of your current working directory:
will@will-linux:~$ fsserve
Now if you point your browser at http://127.0.0.1 you will see a web-page with the contents of your current working directory (or a index.html file if present). It's not the most bullet-proof of web servers, but handy if you quickly want to serve some files. Naturally, fsserve works with any filesystem you pass to it. You could, for instance, serve the contents of a zip file without ever explicitly unzipping it, or create an ftp to http gateway by serving an ftp filesystem. The following command creates a ftp to http gateway for ftp.mozilla.org:
will@will-linux:~$ fsserve ftp://ftp.mozilla.org
You also have the option of serving a filesystem over SFTP (Secure FTP), or by RPC (Remote Procedure Call). Either of these two methods expose all the functionality of the remote filesystem, so you could run a server on one machine and create/move/copy/delete files from another machine on the network (or internet). For example, the following would serve the current working directory on localhost, port 3000:
will@will-linux:~$ fsserve -t rpc -p 3000
You can then connect to that server from another machine on the network. Assuming my local IP is 192.168.1.64 the following would display the directory structure from another machine on my network:
will@will-linux:~$ fstree rpc://192.168.1.64:3000
Mounting
Any of the filesystems can be mounted on the OS with the fsmount command, which uses FUSE on Linux or DOKAN on Windows. The advantage of this is that the filesystems exposed in Python can be used in any application, and browsed with Explorer or Nautilus. The following creates a ram drive on Linux:
will@will-linux:~$ fsmount mem:// mem
Or on Windows:
C:\> fsmount mem:// M
Get the Code
There is no documentation online for the new features as yet, but if you are a brave soul and want to experiment with any of the above commands then download the code from SVN and run python setup.py install. The command line apps all have a -h which which displays help on the various options.
Bear in mind that these commands are still somewhat experimental, and some of these commands have the capacity to delete files – so be careful. That said, I'm confident to use them for my day-to-day work.
Please see the projects page page if you want to report bugs or discuss Pyfilesystem with myself and the other developers.
january 2011 by rygwdn
Trails in a Langscape » Blog Archive » Fuzzy string matching - Projects and projections
december 2010 by rygwdn
A damn hot algorithm
I found the following article written by Nick Johnson about the use of finite state machines for approximate string matches i.e. string matches which are not exact but bound by a given edit distance. The algorithm is based on so called “Levenshtein automatons”. Those automatons are inspired by the construction of the Levenshtein matrix used for edit distance computations. For more information start reading the WP-article about the Levenshtein algorithm which provides sufficient background for Nicks article.
I downloaded the code from github and checked it out but was very stunned about the time it took for the automaton construction once strings grow big. It took almost 6 minutes on my 1.5 GHz notebook to construct the following Levenshtein automaton:
k = 6
S = "".join(str(s) for s in range(10)*k)
lev = levenshtein_automata(S, k).to_dfa()
The algorithm is advertised as a “damn cool algorithm” by the author but since the major effect on my notebook was producing heat I wonder if “cool” shouldn’t be replaced by “hot”?
In the following article I’m constructing an approximate string matching algorithm from scratch.
Recursive rules for approximate string matching
Let ‘S be a string with len(S)=n and k a positive number with k<=n. By “?” we denote a wildcard symbol that matches any character including no character ( expressing a contraction ). Since S has length n we can select arbitrary k indexes in the set {0,…,n-1} and substitute the characters of S at those indexes using a wildcard symbol. If for example (S = “food” , k = 1 and index = 2) we get “fo?d”.
We describe the rule which describes all possible character substitutions in “food” like this:
pattern(food, 1) = ?ood | f?od | fo?d | foo?
Applying successive left factorings yields:
pattern(food, 1) = ?ood | f ( ?od | o (?d | o? ) )
This inspires a recursive notation which roughly looks like this:
pattern(food, 1) = ?ood | f pattern(ood, 1)
or more precisely:
pattern(c, 1) = ?
pattern(S, 1) = ?S[1:] | S[0] pattern(S[1:], 1)
where we have used a string variable S instead of the concrete string “food”.
When using an arbitrary k instead of a fixed k = 1 we get the following recursive equations:
pattern(c, k) = ?
pattern(S, k) = ?pattern(S[1:], k-1) | S[0] pattern(S[1:], k)
Consuming or not consuming?
When we try to find an efficient implementation for the pattern function described above we need an interpretation of the ? wildcard action. It can consume a character and feed the rest of the string into a new call of pattern or skip a character and do the same with the rest. Since we cannot decide the choice for every string by default we eventually need backtracking but since we can memoize intermediate results we can also lower efforts. But step by step …
The basic idea when matching a string S1 against a string S2 is that we attempt to match S1[0] against S2[0] and when we succeed, we continue matching S[1:] against S2[1:] using the same constant k. If we fail, we have several choices depending on the interpretation of the wildcard action: it can consume a character of S2 or leave S2 as it is. Finally S1 and S2 are exchangeable, so we are left with the following choices:
fuzzy_compare(S1, S2[1:], k-1)
fuzzy_compare(S2, S1[1:], k-1)
fuzzy_compare(S1[1:], S2[1:], k-1)
All of those choices are valid and if one fails we need to check out another one. This is sufficient for starting a first implementation.
A first implementation
The following implementation is a good point to start with:
def fuzzy_compare(S1, S2, k, i=0, j=0):
N1 = len(S1)
N2 = len(S2)
while True:
if N1-i<=k and N2-j<=k:
return True
try:
if S1[i] == S2[j]:
i+=1
j+=1
continue
except IndexError:
return False
if k == 0:
return False
else:
if fuzzy_compare(S1, S2, k-1, i+1, j):
return True
if fuzzy_compare(S1, S2, k-1, i, j+1):
return True
if fuzzy_compare(S1, S2, k-1, i+1, j+1):
return True
return False
The algorithm employs full backtracking and it is also limited to medium sized strings ( in Python ) because of recursion. But it shows the central ideas and is simple.
A second implementation using memoization
Our second implementation still uses recursion but introduces a dictionary which records all (i,j) index pairs that have been encountered and stores the current value of k. If the algorithm finds a value k’ at (i,j) with k’<=k it will immediately return False because this particular trace has been unsuccessfully checked out before. Using ann x n matrix to memoize results will reduce the complexity of the algorithm which becomes O(n^2) where n is the length of the string. In fact it will be even O(n) because only a stripe of width 2k along the diagonal of the (i,j)-matrix is checked out. Of course the effort depends on the constant k.
def is_visited(i, j, k, visited):
m = visited.get((i,j),-1)
if k>m:
visited[(i,j)] = k
return False
return True
def fuzzy_compare(S1, S2, k, i = 0, j = 0, visited = None):
N1 = len(S1)
N2 = len(S2)
if visited is None:
visited = {}
while True:
if N1-i<=k and N2-j<=k:
return True
try:
if S1[i] == S2[j]:
visited[(i,j)] = k
i+=1
j+=1
continue
except IndexError:
return False
if k == 0:
return False
else:
if not is_visited(i+1, j, k-1, visited):
if fuzzy_compare(S1, S2, k-1, i+1, j, visited):
return True
if not is_visited(i, j+1, k-1, visited):
if fuzzy_compare(S1, S2, k-1, i, j+1, visited):
return True
if not is_visited(i+1, j+1, k-1, visited):
if fuzzy_compare(S1, S2, k-1, i+1, j+1, visited):
return True
return False
A third implementation eliminating recursion
In our third and final implementation we eliminate the recursive call to fuzzy_compare and replace it using a stack containing tuples (i, j, k) recording the current state.
def is_visited(i, j, k, visited):
m = visited.get((i,j),-1)
if k>m:
visited[(i,j)] = k
return False
return True
def fuzzy_compare(S1, S2, k):
N1 = len(S1)
N2 = len(S2)
i = j = 0
visited = {}
stack = []
while True:
if N1-i<=k and N2-j<=k:
return True
try:
if S1[i] == S2[j]:
visited[(i,j)] = k
i+=1
j+=1
continue
except IndexError:
if stack:
i, j, k = stack.pop()
else:
return False
if k == 0:
if stack:
i, j, k = stack.pop()
else:
return False
else:
if not is_visited(i+1, j, k-1, visited):
stack.append((i,j,k))
i, j, k = i+1, j, k-1
elif not is_visited(i, j+1, k-1, visited):
stack.append((i,j,k))
i, j, k = i, j+1, k-1
elif not is_visited(i+1, j+1, k-1, visited):
i, j, k = i+1, j+1, k-1
elif stack:
i, j, k = stack.pop()
else:
return False
This is still a nice algorithm and it should be easy to translate it into C or into JavaScript for using it in your browser. Notice that the sequence of if … elif branches can impact performance of the algorithm. Do you see a way to improve it?
Checking the algorithm
When D is the Levenshtein distance between two strings S1 and S2 then fuzzy_compare(S1, S2, k) shall be True for k>=D and False otherwise. So when you want to test fuzzy_compare compute the Levenshtein distance and check fuzzy_compare with the boundary values k = D and k = D-1.
def levenshtein(s1, s2):
l1 = len(s1)
l2 = len(s2)
matrix = [range(l1 + 1)] * (l2 + 1)
for zz in range(l2 + 1):
matrix[zz] = range(zz,zz + l1 + 1)
for zz in range(0,l2):
for sz in range(0,l1):
if s1[sz] == s2[zz]:
matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1,
matrix[zz][sz+1] + 1,
matrix[zz][sz])
else:
matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1,
matrix[zz][sz+1] + 1,
matrix[zz][sz] + 1)
return matrix[l2][l1]
For exhaustive testing we define a set of strings as follows:
Given a prescribed n we define the set of strings of length = n which consists of “a” and “b” characters only. The number of those strings is 2^n. If we consider all pairs of strings in that set we get 2^(2n) of such pairs. Of course we could exploit symmetries to remove redundant pairs but in order to keep it simple we examine only small strings e.g. n = 6 which leads to 4096 pairs altogether.
def string_set(S = None, k = 0, strings = None, n = 6):
if S is None:
strings = []
S = ["a"]*n
strings.append("".join(S))
for i in range(k, n):
S1 = S[:]
S1[i] = "b"
strings.append("".join(S1))
string_set(S1, i+1, strings, n)
return strings
def string_pairs(n):
L1 = string_set(n=n)
pairs = []
for i in range(len(L1)):
for k in range(1, n+1):
L2 = string_set(n=k)
for j in range(len(L2)):
pairs.append((L1[i],L2[j],levenshtein(L1[i], L2[j])))
pairs.append((L2[j],L1[i],levenshtein(L2[j], L1[i])))
return pairs
Our test function is s[…]
python
matching
strings
algorithms
I found the following article written by Nick Johnson about the use of finite state machines for approximate string matches i.e. string matches which are not exact but bound by a given edit distance. The algorithm is based on so called “Levenshtein automatons”. Those automatons are inspired by the construction of the Levenshtein matrix used for edit distance computations. For more information start reading the WP-article about the Levenshtein algorithm which provides sufficient background for Nicks article.
I downloaded the code from github and checked it out but was very stunned about the time it took for the automaton construction once strings grow big. It took almost 6 minutes on my 1.5 GHz notebook to construct the following Levenshtein automaton:
k = 6
S = "".join(str(s) for s in range(10)*k)
lev = levenshtein_automata(S, k).to_dfa()
The algorithm is advertised as a “damn cool algorithm” by the author but since the major effect on my notebook was producing heat I wonder if “cool” shouldn’t be replaced by “hot”?
In the following article I’m constructing an approximate string matching algorithm from scratch.
Recursive rules for approximate string matching
Let ‘S be a string with len(S)=n and k a positive number with k<=n. By “?” we denote a wildcard symbol that matches any character including no character ( expressing a contraction ). Since S has length n we can select arbitrary k indexes in the set {0,…,n-1} and substitute the characters of S at those indexes using a wildcard symbol. If for example (S = “food” , k = 1 and index = 2) we get “fo?d”.
We describe the rule which describes all possible character substitutions in “food” like this:
pattern(food, 1) = ?ood | f?od | fo?d | foo?
Applying successive left factorings yields:
pattern(food, 1) = ?ood | f ( ?od | o (?d | o? ) )
This inspires a recursive notation which roughly looks like this:
pattern(food, 1) = ?ood | f pattern(ood, 1)
or more precisely:
pattern(c, 1) = ?
pattern(S, 1) = ?S[1:] | S[0] pattern(S[1:], 1)
where we have used a string variable S instead of the concrete string “food”.
When using an arbitrary k instead of a fixed k = 1 we get the following recursive equations:
pattern(c, k) = ?
pattern(S, k) = ?pattern(S[1:], k-1) | S[0] pattern(S[1:], k)
Consuming or not consuming?
When we try to find an efficient implementation for the pattern function described above we need an interpretation of the ? wildcard action. It can consume a character and feed the rest of the string into a new call of pattern or skip a character and do the same with the rest. Since we cannot decide the choice for every string by default we eventually need backtracking but since we can memoize intermediate results we can also lower efforts. But step by step …
The basic idea when matching a string S1 against a string S2 is that we attempt to match S1[0] against S2[0] and when we succeed, we continue matching S[1:] against S2[1:] using the same constant k. If we fail, we have several choices depending on the interpretation of the wildcard action: it can consume a character of S2 or leave S2 as it is. Finally S1 and S2 are exchangeable, so we are left with the following choices:
fuzzy_compare(S1, S2[1:], k-1)
fuzzy_compare(S2, S1[1:], k-1)
fuzzy_compare(S1[1:], S2[1:], k-1)
All of those choices are valid and if one fails we need to check out another one. This is sufficient for starting a first implementation.
A first implementation
The following implementation is a good point to start with:
def fuzzy_compare(S1, S2, k, i=0, j=0):
N1 = len(S1)
N2 = len(S2)
while True:
if N1-i<=k and N2-j<=k:
return True
try:
if S1[i] == S2[j]:
i+=1
j+=1
continue
except IndexError:
return False
if k == 0:
return False
else:
if fuzzy_compare(S1, S2, k-1, i+1, j):
return True
if fuzzy_compare(S1, S2, k-1, i, j+1):
return True
if fuzzy_compare(S1, S2, k-1, i+1, j+1):
return True
return False
The algorithm employs full backtracking and it is also limited to medium sized strings ( in Python ) because of recursion. But it shows the central ideas and is simple.
A second implementation using memoization
Our second implementation still uses recursion but introduces a dictionary which records all (i,j) index pairs that have been encountered and stores the current value of k. If the algorithm finds a value k’ at (i,j) with k’<=k it will immediately return False because this particular trace has been unsuccessfully checked out before. Using ann x n matrix to memoize results will reduce the complexity of the algorithm which becomes O(n^2) where n is the length of the string. In fact it will be even O(n) because only a stripe of width 2k along the diagonal of the (i,j)-matrix is checked out. Of course the effort depends on the constant k.
def is_visited(i, j, k, visited):
m = visited.get((i,j),-1)
if k>m:
visited[(i,j)] = k
return False
return True
def fuzzy_compare(S1, S2, k, i = 0, j = 0, visited = None):
N1 = len(S1)
N2 = len(S2)
if visited is None:
visited = {}
while True:
if N1-i<=k and N2-j<=k:
return True
try:
if S1[i] == S2[j]:
visited[(i,j)] = k
i+=1
j+=1
continue
except IndexError:
return False
if k == 0:
return False
else:
if not is_visited(i+1, j, k-1, visited):
if fuzzy_compare(S1, S2, k-1, i+1, j, visited):
return True
if not is_visited(i, j+1, k-1, visited):
if fuzzy_compare(S1, S2, k-1, i, j+1, visited):
return True
if not is_visited(i+1, j+1, k-1, visited):
if fuzzy_compare(S1, S2, k-1, i+1, j+1, visited):
return True
return False
A third implementation eliminating recursion
In our third and final implementation we eliminate the recursive call to fuzzy_compare and replace it using a stack containing tuples (i, j, k) recording the current state.
def is_visited(i, j, k, visited):
m = visited.get((i,j),-1)
if k>m:
visited[(i,j)] = k
return False
return True
def fuzzy_compare(S1, S2, k):
N1 = len(S1)
N2 = len(S2)
i = j = 0
visited = {}
stack = []
while True:
if N1-i<=k and N2-j<=k:
return True
try:
if S1[i] == S2[j]:
visited[(i,j)] = k
i+=1
j+=1
continue
except IndexError:
if stack:
i, j, k = stack.pop()
else:
return False
if k == 0:
if stack:
i, j, k = stack.pop()
else:
return False
else:
if not is_visited(i+1, j, k-1, visited):
stack.append((i,j,k))
i, j, k = i+1, j, k-1
elif not is_visited(i, j+1, k-1, visited):
stack.append((i,j,k))
i, j, k = i, j+1, k-1
elif not is_visited(i+1, j+1, k-1, visited):
i, j, k = i+1, j+1, k-1
elif stack:
i, j, k = stack.pop()
else:
return False
This is still a nice algorithm and it should be easy to translate it into C or into JavaScript for using it in your browser. Notice that the sequence of if … elif branches can impact performance of the algorithm. Do you see a way to improve it?
Checking the algorithm
When D is the Levenshtein distance between two strings S1 and S2 then fuzzy_compare(S1, S2, k) shall be True for k>=D and False otherwise. So when you want to test fuzzy_compare compute the Levenshtein distance and check fuzzy_compare with the boundary values k = D and k = D-1.
def levenshtein(s1, s2):
l1 = len(s1)
l2 = len(s2)
matrix = [range(l1 + 1)] * (l2 + 1)
for zz in range(l2 + 1):
matrix[zz] = range(zz,zz + l1 + 1)
for zz in range(0,l2):
for sz in range(0,l1):
if s1[sz] == s2[zz]:
matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1,
matrix[zz][sz+1] + 1,
matrix[zz][sz])
else:
matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1,
matrix[zz][sz+1] + 1,
matrix[zz][sz] + 1)
return matrix[l2][l1]
For exhaustive testing we define a set of strings as follows:
Given a prescribed n we define the set of strings of length = n which consists of “a” and “b” characters only. The number of those strings is 2^n. If we consider all pairs of strings in that set we get 2^(2n) of such pairs. Of course we could exploit symmetries to remove redundant pairs but in order to keep it simple we examine only small strings e.g. n = 6 which leads to 4096 pairs altogether.
def string_set(S = None, k = 0, strings = None, n = 6):
if S is None:
strings = []
S = ["a"]*n
strings.append("".join(S))
for i in range(k, n):
S1 = S[:]
S1[i] = "b"
strings.append("".join(S1))
string_set(S1, i+1, strings, n)
return strings
def string_pairs(n):
L1 = string_set(n=n)
pairs = []
for i in range(len(L1)):
for k in range(1, n+1):
L2 = string_set(n=k)
for j in range(len(L2)):
pairs.append((L1[i],L2[j],levenshtein(L1[i], L2[j])))
pairs.append((L2[j],L1[i],levenshtein(L2[j], L1[i])))
return pairs
Our test function is s[…]
december 2010 by rygwdn
Mobile Web Development with Django | Imaginary Landscape Blog | Chicago Django and Python Web Development
december 2010 by rygwdn
If you've ever looked into creating your first Django project targeted at mobile devices, you were probably quick to realize that there is no be all, end all solution. Mobile development decisions have to be made with regards to handheld device detection, redirects, how to deal with desktop vs ...
django
mobile
python
development
article
from delicious
december 2010 by rygwdn
Finding C reference leaks using the gc module | Kristján's cosmic percolator
december 2010 by rygwdn
A while back I got a defect assigned to me complaining that clicking a button on our server backend web pages caused the server to freeze. The link was on our “python” page, containing various tools and information on the python interpreter embedded in EVE. The link itself was, interestingly enough, called “Leaky C++”.
Looking at the source code I saw code similar to this:
import gc
def get_leaking_objects_naive():
all = gc.get_objects()
result = []
find = [all]
for i in xrange(len(all)):
if gc.get_referrers(all[i]) == find:
result.append(all[i])
return result
The idea is to find all objects that do not appear to be referenced by any other object and produce a report on those objects. The problem with this code is, however, that it is O(N^2). When there are sufficient objects in the system, this code takes forever to run.
I disabled this code and thought nothing more of it. But recently, it occurred to me that the idea might not be too bad and whether there might be a better way to do this. It turns out there is. Instead of finding the referrers in this manner, we use another gc method, gc.get_referents() that returns objects immediately referred to by an object. This is an O(1) operation and by repeately using it and eliminating objects that are such targets we can weed out everything that has referrers. We then end up with the list of objects that have no referrers, in one fell O(N) swoop:
def get_leaking_objects():
#create a dict of ids to objects
all = dict((id(i), i) for i in gc.get_objects())
#find all the objects that aren't referred to by any other object
ids = set(all.keys())
for i in all.values():
ids.difference_update(id(j) for j in gc.get_referents(i))
#this then is our set of objects without referrers
return [all[i] for i in ids]
This turns out to work surprisingly well. Combined with a object hierarchy browser, this allows us to find suspicious objects, identify them and thus home in on the C code that may be causing trouble.
There is a caveat to this, and it is that gc.get_objects() and gc.get_referents() are documented to only return objects that can be part of a reference cycle. So your leaking strings and integers won’t show up using this tool.
update:
I just made two improvements to the code.
It is faster and uses less memory to skip creating a dict out of the objects.
We must make sure not to leave cyclic references lying about. the “all” variable contains the current function frame so unless we clear it, the frame and its contents (including “all”) isn’t released immediately.
def get_leaking_objects2():
all = gc.get_objects()
try:
ids = set(id(i) for i in all)
for i in all:
ids.difference_update(id(j) for j in gc.get_referents(i))
#this then is our set of objects without referrers
return [i for i in all if id(i) in ids]
finally:
all = i = j = None #clear cyclic references to frame
code
debugging
memory
python
leaks
from delicious
Looking at the source code I saw code similar to this:
import gc
def get_leaking_objects_naive():
all = gc.get_objects()
result = []
find = [all]
for i in xrange(len(all)):
if gc.get_referrers(all[i]) == find:
result.append(all[i])
return result
The idea is to find all objects that do not appear to be referenced by any other object and produce a report on those objects. The problem with this code is, however, that it is O(N^2). When there are sufficient objects in the system, this code takes forever to run.
I disabled this code and thought nothing more of it. But recently, it occurred to me that the idea might not be too bad and whether there might be a better way to do this. It turns out there is. Instead of finding the referrers in this manner, we use another gc method, gc.get_referents() that returns objects immediately referred to by an object. This is an O(1) operation and by repeately using it and eliminating objects that are such targets we can weed out everything that has referrers. We then end up with the list of objects that have no referrers, in one fell O(N) swoop:
def get_leaking_objects():
#create a dict of ids to objects
all = dict((id(i), i) for i in gc.get_objects())
#find all the objects that aren't referred to by any other object
ids = set(all.keys())
for i in all.values():
ids.difference_update(id(j) for j in gc.get_referents(i))
#this then is our set of objects without referrers
return [all[i] for i in ids]
This turns out to work surprisingly well. Combined with a object hierarchy browser, this allows us to find suspicious objects, identify them and thus home in on the C code that may be causing trouble.
There is a caveat to this, and it is that gc.get_objects() and gc.get_referents() are documented to only return objects that can be part of a reference cycle. So your leaking strings and integers won’t show up using this tool.
update:
I just made two improvements to the code.
It is faster and uses less memory to skip creating a dict out of the objects.
We must make sure not to leave cyclic references lying about. the “all” variable contains the current function frame so unless we clear it, the frame and its contents (including “all”) isn’t released immediately.
def get_leaking_objects2():
all = gc.get_objects()
try:
ids = set(id(i) for i in all)
for i in all:
ids.difference_update(id(j) for j in gc.get_referents(i))
#this then is our set of objects without referrers
return [i for i in all if id(i) in ids]
finally:
all = i = j = None #clear cyclic references to frame
december 2010 by rygwdn
related tags
3d ⊕ addons ⊕ ai ⊕ ajax ⊕ algorithm ⊕ algorithms ⊕ amazon ⊕ animation ⊕ ansi ⊕ apache ⊕ api ⊕ appengine ⊕ apple ⊕ apps ⊕ article ⊕ ascii ⊕ async ⊕ audio ⊕ automation ⊕ badges ⊕ bash ⊕ bayesian ⊕ bazaar ⊕ bdd ⊕ benchmark ⊕ benchmarks ⊕ blog ⊕ blogs ⊕ book ⊕ books ⊕ boost ⊕ borg ⊕ bridge ⊕ browser ⊕ bugs ⊕ bugtracking ⊕ build ⊕ buildbot ⊕ buildout ⊕ bundle ⊕ business ⊕ bzr ⊕ c ⊕ c++ ⊕ cache ⊕ cachegrind ⊕ calculator ⊕ calendar ⊕ charts ⊕ cheatsheet ⊕ cheesecake ⊕ chemistry ⊕ cherrypy ⊕ cli ⊕ cloud ⊕ cluster ⊕ cms ⊕ cocoa ⊕ code ⊕ codereview ⊕ coding ⊕ collaboration ⊕ color ⊕ commands ⊕ comparison ⊕ compiler ⊕ completion ⊕ complexity ⊕ computer ⊕ concurrency ⊕ configuration ⊕ console ⊕ conversion ⊕ convert ⊕ converter ⊕ cool ⊕ coroutines ⊕ cpp ⊕ cpu ⊕ crash ⊕ crawler ⊕ cron ⊕ css ⊕ ctypes ⊕ cucumber ⊕ cython ⊕ daemon ⊕ data ⊕ database ⊕ datastructures ⊕ date ⊕ db ⊕ dba ⊕ dbus ⊕ deb ⊕ debian ⊕ debug ⊕ debugger ⊕ debugging ⊕ deploy ⊕ deployment ⊕ design ⊕ designpatterns ⊕ desktop ⊕ development ⊕ diff ⊕ distributed ⊕ distribution ⊕ distutils ⊕ diy ⊕ django ⊕ documentation ⊕ dot ⊕ dotnet ⊕ dvcs ⊕ easy_install ⊕ editor ⊕ education ⊕ emacs ⊕ email ⊕ embedded ⊕ emulator ⊕ encryption ⊕ engineering ⊕ english ⊕ esky ⊕ excel ⊕ exe ⊕ exercises ⊕ extension ⊕ extensions ⊕ fabric ⊕ facebook ⊕ filesystem ⊕ firebug ⊕ firefox ⊕ flash ⊕ flask ⊕ flatblocks ⊕ format ⊕ formatting ⊕ forms ⊕ framework ⊕ free ⊕ fs ⊕ fun ⊕ functional ⊕ funny ⊕ fuse ⊕ futures ⊕ gallery ⊕ game ⊕ gamedev ⊕ games ⊕ gaming ⊕ gdb ⊕ geek ⊕ generator ⊕ genetic ⊕ genetic-algorithms ⊕ git ⊕ github ⊕ gnome ⊕ gnome-do ⊕ go ⊕ google ⊕ grammar ⊕ graph ⊕ graphics ⊕ graphs ⊕ graphviz ⊕ grid ⊕ gtd ⊕ gtk ⊕ gui ⊕ guide ⊕ hack ⊕ hacking ⊕ haml ⊕ hardware ⊕ haskell ⊕ hdl ⊕ hg ⊕ homebrew ⊕ hosting ⊕ howto ⊕ html ⊕ html5 ⊕ hudson ⊕ humor ⊕ ical ⊕ icalendar ⊕ ide ⊕ image ⊕ images ⊕ input ⊕ inspiration ⊕ install ⊕ installation ⊕ interesting ⊕ interview ⊕ intro ⊕ ios ⊕ iphone ⊕ ironpython ⊕ itcup ⊕ java ⊕ javascript ⊕ jit ⊕ jquery ⊕ js ⊕ json ⊕ jython ⊕ kids ⊕ language ⊕ languages ⊕ latex ⊕ lazy ⊕ leaks ⊕ learn ⊕ learning ⊕ libraries ⊕ library ⊕ libxml2 ⊕ lifestream ⊕ linux ⊕ list ⊕ llvm ⊕ loadtesting ⊕ localhost ⊕ logging ⊕ mac ⊕ machinelearning ⊕ make ⊕ management ⊕ mapreduce ⊕ maps ⊕ matching ⊕ math ⊕ measure ⊕ memory ⊕ mercurial ⊕ merge ⊕ microsoft ⊕ middleware ⊕ migration ⊕ mobile ⊕ mock ⊕ modeling ⊕ modules ⊕ monad ⊕ monitoring ⊕ multitouch ⊕ navigation ⊕ network ⊕ networking ⊕ networks ⊕ nose ⊕ notes ⊕ nvidia ⊕ odt ⊕ office ⊕ oneliners ⊕ online ⊕ open-source ⊕ opencl ⊕ opencv ⊕ opengl ⊕ openoffice ⊕ opensource ⊕ optimization ⊕ options ⊕ oracle ⊕ osx ⊕ package ⊕ packages ⊕ packaging ⊕ parallel ⊕ parser ⊕ parsing ⊕ password ⊕ passwords ⊕ path ⊕ patterns ⊕ paver ⊕ pdb ⊕ pdf ⊕ pep ⊕ performance ⊕ photo ⊕ php ⊕ pil ⊕ pip ⊕ plone ⊕ plugin ⊕ plugins ⊕ presentation ⊕ process ⊕ productivity ⊕ profiling ⊕ programming ⊕ proj-haml ⊕ proj-mysite ⊕ proj-tracey ⊕ project ⊕ promises ⊕ proxy ⊕ publishing ⊕ py2app ⊕ py2exe ⊕ pycon ⊕ pygtk ⊕ pylint ⊕ pylons ⊕ pypi ⊕ pypy ⊕ pyramid ⊕ pytest ⊕ python ⊖ qt ⊕ queue ⊕ quickly ⊕ rails ⊕ rally ⊕ ram ⊕ random ⊕ rant ⊕ reader ⊕ realtime ⊕ refactoring ⊕ reference ⊕ regexp ⊕ registration ⊕ remote ⊕ repository ⊕ repoze ⊕ research ⊕ resources ⊕ rest ⊕ review ⊕ rss ⊕ rst ⊕ rst2pdf ⊕ rsync ⊕ ruby ⊕ s3 ⊕ sage ⊕ sanitize ⊕ school ⊕ science ⊕ scipy ⊕ scm ⊕ scraping ⊕ screen ⊕ screencast ⊕ script ⊕ search ⊕ security ⊕ selenium ⊕ server ⊕ setup ⊕ shell ⊕ shipping ⊕ shuffle ⊕ signals ⊕ simulation ⊕ singleton ⊕ site ⊕ skeleton ⊕ snippets ⊕ social ⊕ software ⊕ sorting ⊕ sound ⊕ source ⊕ spider ⊕ spreadsheet ⊕ sql ⊕ sqlalchemy ⊕ sqlite ⊕ ssh ⊕ stackless ⊕ starter ⊕ statistics ⊕ stats ⊕ storage ⊕ strftime ⊕ strings ⊕ study ⊕ style ⊕ success ⊕ sysadmin ⊕ tablet ⊕ tabs ⊕ tagging ⊕ tdd ⊕ tech ⊕ technology ⊕ template ⊕ templates ⊕ templating ⊕ terminal ⊕ test ⊕ testing ⊕ tex ⊕ text ⊕ theme ⊕ themes ⊕ thread ⊕ threading ⊕ time ⊕ tips ⊕ todo ⊕ tomboy ⊕ toodledo ⊕ tool ⊕ tools ⊕ toread ⊕ tutorial ⊕ tutorials ⊕ twisted ⊕ typing ⊕ typography ⊕ ubuntu ⊕ ui ⊕ uml ⊕ unittesting ⊕ unix ⊕ unladen ⊕ update ⊕ updates ⊕ valgrind ⊕ vcs ⊕ vector ⊕ versioncontrol ⊕ versioning ⊕ vi ⊕ video ⊕ videos ⊕ vim ⊕ virtualenv ⊕ visualization ⊕ wacom ⊕ weave ⊕ web ⊕ webcam ⊕ webdesign ⊕ webdev ⊕ webhosting ⊕ webserver ⊕ wii ⊕ wiki ⊕ window ⊕ windowmanager ⊕ windows ⊕ wine ⊕ word ⊕ wordpress ⊕ work ⊕ writing ⊕ wsgi ⊕ x11 ⊕ xls ⊕ xml ⊕ xpath ⊕ xslt ⊕ zsh ⊕Copy this bookmark: