rygwdn + python   418

Pipe: Infix syntax for Python | {Dev Tricks}
Pipe is a Python module enabling infix syntax in Python.
For those asking “Why ?” let’s take an example :

Compare the readability of the classical prefix syntax :

sum(select(where(take_while(fib(), lambda x: x < 1000000) lambda x: x % 2), lambda x: x * x))

And the infix syntax :

fib() | take_while(lambda x: x < 1000000) \
| where(lambda x: x % 2) \
| select(lambda x: x * x) \
| sum()

Isn’t the infix syntax more readable ?

The base class of Pipe is kept simple (7 lines of python) and is usable as a decorator permitting you to create new ‘pipeable’ functions easily. The module provides ~30 prepared pipes functions like ‘where’, ‘group_by’, ‘sort’, ‘take_while’ … A pipeable function takes an iterable (tuple, list, generator) and yields to be itself an iterator, so pipeable function can be piped together.

Let me introduce the basic usage of the Pipe module, then I’ll write some bits on how to build new ones :

To start, get it from PyPI http://pypi.python.org/pypi/pipe/1.3 and install it, open a REPL, import pipe, and play :

Python 2.6.6 (r266:84292, Dec 26 2010, 22:31:48)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from pipe import *
>>> [1, 2, 3, 4, 5] | add
15
>>> [5, 4, 3, 2, 1] | sort
[1, 2, 3, 4, 5]

Until here it’s easy, to know more about available pipes, just read the help(pipe) in the REPL, all are explained with an example as a doctest

Now as we know that pipeable functions use iterables, we can try to pipe together two or more pipeables :

>>> [1, 2, 3, 4, 5] | where(lambda x: x % 2) | concat
'1, 3, 5'
>>> [1, 2, 3, 4, 5] | where(lambda x: x % 2) | tail(2) | concat
'3, 5'
>>> [1, 2, 3, 4, 5] | where(lambda x: x % 2) | tail(2) | select(lambda x: x * x) | concat
'9, 25'
>>> [1, 2, 3, 4, 5] | where(lambda x: x % 2) | tail(2) | select(lambda x: x * x) | add
34

Now, a bit about lazyness, as Pipe use iterables, the evaluation of a whole Pipe is lazy, so we can play with infinite generators like this one :

>>> def fib():
... x = 1
... yield 1
... y = 1
... yield 1
... while True:
... x = x + y
... yield x
... y = x + y
... yield y

Now we can do every kind of stuff into the fibonacci sequence, like solving the 2nd problem of http://projecteuler.net in a readable one liner :

Find the sum of all the even-valued terms in Fibonacci which do not exceed four million.

>>> euler2 = fib() | where(lambda x: x % 2 == 0) | take_while(lambda x: x < 4000000) | add
>>> assert euler2 == 4613732

Isn it pretty ?

Let now see how to create new pipeable functions using the @pipe decorator :
You want to create a function that yields the first x elements from its input
You want its usage to be (1, 2, 3, 4, 5) | take(2) to take the fist 2 elements.
I know that you are thinking about a basic implementation like :

def take(iterable, qte):
for item in iterable:
if qte > 0:
qte -= 1
yield item
else:
return

Right ? You take an iterable, a qantity, and while the quantity is not reached, you just have to yield ?
OK, just add @pipe to you take function and it’s pipeable :-)
code  library  programming  python 
april 2011 by rygwdn
Interesting Things, Largely Python and Twisted Related: Twisted Conch in 60 Seconds: Introduction to an SSH server
pGreetings once more, readers.nbsp; A year ago I introduced you to a href=http://twistedmatrix.com/documents/current/web/howto/web-in-60/index.htmlthe simplicity of web programming with Twisted/a.nbsp; Since then, I've been eager to try to duplicate the success of the series for another topic.nbsp; Welcome to Twisted strongConch/strong in 60 Seconds./ppa href=http://twistedmatrix.com/trac/wiki/TwistedConchTwisted Conch/a implements the SSH protocol for both clients and servers.nbsp; It also implements client and server applications.nbsp; It is also a library for writing custom SSH client and server software.nbsp; This series will focus on using Twisted Conch as an SSH library, and the first articles will cover writing custom SSH servers, with clients covered later./ppIn this first installment, the goal is to set up an SSH server which will accept SSH client connections butbr /which cannot actually authenticate any users (so no one will be able to log in)./ppBefore getting into any code, the first thing we need for an SSH server is a server key. nbsp;This can be generated using ssh-keygen from OpenSSH or with Conch's own key generation tool, ckeygen. nbsp;Since ample examples of ssh-keygen can be found elsewhere, here's an example of generating a key with ckeygen:/pblockquotepspanexarkun@boson:/tmp$ strongckeygen -t rsa -f id_rsa/strongbr /Generating public/private rsa key pair.br /Enter passphrase (empty for no passphrase):br /Enter same passphrase again:br /Your identification has been saved in id_rsabr /Your public key has been saved in id_rsa.pubbr /The key fingerprint is:br /d5:1d:ba:47:a4:75:1a:fe:d4:ba:46:2e:44:67:0d:8bbr /exarkun@boson:/tmp$ ls -l id_rsa id_rsa.pubbr /-rw------- 1 1000 1000 886 2011-03-23 22:59 strongid_rsa/strongbr /-rwxr-xr-x 1 1000 1000 226 2011-03-23 22:59 strongid_rsa.pub/strongbr /exarkun@boson:/tmp$nbsp;/span/p/blockquotedivHere's where the actual Conch code begins. nbsp;First, some imports:/divpcode/code/pprefrom twisted.conch.ssh.factory import SSHFactory/prep/pbr /spana href=http://twistedmatrix.com/documents/10.2.0/api/twisted.conch.ssh.factory.SSHFactory.htmlSSHFactory/a/spanspannbsp;is an /spanspana href=http://twistedmatrix.com/documents/10.2.0/api/twisted.internet.interfaces.IProtocolFactory.htmlIProtocolFactory/a/spanspannbsp;which is used to create protocol instances to handle new connections that arrive at a listening port. nbsp;/spanspanSSHFactory/spanspannbsp;in particular creates protocol instances that can carry on the server side of an SSH connection. nbsp;We need an instance to hook up to a listening port:/spanbr /code/codeprefactory = SSHFactory()/prep/pbr /Nothing very exciting going on there. nbsp;spanSSHFactory/spannbsp;doesn't take any initializer arguments. nbsp;It does have some useful attributes you can set, though. nbsp;Two important attributes are spanpublicKeys/span and spanprivateKeys/spanspan. nbsp;These define what key the server uses to identify itself to clients. nbsp;A server can have multiple keys, so these attributes are bound to dictionaries. nbsp;In this example we'll just give the server one key though. nbsp;For this, we need to have a couplenbsp;/spanspana href=http://twistedmatrix.com/documents/10.2.0/api/twisted.conch.ssh.keys.Key.htmlKey/a/spanspannbsp;objects:/spanbr /code/codeprefrom twisted.conch.ssh.keys import Keybr /br /with open('id_rsa') as privateBlobFile:br / privateBlob = privateBlobFile.read()br / privateKey = Key.fromString(data=privateBlob)br /br /with open('id_rsa.pub') as publicBlobFile:br / publicBlob = publicBlobFile.read()br / publicKey = Key.fromString(data=publicBlob)br //prep/pdivspanThe /spanspanKey.fromString/spanspannbsp;method can parse several formats, including the standard format keys are kept in by OpenSSH (which is also the format ckeygen emits)./span/divdivNow we can set up the factory to know about these keys:/divdivcode/code/divprefactory.privateKeys = {'ssh-rsa': privateKey}br /factory.publicKeys = {'ssh-rsa': publicKey}br //prep/pdivClients connecting to this server will get to see this public key and then use it to challenge the server to prove it also has the private key by signing some random data correctly./divdivspanOne more attribute needs to be set on the factory. nbsp;/spanspanprivateKeys/spanspan and /spanspanpublicKeys/spanspannbsp;let clients identify the server. nbsp;Now the server needs some way to identify clients. nbsp;This is done using a /spanspana href=http://twistedmatrix.com/documents/10.2.0/api/twisted.cred.portal.Portal.htmlPortal/a/spanspannbsp;from /spanspantwisted.cred/spanspan. nbsp;However, I said this server wouldn't be able to authenticate users, so this example will only set a placeholder value here which will fail all authentication attempts:/span/divprebr /codefrom twisted.cred.portal import Portalbr /factory.portal = Portal(None)/codebr //prepLater we'll revisit this attribute so that users can actually log in to the server./ppThe example is almost done now. nbsp;The only thing left to do is listen on a port with this factory and run the reactor. nbsp;For this we'll need to import the reactor. nbsp;We have several options for how to listen on a port, but for this example I'll use the classic spanreactor.listenTCP/spanspan:/spanbr /code/code/pprefrom twisted.internet import reactorbr /reactor.listenTCP(2022, factory)/prep/pbr /spanThis listens on TCP port 2022. nbsp;Whenever a connection arrives, /spanspanfactory/spanspannbsp;will be used to create a new protocol instance to handle it. nbsp;This factory is going to create a href=http://twistedmatrix.com/documents/10.2.0/api/twisted.conch.ssh.transport.SSHServerTransport.htmlSSHServerTransport/a instances, but don't worry about that for now./spanpspanSince the reactor is the mainloop that drives any Twisted-based application, nothing actually happens until we run it:/spanbr /code/code/pprereactor.run()/prep/pbr /With that, you now have an SSH server. nbsp;It's functional enough to prove its identity to clients that connect to it and even accept authentication attempts from them, but it will always reject their attempts. nbsp;So it's not the strongmost/strongnbsp;useful SSH server (but add a little logging and maybe you have a simple SSH honey pot!), but if you tune in next time, I'll demonstrate how it can be extended to support some more useful authentication options. nbsp;Here's a full code listing for this example:br /code/codeprefrom twisted.cred.portal import Portalbr /from twisted.conch.ssh.factory import SSHFactorybr /from twisted.internet import reactorbr /from twisted.conch.ssh.keys import Keybr /br /with open('id_rsa') as privateBlobFile:br / privateBlob = privateBlobFile.read()br / privateKey = Key.fromString(data=privateBlob)br /br /with open('id_rsa.pub') as publicBlobFile:br / publicBlob = publicBlobFile.read()br / publicKey = Key.fromString(data=publicBlob)br /br /factory = SSHFactory()br /factory.privateKeys = {'ssh-rsa': privateKey}br /factory.publicKeys = {'ssh-rsa': publicKey}br /factory.portal = Portal(None)br /br /reactor.listenTCP(2022, factory)br /reactor.run()br //prep/pdiv class=blogger-post-footerimg width=1 height=1 src=https://blogger.googleusercontent.com/tracker/4350363846291077818-3235296451388020829?l=as.ynchrono.us alt= //div
python  twisted  ssh 
march 2011 by rygwdn
dcramer/django-devserver - GitHub
A drop in replacement for Django's built-in runserver command. Features include:

An extendable interface for handling things such as real-time logging.
Integration with the werkzeug interactive debugger.
An improved runserver allowing you to process requests simultaneously.

devserver screenshot
debugging  development  django  localhost  python 
march 2011 by rygwdn
Kenneth Reitz – Requests: Python HTTP Module (That Doesn’t Suck)
I’m happy to announce the release of my latest Python module: Requests.

One of the most frustrating experiences I have to face, as a Python developer, is the pain of making HTTP requests. The available APIs are insane. I have to look up everything that I want to do and constantly refer to the documentation. I skip writing fun tools from time to time, simply because I don’t want to go through the pain of making a simple PUT request with Basic Authentication.

Things shouldn’t be this way. Not in Python.

Enter Requests

import requests

Let’s check out the awesome new Convore API.

>>> r = requests.get('https://convore.com/api/account/verify.json')
>>> r.status_code
401

Oops, we need to be authenticated. Let’s add our credentials and try again.

>>> conv_auth = requests.AuthObject('requesttest', 'requesttest')
>>> r = requests.get('https://convore.com/api/account/verify.json', auth=conv_auth)
>>> r.status_code
200

It worked! Let’s poke around.

>>> r.headers['content-type']
'application/json'
 
>>> r.content
'{"username": "requeststest", "url": "/users/requeststest/", "id": "9408", "img": "censored-long-url"}'

Requests is capable of much more than this, but the API stays this simple throughout.

Features

Requests strives to be as simple yet powerful as possible. It’s designed to fit the 90% use case. You might not be able to add proxies, client caching, and raw socket access, but you will have everything you need most of the time:

Extremely simple GET, HEAD, POST, PUT, DELETE requests
Receive simple response header dictionary, content, and status code
Module-level HTTP Auth registry
Eventlet/Greenlet support

The API is quite simple:

5 main functions: get(), head(), post(), put(), and delete()
Attach params/data
Add HTTP Basic Authentication with a parameter
Add a dictionary of HTTP headers with a parameter
Add a CookieJar with a parameter
Attach File uploads with a parameter

Installation

To install Requests, simply:

$ pip install requests

Or, if you must:

$ easy_install requests

But, you really shouldn’t do that.

Roadmap

Requests currently supports CPython 2.5, 2.6, 2.7, and PyPy-c v1.4.

In the future, Python 3.x will be supported. There are no intentions to support Python 2.4 and older.

If you’d like to help or have a feature suggestion, fork me!

[Source on GitHub]
[Project on PyPi]
python 
february 2011 by rygwdn
Scaling Python Servers with Worker Processes and Socket Duplication | metachris.org
Developing servers that scale is usually quite tricky, even more so with Python and the absence of worker threads which can run on multiple cpu cores [1]. A possible solution are worker processes that duplicate the client’s socket, a technique that allows the workers to processes requests and send responses directly to the client socket. This approach is particularly useful for long lasting connections with more than one request per session.

You can skip the introduction to jump directly to the Python section. [Update: With this approach I was not able to cleanly close all sockets. Be sure to check with lsof.]

Basics: TCP Servers
A tcp server binds to a specific local port and starts listening for incoming connections. When a client connects to this port, the server accepts the connection and waits for incoming data. Data usually arrives in chunks, and the server tries to find the end of a request by looking for delimiter characters or by a specified length (which might be indicated in the first few bytes as in protocol buffers).

Often one request contains a number of values, which may have to be deserialized with a protocol such as json, protocol buffers, avro, thrift or any of the other established or self-invented serialization protocols. After deserializing the incoming bytestream, it can be processed.

Finally the server may or may not respond to the client’s request, and/or close to socket connection at any time.

c10k
Now, what happens if 10,000 clients try to connect to your server concurrently and start sending requests and, in particular, how will it impact the response time? The C10K Problem is a classic resource which discusses this exact situation and provides a few general approaches, which boil down to:

One thread, one client
One thread, many clients
Build the server code into the kernel

We usually don’t want to do the latter, and it’s a general advise to avoid running thousands of concurrent threads. Therefore the question becomes how to handle many clients within one thread.

Asynchronous Servers
For one thread to manage multiple clients, the sockets have to be handled in an asynchronous fashion. That is, the server provides the operating system with a list of file descriptors, and receives a notification as soon as any socket has an event (read, write, error) ready to process. Operating system calls that provide this functionality include select, poll, epoll and kqueue. This approach makes it possible to develop scalable single-threaded servers, such as Facebook’s tornado webserver (free software, written in Python).

The critical point is processing time per request. If one request takes 1 ms to process and send a response, a single threaded server will have a limit at 1,000 requests per second.

Distributing the Load
There are various approaches for distributing incoming connections in order to reach a higher number of concurrently processed requests.

Load balancers distribute incoming requests across servers, usually proxying all traffic in both directions.

Reverse proxies allow you to run multiple server process on different local ports, and distribute incoming connections across them. This works very well in particular for short lived connections such as HTTP requests. Well known reverse proxies include Nginx, Varnish and Squid. (Wikipedia).

Socket accept preforking is a technique that allows multiple processes to listen on the same port and, in turn, pick up new incoming connections and handle the client sockets independently. This works by having one process open the server socket and then forking itself, which copies all existing file descriptors for the children to use.

man fork: “… The child inherits copies of the parent’s set of open file descriptors.”

Socket accept preforking works very well; popular projects that successfully employ this approach include GUnicorn, Tornado and Apache.

Worker threads process requests in the background while the socket thread can get back to waiting for events from the operating system. They usually listen on a queue from the main thread and receive the client socket’s file descriptor or also the incoming bytestream, process it and send the response back to the client. Developers need to be very careful with locking issues when accessing shared state/variables.

Naturally, worker threads are one of the best explored ways of distributing the work and having the main thread concentrate on the socket events.
Python
Python restricts multiple threads to one Python interpreter instance, thereby forcing multiple threads to share a single cpu. Context switches between the threads take place at latest after 100 Python instructions. Inside Python the controlling mechanism is called the Global Interpreter Lock (GIL) [1].

In a server that means as soon as one worker thread uses the cpu, it steals processing time from the socket handling thread, which makes a traditional worker thread architecture unfeasible. But there is an alternative: worker processes.

[Update: With this approach I was not able to cleanly close all open sockets. Be sure to check with lsof.]

Duplicating Sockets across Processes
Given the ability to pickle a socket, worker processes can do the same thing as worker threads: receive a client’s socket file descriptor and possibly the bytestream, process it, and send a response without requiring a callback to the main socket handler process.

Sockets can be pickled and reconstructed with an almost hidden, undocumented feature of Python’s multiprocessing module: multiprocessing.reduction.

multiprocessing.reduction
I discovered this module recently through an inline comment in one of the examples in the multiprocessing docs, and wanted to find out more:

$ python
Python 2.6.6 (r266:84292, Sep 15 2010, 16:22:56)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import multiprocessing.reduction
>>> help(multiprocessing.reduction)

Help on module multiprocessing.reduction in multiprocessing:

NAME
multiprocessing.reduction

FILE
/usr/lib/python2.6/multiprocessing/reduction.py

MODULE DOCS

http://docs.python.org/library/multiprocessing.reduction

DESCRIPTION
# Module to allow connection and socket objects to be transferred
# between processes
#
# multiprocessing/reduction.py
#
# Copyright (c) 2006-2008, R Oudkerk --- see COPYING.txt
#

Not much information, but at least we know where to look: multiprocessing/reduction.py

And there it is, starting at line 122:

#
# Functions to be used for pickling/unpickling objects with handles
#

def reduce_handle(handle):
if Popen.thread_is_spawning():
return (None, Popen.duplicate_for_child(handle), True)
dup_handle = duplicate(handle)
_cache.add(dup_handle)
sub_debug('reducing handle %d', handle)
return (_get_listener().address, dup_handle, False)

def rebuild_handle(pickled_data):
address, handle, inherited = pickled_data
if inherited:
return handle
sub_debug('rebuilding handle %d', handle)
conn = Client(address, authkey=current_process().authkey)
conn.send((handle, os.getpid()))
new_handle = recv_handle(conn)
conn.close()
return new_handle

Example Code
Putting it all together, this is how to use the above functions to share sockets with another process in Python:

# Main process
from multiprocessing.reduction import reduce_handle
h = reduce_handle(client_socket.fileno())
pipe_to_worker.send(h)

# Worker process
from multiprocessing.reduction import rebuild_handle
h = pipe.recv()
fd = rebuild_handle(h)
client_socket = socket.fromfd(fd, socket.AF_INET, socket.SOCK_STREAM)
client_socket.send("hello from the worker process\r\n")

References

Python, Threads and the Global Interpreter Lock (GIL)
The C10K Problem
Global Interpreter Lock

Notes
Thanks to Brian Jones for reading drafts of this post.

Recommended Reading
               
concurrency  python 
february 2011 by rygwdn
Python Package Index : pdbpp 0.6
This module is an extension of the pdb module of the standard library. It is meant to be fully compatible with its predecessor, yet it introduces a number of new features to make your debugging experience as nice as possible.
debugging  pdb  python 
january 2011 by rygwdn
Python Keyring Lib
The Python keyring lib provides a easy way to access the system keyring service from python. It can be used in any application that needs safe password storage. It supports OSX, KDE, Gnome and Windows's native password storing services. Besides this, it is shipped with kinds of Python implemented keyring for the left environments.
encryption  python  security 
january 2011 by rygwdn
Practical guidelines for beautiful Python code | TurnKey Linux Blog
Every now and then Liraz and I find ourselves chatting about how much we love Python, but more so the lessons we have learned from coding, and how to apply them to create beautiful Python code. I've tried to refine some of the "lessons" into practical guidelines that I apply religiously to all new code I write, and the refactoring of old code I written.

When reading other peoples code it sometimes ties my mind into knots, and on occasions I want to pull my hair out in frustration and disgust.  That's not to say I'm perfect, but hopefully these guidelines will benefit others (and indirectly help reduce my hair loss).

I couldn't possibly include everything I wanted to in one post, so this will be the first, and more will follow...
 

#1 - OO structure == Well defined mental concepts

Object Orientated structure should always map to well defined mental concepts in the problem domain. If you don't have a well defined mental model of the problem domain, start with that. Class Responsibility Collaboration (CRC) cards are really useful in this.

 

Basically you need a sketch of the architecture. What parts does your system have, what are their names, what does each part do? What parts is that part made out of? How do the parts interact with each other?

 

You can save quite a bit of effort if you come up with a good architecture up-front, but sometimes it may be easier to start without and figure it out a bit later after you have a bit more knowledge about the problem you are solving.

 

The rule is that the sooner you refine your architecture, the better. It is easy to dig yourself further and further into a complexity hole that makes restructuring very difficult later on. So do that as early as possible.

 

Refining the architecture is part of the "refactor mercilessly" rule.

 

#2 - Leverage built-in Python types

It is often a good idea to build on top of built-in Python types or at least emulate them.

 

The big advantage in building on top of Python's conventions is that you get to re-use Python's abstractions instead of reinventing your own.  Python is famous for its minimal elegance, and the language designers take great care making small beautiful data structures. Making your code structures more Pythonic is a pretty good way to leverage the built-in elegance of the language while saving yourself quite a bit of work (inventing good abstractions is hard!).

 

If you've never inherited from a built-in data type, experiment with a small test case in a throw away script. This way you don't mangle your current project and you can be as experimental as you like.

 

Tip: Construction can be a bit trickier than a normal object if your initialization interface is different from the built-in type. In that case you'll need to override the __new__ magic method as well to call the __new__ constructor of the base class with different arguments than your __new__ is called.

 

For example:

class IntField(int):
def __new__(cls, val, name):
return int.__new__(cls, val)

def __init__(self, val, name):
self.name = name
self.val = val

int.__init__(self, val)

 

As for emulating built in types, Python provides magical method names for emulating any built-in behavior. See special method names for reference.

 

#3 - Use the class namespace, Luke!

Use the class namespace when defining class-level attributes and the instance namespace (which inherits from the class namespace on initialization) when defining instance level attributes.

 

For example, constants are always class level attributes because they are shared by all instances. On a practical level there are two reasons to do this:


Enable inheritance - you can't override instance level attributes, only class level attributes.

Readability - code is communication. Setting an attribute at the class level or instance level is making a statement about the nature of that attribute, which makes the code easier to read and understand.
 

#4 - staticmethod vs. classmethod vs. regular method

Static methods are simpler and more readable than class methods because they don't have access to the class attribute, so it's easier to determine its input (i.e., arguments), and output (i.e., the return value).

 

Class methods are simpler than regular methods because the convention is that you don't usually manipulate attributes in the class namespace the way you would manipulate attributes in the instance namespace. So when you call a class method you know its not going to be sneaking around behind you back reading or writing to any instance level attributes set by other methods. The input for a class method is therefore the arguments it receives + class level constants. Its output is the return value.

 

A regular method is the most complex and easy to abuse type of method.  Its harder for the programmer to see what the inputs for the method are, and what its output is.

 

Lets consider the following case: imagine a class in which all private methods are called without arguments and do not return output values.  Instead, the private methods freely read and write to any attributes in self.

 

The problem with such a pattern, is that all methods have now become entangled in a spaghetti like dependency structure. The instance namespace in effect behaves like a global namespace and it becomes very difficult to understand methods in terms of input and output.  Furthermore, the boundaries between methods can easily become fuzzy and unclear.

 

You must never ever do this (I have in the passed, and I'm ashamed). If your code exhibits this pattern, go fix it, now!

 

Private methods have input arguments and return values for a reason. The instance namespace must only be used for instance level attributes which need to persist after a public method has been called.

 

Until it becomes second nature, these rules should help:


If a method doesn't need access to instance level attributes then it should be a class method, not a regular method.

If a method doesn't need access to class level attributes then it should be a static method, not a class method.
 

The Python Paradox

Paul Graham wrote an interesting essay entitled The Python Paradox. It's a quick read - take a look.
code  programming  python  style  tips 
january 2011 by rygwdn
Subgraph enumeration
This is part 1 of a series on subgraph enumeration.

Simple subgraph enumeration algorithm
Faster subgraph enumeration

The final code is available for download. However, bear in
mind that the code in part 2 is about 3.5 times faster.

Graph enumeration (not the point of this article)

How many molecules exist with N atoms? That has no easy answer. A
related graph theory question is "how many simple graphs exist with N
nodes" but those include chemically impossible solutions. This is a
well-studied field. If you want to go down that path, some pointers
are A000602 from the Online
Encyclopedia of Integer Sequences ("Number of n-node unrooted quartic
trees; number of n-carbon alkanes C(n)H(2n+2) ignoring
stereoisomers"), Brendan McKay's with with graph enumeration,
including canonicalization with nauty. (Downstream
implementations include N.I.C.E.
in Sage with specializations in
the canonicalization algorithm in InChI and in Indigo. A paper by
Hartke
and Radcliffe also explains the McKay algorithm.) For something
more chemistry specific, see Mick Kappler's GenSmi
project "to construct all possible compounds containing carbon,
nitrogen, and oxygen with a molecular weight between 150 and 250" and
Orion Jankowski's recent example code of simple
enumeration using Haskell.

It's a very interesting topic, but as far as I can tell it's mostly
theoretical. People conjecture about how enumeration might suggest new
drug-like candidates, but I've not heard of any success along those
lines. If you know of anything, please let me know.

Subgraph enumeration

Instead, I'm going to look at subgraph enumeration. That is, given a
structure, I'll enumerate all its unique subgraphs up to a given size,
and I'll count the number of times each unique subgraph exists.

My reason for doing this comes from my work with fingerprints. The
MACCS keys and the CACTVS substructure keys were made with human
insight into which patterns make good filters. I'm curious if I can
use clustering, a genetic algorithm, or some other means to identify a
set of substructures which are more effective at filtering. I have
access to the target data (eg, PubChem), and I suspect that knowledge
will be helpful.

Subgraph enumeration is not a new subject either, but I haven't come
across much work using it, at least not in chemistry. There are a
couple of papers by Gerta Rücker and Christoph Rücker from 2001:

On
Finding Nonisomorphic Connected Subgraphs and Distinct Molecular
Substructures

A computer program is introduced that first generates
all connected subgraphs and then uses a combination
of well-discriminating graph invariants to eliminate
duplicates.

and

Substructure,
Subgraph, and Walk Counts as Measures of the Complexity of Graphs
and Molecules

Construction of all substructures and subgraphs is
computationally demanding; therefore, two alternatives
are pointed out for the treatment of large sets of
compounds: (i) Often it will suffice to consider
counts of substructures/subgraphs up to a certain
number of edges only, information which is provided
by the program much more rapidly.

but I didn't actually read those papers. (Hmm, should walk up the hill
to the library.)

Initial subgraph enumeration algorithm

Graph algorithms can be tricky so I'm going to start with the simplest
algorithm I can think of. This is deliberately not a fast one - that
will be in my next essay. I'll develop this algorithm first then use
it to cross-check the faster but more complicated one.

I'll start with a collection of "seeds". Each seed is a graph, and
I'll grow the seeds using a bond to make a new seed, which I can use
for future growth.

seeds = an empty collection
For each atom in the molecule:
Mke a new seed with that atom and no bonds
Add the seed to the collection of seeds

While there are seeds:
Select and remove a seed from the collection
For each bond in the molecule:
If the bond is in the seed then:
Skip this bond
Else if the neither atom of the bond is in the seed then:
Skip this bond
Else if (one of the bond atoms is not in the seed and
the seed is at maximum size) then:
Skip this bond
Otherwise:
Create a new seed which is the old seed plus this new bond
Add the new seed to the collection

I think you can see that this will work. There will be duplicates
because if I start from "C#N" with the seeds "C" and "N" then I'll
grow the "C" to "C#N" and I'll grow the "N" to "N#C", but this is a
good first pass.

I know the OpenEye tools the best so I'll use OEChem to implement the
algorithm. There's nothing toolkit-specific about what I'm doing.

from openeye.oechem import *

class Seed(object):
def __init__(self, atoms, bonds):
self.atoms = atoms
self.bonds = bonds

# Add a new bond to the seed when both end atoms are already in the seed
def new_seed_with_bond(seed, bond):
#print "Add bond", bond.GetIdx()
assert bond not in seed.bonds
atoms = seed.atoms
bonds = seed.bonds.copy()
bonds.add(bond)
return Seed(atoms, bonds)

# Add a new atom to the seed, and the bond which connects it to the seed
def new_seed_with_atom_and_bond(seed, atom, bond):
#print "Add atom and bond", atom.GetIdx(), bond.GetIdx()
assert atom not in seed.atoms
assert bond not in seed.bonds
atoms = seed.atoms.copy()
bonds = seed.bonds.copy()
atoms.add(atom)
bonds.add(bond)
return Seed(atoms, bonds)

def print_seed(seed):
for atom in seed.atoms:
print atom.GetIdx(), OEGetAtomicSymbol(atom.GetAtomicNum())
for bond in seed.bonds:
if bond.IsAromatic():
bondtype = "~"
else:
bondtype = "X-=#$"[bond.GetOrder()]
print bond.GetBgnIdx(), bond.GetEndIdx(), bondtype, bond.GetIdx()

seed_count = 0
def report_seed(seed):
global seed_count
seed_count += 1
print "=== New seed (%d) ===" % seed_count
print_seed(seed)
print "==== end ==="


mol = OEGraphMol()
## Various test cases
#OEParseSmiles(mol, "SC=N")
#OEParseSmiles(mol, "SC=N.[Cl-]")
OEParseSmiles(mol, "S1C=N1")
#OEParseSmiles(mol, "c1ccccc1O")

# Maximum number of atoms to allow
k = 10
assert k >= 1, "need a bit more work to handle that case"

# Start with an empty collection
seeds = []

# Set up the initial set of seeds
for atom in mol.GetAtoms():
# Make a new seed with that atom and no bonds
seed = Seed(set([atom]), set())
report_seed(seed)
# Add the seed to the collection of seeds
seeds.append(seed)

# While there are seeds
while seeds:
# Select and remove a seed from the collection
seed = seeds.pop()

# For each bond in the molecule
for bond in mol.GetBonds():
# If the bond is in the seed then skip this bond
if bond in seed.bonds:
continue

new_atom = None
# If neither atom of the bond is in the seed then skip this bond
if bond.GetBgn() not in seed.atoms:
if bond.GetEnd() not in seed.atoms:
continue
new_atom = bond.GetBgn()
else:
if bond.GetEnd() not in seed.atoms:
new_atom = bond.GetEnd()

# If one of the bond atoms is not in the seed
# and the seed is at maximumize, then skip this bond
if new_atom is not None:
if len(seed.atoms) == k:
continue
# Create the new seed with this atom and bond
new_seed = new_seed_with_atom_and_bond(seed, new_atom, bond)
else:
# Create a new seed where the atoms are already in the
# seed but the bond is new
new_seed = new_seed_with_bond(seed, bond)

# Add the new seed to the collection
report_seed(new_seed)
seeds.append(new_seed)

I did purely manual testing to see if the algorithm worked. You can
see traces of my test cases in the different OEParseSmiles calls. I
did have bugs in my original code, like a report_seed(seed) instead of
report_seed(new_seed). My experience with these sorts of algorithms is
that the manual testing I did is good enough to catch the likely
errors. At this point I should turn the manual tests into real tests,
but I'm going to hold off on that until I have a better way to
represent the results.

The uncommented test case is "S1C=N1". I chose that because each atom
has a different symbol, one of the bonds is different than the others,
and because getting the cycle right is important. When I run the above
code I get 33 different subgraphs:

=== New seed (1) ===
0 S
==== end ===
=== New seed (2) ===
1 C
==== end ===
=== New seed (3) ===
2 N
==== end ===
=== New seed (4) ===
2 N
0 S
0 2 - 0
==== end ===
=== New seed (5) ===
2 N
1 C
1 2 = 2
==== end ===
=== New seed (6) ===
2 N
1 C
0 S
1 2 = 2
0 2 - 0
==== end ===
=== New seed (7) ===
2 N
1 C
0 S
1 2 = 2
0 1 - 1
==== end ===
=== New seed (8) ===
2 N
1 C
0 S
1 2 = 2
0 1 - 1
0 2 - 0
==== end ===
=== New seed (9) ===
2 N
1 C
0 S
1 2 = 2
0 2 - 0
0 1 - 1
==== end ===
=== New seed (10) ===
2 N
0 S
1 C
0 2 - 0
0 1 - 1
==== end ===
=== New seed (11) ===
2 N
0 S
1 C
0 2 - 0
1 2 = 2
==== end ===
=== New seed (12) ===
2 N
0 S
1 C
0 2 - 0
1 2 = 2
0 1 - 1
==== end ===
=== New seed (13) ===
2 N
0 S
1 C
0 2 - 0
0 1 - 1
1 2 = 2
==== end ===
=== New seed (14) ===
1 C
0 S
0 1 - 1
==== end ===
=== New seed (15) ===
1 C
2 N
1 2 = 2
==== end ===
=== New seed (16) ===
1 C
2 N
0 S
1 2 = 2
0 2 - 0
==== end ===
=== New seed (17) ===
1 C
2 N
0 S
1 2 = 2
0 1 - 1
==== end ===
=== New seed (18) ===
1 C
2 N
0 S
1 2 = 2
0 1 - 1
0 2 - 0
==== end ===
=== New seed (19) ===
1 C
2 N
0 S
1 2 = 2
0 2 - 0
0 1 - 1
==== end ===
=== New seed (20) ===
1 C
0 S
2 N
0 1 - 1
0 2 - 0
==== end ===
=== New seed (21) ===
1 C
0 S
2 N
0 1 - 1
1 2 = 2
==== end ===
=== New seed (22) ===
1 C
0 S
2 N
0 1 - 1
1 2 = 2
0 2 - 0
==== end ===
=== New seed (23) ===
1 C
0 S
2 N
0 1 - 1
0 2 - 0
1 2 = 2[…]
chemistry  graph  python  algorithms 
january 2011 by rygwdn
Sneak Preview of New Features in PyFilesystem 0.4
There have been some pretty exciting developments in PyFilesystem since version 0.3 was released – Ryan Kelly and myself have been hard at work, and there have been a number of excellent contributions to the code base from other developers. Version 0.4 will be released some time in January, but I'd like to give you a preview of some new features before the next version lands.

Pyfilesystem is a Python module that provides a simplified common interface to many types of filesystem.

It is now possible to open any of the supported filesystems from a URL in this format, which makes it very easy to specify a filesystem (or individual file) from the command line or a config file. Here's a quick example that opens a bunch of quite different filesystems:

from fs.opener import fsopendir
projects_fs = fsopendir('/projects')
zip_fs = fsopendir('zip:///foo/bar/baz.zip')
ftp_fs = fsopendir('ftp://ftp/mozilla.org')
sftp_fs = fsopendir('sftp://example.org')

If you used Pyfilesystem in your application you could trivially change where your files are physically located, or where you save generated files to.

You can also open a file directly without the need to explicitly open the filesystem it is contained within, with the fsopen function, e.g.:

from fs.opener import fsopen
print fsopen('zip:///foo/bar/baz.zip!dir/somefile.txt').read()
print fsopen('ftp://ftp.mozilla.org/pub/README').read()

fsopen is very similar to the builtin open method and will return a file-like object of some kind. In fact, if you pass in a system path it works as open would (although the exceptions will be an instance of fs.errors.FSError rather than IOError).

Because of this similarity with the builtin, fsopen could be used to shadow open, and instantly add the ability for an application to open files on mediums other than a system drive. This is all it takes:

from fs.opener import fsopen as open

FS Commands

Version 0.4 also adds a number of applications that mirror some of the standard command line apps, but extend their functionality with FS URLs. For example, fsls, functions just like the ls command, but works with any of the supported filesystems:

will@will-linux:~$ fsls .will@will-linux:~$ fsls zip://myzipwill@will-linux:~$ fsls ftp://ftp.mozilla.org/pubwill@will-linux:~$ fsls sftp://user:pass@securesite.com/home/will

You can also copy and move files between filesystems with fscp and fsmv, which work in a very similar manner to their cp and mv counterparts, with a few extensions such as multi-threaded copying (great for network filesystems) and an ascii progress bar. The following example copies all the .py files in my projects directory to zip file on an ftp server, and displays an progress bar to boot.

will@will-linux:~$ fscp ~/projects/*.py zip:ftp://will:password@example.org/backups/code.zip

Then there is the fscat command that writes a file in a filesystem to the terminal. The following example displays a python file in the zip file that we created with the previous command:

will@will-linux:~$ fscat zip:ftp://will:password@example.org/backups/code.zip!pythonrocks.py

The other commands fsmkdir, fsrm and fstree work as you may expect.

Serving

The fsserve command adds the ability to serve any of the supported filesystems over a number of protocols. The default is to serve http – in effect creating a webserver. The following is all it takes to serve the contents of your current working directory:

will@will-linux:~$ fsserve

Now if you point your browser at http://127.0.0.1 you will see a web-page with the contents of your current working directory (or a index.html file if present). It's not the most bullet-proof of web servers, but handy if you quickly want to serve some files. Naturally, fsserve works with any filesystem you pass to it. You could, for instance, serve the contents of a zip file without ever explicitly unzipping it, or create an ftp to http gateway by serving an ftp filesystem. The following command creates a ftp to http gateway for ftp.mozilla.org:

will@will-linux:~$ fsserve ftp://ftp.mozilla.org

You also have the option of serving a filesystem over SFTP (Secure FTP), or by RPC (Remote Procedure Call). Either of these two methods expose all the functionality of the remote filesystem, so you could run a server on one machine and create/move/copy/delete files from another machine on the network (or internet). For example, the following would serve the current working directory on localhost, port 3000:

will@will-linux:~$ fsserve -t rpc -p 3000

You can then connect to that server from another machine on the network. Assuming my local IP is 192.168.1.64 the following would display the directory structure from another machine on my network:

will@will-linux:~$ fstree rpc://192.168.1.64:3000

Mounting

Any of the filesystems can be mounted on the OS with the fsmount command, which uses FUSE on Linux or DOKAN on Windows. The advantage of this is that the filesystems exposed in Python can be used in any application, and browsed with Explorer or Nautilus. The following creates a ram drive on Linux:

will@will-linux:~$ fsmount mem:// mem

Or on Windows:

C:\> fsmount mem:// M

Get the Code

There is no documentation online for the new features as yet, but if you are a brave soul and want to experiment with any of the above commands then download the code from SVN and run python setup.py install. The command line apps all have a -h which which displays help on the various options.

Bear in mind that these commands are still somewhat experimental, and some of these commands have the capacity to delete files – so be careful. That said, I'm confident to use them for my day-to-day work.

Please see the projects page page if you want to report bugs or discuss Pyfilesystem with myself and the other developers.
python  fs  filesystem 
january 2011 by rygwdn
Trails in a Langscape » Blog Archive » Fuzzy string matching - Projects and projections
A damn hot algorithm

I found the following article written by Nick Johnson about the use of finite state machines for approximate  string matches i.e. string matches which are not exact but bound by a given edit distance. The algorithm is based on so called “Levenshtein automatons”. Those automatons are inspired by the construction of the Levenshtein matrix used for edit distance computations. For more information start reading the WP-article about the Levenshtein algorithm which provides sufficient background for Nicks article.

I downloaded the code from github and checked it out but was very stunned about the time it took for the automaton construction once strings grow big. It took almost 6 minutes on my 1.5 GHz notebook to construct the following Levenshtein automaton:

k = 6
S  = "".join(str(s) for s in range(10)*k)
lev = levenshtein_automata(S, k).to_dfa()

The algorithm is advertised as a “damn cool algorithm” by the author but since the major effect on my notebook was producing heat I wonder if “cool” shouldn’t be replaced by “hot”?

In the following article I’m constructing an approximate string matching algorithm from scratch.

Recursive rules for approximate string matching

Let ‘S be a string with len(S)=n and k a positive number with k<=n. By “?” we denote a wildcard symbol that matches any character including no character ( expressing a contraction ). Since S has length n we can select arbitrary k indexes in the set {0,…,n-1} and substitute the characters of S at those indexes using a wildcard symbol. If for example (S = “food” , k = 1 and index = 2) we get “fo?d”.

We describe the rule which describes all possible character substitutions in “food” like this:
pattern(food, 1) = ?ood | f?od | fo?d  | foo?
Applying successive left factorings yields:
pattern(food, 1) = ?ood | f  ( ?od | o (?d  | o? ) )
This inspires a recursive notation which roughly looks like this:
pattern(food, 1) = ?ood | f pattern(ood, 1)
or more precisely:
pattern(c, 1) = ?
pattern(S, 1) = ?S[1:] | S[0] pattern(S[1:], 1)
where we have used a string variable S instead of the concrete string “food”.

When using an arbitrary k instead of a fixed k = 1 we get the following recursive equations:
pattern(c, k) = ?
pattern(S, k) = ?pattern(S[1:], k-1) | S[0] pattern(S[1:], k)

Consuming or not consuming?

When we try to find an efficient implementation for the pattern function described above we need an interpretation of the ? wildcard action. It can consume a character and feed the rest of the string into a new call of pattern or skip a character and do the same with the rest. Since we cannot decide the choice for every string by default we eventually need backtracking but since we can memoize intermediate results we can also lower efforts. But step by step …

The basic idea when matching a string S1 against a string S2 is that we attempt to match S1[0] against S2[0] and when we succeed, we continue matching S[1:] against S2[1:] using the same constant k. If we fail, we have several choices depending on the interpretation of the wildcard action: it can consume a character of S2 or leave S2 as it is. Finally S1 and S2 are exchangeable, so we are left with the following choices:

fuzzy_compare(S1, S2[1:], k-1)
fuzzy_compare(S2, S1[1:], k-1)
fuzzy_compare(S1[1:], S2[1:], k-1)

All of those choices are valid and if one fails we need to check out another one. This is sufficient for starting a first implementation.

A first implementation

The following implementation is a good point to start with:

def fuzzy_compare(S1, S2, k, i=0, j=0):
N1 = len(S1)
N2 = len(S2)
while True:
if N1-i<=k and N2-j<=k:
return True
try:
if S1[i] == S2[j]:
i+=1
j+=1
continue
except IndexError:
return False
if k == 0:
return False
else:
if fuzzy_compare(S1, S2, k-1, i+1, j):
return True
if fuzzy_compare(S1, S2, k-1, i, j+1):
return True
if fuzzy_compare(S1, S2, k-1, i+1, j+1):
return True
return False

The algorithm employs full backtracking and it is also limited to medium sized strings ( in Python ) because of recursion. But it shows the central ideas and is simple.

A second implementation using memoization

Our second implementation still uses recursion but introduces a dictionary which records all (i,j) index pairs that have been encountered and stores the current value of k. If the algorithm finds a value k’ at (i,j) with k’<=k it will immediately return False because this particular trace has been unsuccessfully checked out before. Using ann x n matrix to memoize results will reduce the complexity of the algorithm which becomes O(n^2) where n is the length of the string. In fact it will be even O(n) because only a stripe of width 2k along the diagonal of the (i,j)-matrix is checked out. Of course the effort depends on the constant k.

def is_visited(i, j, k, visited):
m = visited.get((i,j),-1)
if k>m:
visited[(i,j)] = k
return False
return True
 
def fuzzy_compare(S1, S2, k, i = 0, j = 0, visited = None):
N1 = len(S1)
N2 = len(S2)
 if visited is None:
visited = {}
while True:
if N1-i<=k and N2-j<=k:
return True
try:
if S1[i] == S2[j]:
visited[(i,j)] = k
i+=1
j+=1
continue
except IndexError:
return False
if k == 0:
return False
else:
if not is_visited(i+1, j, k-1, visited):
if fuzzy_compare(S1, S2, k-1, i+1, j, visited):
return True
if not is_visited(i, j+1, k-1, visited):
 if fuzzy_compare(S1, S2, k-1, i, j+1, visited):
return True
if not is_visited(i+1, j+1, k-1, visited):
if fuzzy_compare(S1, S2, k-1, i+1, j+1, visited):
return True
return False

A third implementation eliminating recursion

In our third and final implementation we eliminate the recursive call to fuzzy_compare and replace it using a stack containing tuples (i, j, k) recording the current state.

def is_visited(i, j, k, visited):
m = visited.get((i,j),-1)
if k>m:
visited[(i,j)] = k
return False
return True
 
def fuzzy_compare(S1, S2, k):
N1 = len(S1)
N2 = len(S2)
i = j = 0
visited = {}
stack = []
while True:
if N1-i<=k and N2-j<=k:
return True
try:
if S1[i] == S2[j]:
visited[(i,j)] = k
i+=1
j+=1
continue
except IndexError:
if stack:
i, j, k = stack.pop()
else:
return False
if k == 0:
if stack:
i, j, k = stack.pop()
else:
return False
else:
if not is_visited(i+1, j, k-1, visited):
stack.append((i,j,k))
i, j, k = i+1, j, k-1
elif not is_visited(i, j+1, k-1, visited):
stack.append((i,j,k))
i, j, k = i, j+1, k-1
elif not is_visited(i+1, j+1, k-1, visited):
i, j, k = i+1, j+1, k-1
elif stack:
i, j, k = stack.pop()
else:
return False

This is still a nice algorithm and it should be easy to translate it into C or into JavaScript for using it in your browser. Notice that the sequence of if … elif branches can impact performance of the algorithm. Do you see a way to improve it?

Checking the algorithm

When D is the Levenshtein distance between two strings S1 and S2 then fuzzy_compare(S1, S2, k) shall be True for k>=D and False otherwise. So when you want to test fuzzy_compare compute the Levenshtein distance and check fuzzy_compare with the boundary values k = D and k = D-1.

def levenshtein(s1, s2):
l1 = len(s1)
l2 = len(s2)
matrix = [range(l1 + 1)] * (l2 + 1)
for zz in range(l2 + 1):
matrix[zz] = range(zz,zz + l1 + 1)
for zz in range(0,l2):
for sz in range(0,l1):
if s1[sz] == s2[zz]:
matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1,
matrix[zz][sz+1] + 1,
matrix[zz][sz])
else:
matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1,
matrix[zz][sz+1] + 1,
matrix[zz][sz] + 1)
return matrix[l2][l1]

For exhaustive testing we define a set of strings as follows:

Given a prescribed n we define the set of strings of length = n which consists of “a” and “b” characters only. The number of those strings is 2^n. If we consider all pairs of strings in that set we get 2^(2n) of such pairs. Of course we could exploit symmetries to remove redundant pairs but in order to keep it simple we examine only small strings e.g. n = 6 which leads to 4096 pairs altogether.

def string_set(S = None, k = 0, strings = None, n = 6):
if S is None:
strings = []
S = ["a"]*n
strings.append("".join(S))
for i in range(k, n):
S1 = S[:]
S1[i] = "b"
strings.append("".join(S1))
string_set(S1, i+1, strings, n)
return strings
 
def string_pairs(n):
L1 = string_set(n=n)
pairs = []
for i in range(len(L1)):
for k in range(1, n+1):
L2 = string_set(n=k)
for j in range(len(L2)):
pairs.append((L1[i],L2[j],levenshtein(L1[i], L2[j])))
pairs.append((L2[j],L1[i],levenshtein(L2[j], L1[i])))
return pairs

Our test function is s[…]
python  matching  strings  algorithms 
december 2010 by rygwdn
Mobile Web Development with Django | Imaginary Landscape Blog | Chicago Django and Python Web Development
If you've ever looked into creating your first Django project targeted at mobile devices, you were probably quick to realize that there is no be all, end all solution. Mobile development decisions have to be made with regards to handheld device detection, redirects, how to deal with desktop vs ...
django  mobile  python  development  article  from delicious
december 2010 by rygwdn
Finding C reference leaks using the gc module | Kristján's cosmic percolator
A while back I got a defect assigned to me complaining that clicking a button on our server backend web pages caused the server to freeze.  The link was on our “python” page, containing various tools and information on the python interpreter embedded in EVE.  The link itself was, interestingly enough, called “Leaky C++”.

Looking at the source code I saw code similar to this:

import gc
def get_leaking_objects_naive():
    all = gc.get_objects()
    result = []
    find = [all]
    for i in xrange(len(all)):
        if gc.get_referrers(all[i]) == find:
            result.append(all[i])
    return result
The idea is to find all objects that do not appear to be referenced by any other object and produce a report on those objects.  The problem with this code is, however, that it is O(N^2).   When there are sufficient objects in the system, this code takes forever to run.

I disabled this code and thought nothing more of it.  But recently, it occurred to me that the idea might not be too bad and whether there might be a better way to do this.  It turns out there is.  Instead of finding the referrers in this manner, we use another gc method, gc.get_referents() that returns objects immediately referred to by an object.  This is an O(1) operation and by repeately using it and eliminating objects that are such targets we can weed out everything that has referrers.  We then  end up with the list of objects that have no referrers, in one fell O(N) swoop:

def get_leaking_objects():
#create a dict of ids to objects
all = dict((id(i), i) for i in gc.get_objects())

#find all the objects that aren't referred to by any other object
ids = set(all.keys())
for i in all.values():
ids.difference_update(id(j) for j in gc.get_referents(i))

#this then is our set of objects without referrers
return [all[i] for i in ids]
This turns out to work surprisingly well.   Combined with a object hierarchy browser, this allows us to find suspicious objects, identify them and thus home in on the C code that may be causing trouble.

There is a caveat to this, and it is that gc.get_objects() and gc.get_referents() are documented to only return objects that can be part of a reference cycle.  So your leaking strings and integers won’t show up using this tool.

update:
I just made two improvements to the code.

It is faster and uses less memory to skip creating a dict out of the objects.
We must make sure not to leave cyclic references lying about.  the “all” variable contains the current function frame so unless we clear it, the frame and its contents (including “all”) isn’t released immediately.

def get_leaking_objects2():
all = gc.get_objects()
try:
ids = set(id(i) for i in all)
for i in all:
ids.difference_update(id(j) for j in gc.get_referents(i))
#this then is our set of objects without referrers
return [i for i in all if id(i) in ids]
finally:
all = i = j = None #clear cyclic references to frame
code  debugging  memory  python  leaks  from delicious
december 2010 by rygwdn
« earlier      

related tags

3d  addons  ai  ajax  algorithm  algorithms  amazon  animation  ansi  apache  api  appengine  apple  apps  article  ascii  async  audio  automation  badges  bash  bayesian  bazaar  bdd  benchmark  benchmarks  blog  blogs  book  books  boost  borg  bridge  browser  bugs  bugtracking  build  buildbot  buildout  bundle  business  bzr  c  c++  cache  cachegrind  calculator  calendar  charts  cheatsheet  cheesecake  chemistry  cherrypy  cli  cloud  cluster  cms  cocoa  code  codereview  coding  collaboration  color  commands  comparison  compiler  completion  complexity  computer  concurrency  configuration  console  conversion  convert  converter  cool  coroutines  cpp  cpu  crash  crawler  cron  css  ctypes  cucumber  cython  daemon  data  database  datastructures  date  db  dba  dbus  deb  debian  debug  debugger  debugging  deploy  deployment  design  designpatterns  desktop  development  diff  distributed  distribution  distutils  diy  django  documentation  dot  dotnet  dvcs  easy_install  editor  education  emacs  email  embedded  emulator  encryption  engineering  english  esky  excel  exe  exercises  extension  extensions  fabric  facebook  filesystem  firebug  firefox  flash  flask  flatblocks  format  formatting  forms  framework  free  fs  fun  functional  funny  fuse  futures  gallery  game  gamedev  games  gaming  gdb  geek  generator  genetic  genetic-algorithms  git  github  gnome  gnome-do  go  google  grammar  graph  graphics  graphs  graphviz  grid  gtd  gtk  gui  guide  hack  hacking  haml  hardware  haskell  hdl  hg  homebrew  hosting  howto  html  html5  hudson  humor  ical  icalendar  ide  image  images  input  inspiration  install  installation  interesting  interview  intro  ios  iphone  ironpython  itcup  java  javascript  jit  jquery  js  json  jython  kids  language  languages  latex  lazy  leaks  learn  learning  libraries  library  libxml2  lifestream  linux  list  llvm  loadtesting  localhost  logging  mac  machinelearning  make  management  mapreduce  maps  matching  math  measure  memory  mercurial  merge  microsoft  middleware  migration  mobile  mock  modeling  modules  monad  monitoring  multitouch  navigation  network  networking  networks  nose  notes  nvidia  odt  office  oneliners  online  open-source  opencl  opencv  opengl  openoffice  opensource  optimization  options  oracle  osx  package  packages  packaging  parallel  parser  parsing  password  passwords  path  patterns  paver  pdb  pdf  pep  performance  photo  php  pil  pip  plone  plugin  plugins  presentation  process  productivity  profiling  programming  proj-haml  proj-mysite  proj-tracey  project  promises  proxy  publishing  py2app  py2exe  pycon  pygtk  pylint  pylons  pypi  pypy  pyramid  pytest  python  qt  queue  quickly  rails  rally  ram  random  rant  reader  realtime  refactoring  reference  regexp  registration  remote  repository  repoze  research  resources  rest  review  rss  rst  rst2pdf  rsync  ruby  s3  sage  sanitize  school  science  scipy  scm  scraping  screen  screencast  script  search  security  selenium  server  setup  shell  shipping  shuffle  signals  simulation  singleton  site  skeleton  snippets  social  software  sorting  sound  source  spider  spreadsheet  sql  sqlalchemy  sqlite  ssh  stackless  starter  statistics  stats  storage  strftime  strings  study  style  success  sysadmin  tablet  tabs  tagging  tdd  tech  technology  template  templates  templating  terminal  test  testing  tex  text  theme  themes  thread  threading  time  tips  todo  tomboy  toodledo  tool  tools  toread  tutorial  tutorials  twisted  typing  typography  ubuntu  ui  uml  unittesting  unix  unladen  update  updates  valgrind  vcs  vector  versioncontrol  versioning  vi  video  videos  vim  virtualenv  visualization  wacom  weave  web  webcam  webdesign  webdev  webhosting  webserver  wii  wiki  window  windowmanager  windows  wine  word  wordpress  work  writing  wsgi  x11  xls  xml  xpath  xslt  zsh 

Copy this bookmark:



description:


tags: