Python Native Interface

Foreword : If you wish to make comments on this proposal, please do so at http://jpype.blogspot.com/2006/09/pep-proposal-get-rid-of.html

It's been said before. For the average user, finding and building Python extensions, especially on Windows, is a nightmare. Even for the experienced extension author, distributing extensions can be a huge headache.

Why is that? The biggest problem, in my experience, is that each extension must be compiled for a specific version of Python, using the same compiler that was used to build that version of Python. This is not much of a problem on Unix, where adequate compilers are widely and freely available. It is a major headache on Windows, though. An extension author starting today who wants to target all recent versions of Python (say 2.3, 2.4 and now 2.5) must somehow get their hands on two versions of the Microsoft compiler that are no longer available. Even the latest version of Python depends on a version of the tools that is now unavailable. The alternative, using free or alternate compilers, is error prone and not quite correctly supported by distutils.

The solution

We must break the dependency of Python extensions on the Python version and on the compiler. Fortunately, the solution is pretty simple.

I will now describe a solution to this problem in several steps. Steps 1 and 2 are enough to achieve complete version and compiler independence. The further steps go beyond that, making the Python C extension interface more consistent and easier to use. While such enhancements would not be possible with the current C API because of backward-compatibility concerns, the break that this new interface represents is a good time to consider them.

Finally, before I get into specifics: this new interface can be implemented alongside the old one. Still, it would be better to try to phase out the older API, if only because we would not want to maintain two APIs.

Step 1 : Method table

Current extensions are compiler and Python version dependent because they are tied to the Python DLL. By directly using the functions exported from the Python runtime, they must share its C library and other internal structures.

Breaking this dependency involves pushing the Python functions to the extension at the moment it gets loaded. If this sounds outlandish, consider that this is exactly what the Java Native Interface does.

From this point on, the only interface between Python and the extensions is this C-compliant method table. There is no need to link the resulting extension against the Python import library. Thus any compiler that can generate a DLL can be used.

Every native method receives this function table as part of its parameters, including the module initialization function.
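As a rough illustration of the idea, here is a minimal sketch of such a table, assuming a tiny three-function API. Every name in it (PniApi, PniObject, ext_double and so on) is hypothetical, and the "runtime" is a toy stand-in for the interpreter so that the example can be compiled on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* All names below are hypothetical; none of this is an existing Python API. */

typedef struct PniObject PniObject;      /* opaque handle owned by the runtime */

/* The function table the runtime pushes to the extension at load time. */
typedef struct PniApi {
    int table_version;                           /* lets the table grow later */
    PniObject *(*int_from_long)(long value);     /* returns a new reference */
    long (*int_as_long)(PniObject *obj);
    void (*decref)(PniObject *obj);
} PniApi;

/* ---- Extension side: needs only the declarations above. ---- */

/* A native method: the table arrives as an explicit parameter. */
static PniObject *ext_double(const PniApi *api, PniObject *arg)
{
    assert(api != NULL && arg != NULL);
    return api->int_from_long(2 * api->int_as_long(arg));
}

/* ---- Runtime side: a toy stand-in for the interpreter. ---- */

struct PniObject { int refcount; long value; };

static PniObject *host_int_from_long(long value)
{
    PniObject *obj = malloc(sizeof *obj);
    if (obj != NULL) {
        obj->refcount = 1;
        obj->value = value;
    }
    return obj;
}

static long host_int_as_long(PniObject *obj) { return obj->value; }

static void host_decref(PniObject *obj)
{
    if (--obj->refcount == 0)
        free(obj);
}

static const PniApi host_api = {
    1, host_int_from_long, host_int_as_long, host_decref
};

/* What the runtime would do to invoke the native method.
   Returns -1 if an allocation fails (good enough for a demo). */
long pni_demo_double(long n)
{
    long out = -1;
    PniObject *arg = host_api.int_from_long(n);
    if (arg != NULL) {
        PniObject *res = ext_double(&host_api, arg);
        if (res != NULL) {
            out = host_api.int_as_long(res);
            host_api.decref(res);
        }
        host_api.decref(arg);
    }
    return out;
}
```

Note that the extension half never refers to a runtime symbol by name: everything it calls goes through the table, which is what would remove the link-time dependency on the Python DLL.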

This method table is not a completely new API. Rather, it simply repackages the current API in a new way. As such, it should be possible to create an extension for previous versions of Python that would act as a bridge between classic Python and PNI-style extensions.

What we gain

  • Python version and compiler independence for those extensions that do not define extension types.

What we lose

  • A few macros. Some of the current API is implemented as macros that call other functions in the API. Since those macros could no longer implicitly find the method table, they would need to be replaced with real functions. All the macros that simply manipulate internal fields can stay.

Migration path

While this proposal suggests keeping the old API around (at least for a while), we still need a clear migration solution for those authors who want to benefit from the new capabilities.

Since there is a one-to-one correspondence between the old and new APIs, it should be relatively easy to write a Python script that parses C/C++ sources and replaces calls to the old functions with calls through the new function table.

Extensions that have deep call hierarchies and access the Python API from deep in the hierarchy face a tougher problem. A stopgap solution might be to store the function table received at module initialization time in a global variable accessible from anywhere. A bit more analysis is required here.
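The global-variable stopgap could look roughly like the sketch below. The one-entry table and all function names are hypothetical, and the host side exists only so that the example can be exercised.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical one-entry table; a real one would carry the whole API. */
typedef struct PniApi {
    long (*long_add)(long a, long b);
} PniApi;

/* Stopgap: remember the table received at module initialization. */
static const PniApi *g_pni_api = NULL;

int ext_module_init(const PniApi *api)
{
    if (api == NULL)
        return -1;
    g_pni_api = api;
    return 0;
}

/* Deep in the call hierarchy: no table parameter, so use the global. */
static long ext_deep_helper(long x)
{
    assert(g_pni_api != NULL);   /* module init must have run first */
    return g_pni_api->long_add(x, 1);
}

/* A native entry point that reaches the API only through its helpers. */
long ext_entry_point(long x)
{
    return ext_deep_helper(ext_deep_helper(x));
}

/* Toy runtime side, only so the sketch can be exercised. */
static long host_long_add(long a, long b) { return a + b; }
static const PniApi host_api = { host_long_add };

const PniApi *host_get_api(void) { return &host_api; }
```

The obvious cost of this shortcut is that the extension can only ever talk to one table per process, which is part of why it is better seen as a migration aid than as the final design.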

API generation tools can easily be modified to generate code compatible with the new API.

Step 2 : Opaque python data types

A further step toward runtime independence is making the Python types (like PyObject) opaque data types that can only be manipulated through the Python API.

The only drawback to this is for those extensions that define extension types. It should still be possible to define an extension type completely in the C code. However that process would be radically different and a pain to migrate. Another solution might be to make it easier to define the type in Python and make only parts of it native.

Again, if this looks like a radical idea, look at Java. Its JNI has allowed the Java runtime to change significantly without requiring extensions to change, or even to be recompiled.
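In C, an opaque type is simply a struct that is declared but not defined in what the extension sees. The sketch below assumes hypothetical names (PniObject, pni_incref and friends) and a toy runtime; the point is only that the extension code compiles without knowing the object layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; this is not the existing Python C API. */

/* What an extension would see: a type with no visible fields. */
typedef struct PniObject PniObject;
int  pni_refcount(const PniObject *obj);
void pni_incref(PniObject *obj);
void pni_decref(PniObject *obj);

/* Extension code: writing obj->refcount here would not compile,
   because the layout of PniObject is unknown at this point. */
int ext_hold_and_release(PniObject *obj)
{
    int held;
    assert(obj != NULL);
    pni_incref(obj);
    held = pni_refcount(obj);    /* count while the extension holds it */
    pni_decref(obj);
    return held;
}

/* Runtime side: the layout is private and free to change between versions. */
struct PniObject {
    long gc_bookkeeping;   /* e.g. a field added in a later release */
    int refcount;
};

int  pni_refcount(const PniObject *obj) { return obj->refcount; }
void pni_incref(PniObject *obj) { obj->refcount++; }
void pni_decref(PniObject *obj) { obj->refcount--; }   /* toy: never frees */

/* Toy driver: the runtime creates an object and hands it to the extension. */
int pni_demo_opaque(void)
{
    PniObject obj = { 0, 1 };    /* one reference, held by the runtime */
    return ext_hold_and_release(&obj);
}
```

Adding or reordering fields in the struct changes nothing for the extension half, since it only ever passes the pointer back to the runtime.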

What we gain

  • Python version and compiler independence even for extensions that do define extension types.
  • Freedom to enhance the runtime data structures without impacting existing extensions.

What we lose

  • Simplicity of defining extension types. Although with further analysis, I believe we could come up with a way even simpler than what we have today.

Migration path

There is no easy migration path for this part. Extensions that define extension types will require major rework in order to work with the PNI-style API.

Once a solution is decided, we may again look into a script that would at least do part of the work.

Step 3 : Normalizing the API

So far I've only proposed repackaging the current API. Now I'm going to propose we change it.

This is no longer about extension independence. It's about making writing extensions easier and less error prone.

There are many warts in the current API. Why do some functions steal references to their parameters while others don't? Why do some functions return new references and others borrowed ones? Why have 5 flavors of Length, when PyObject_Length should work in all cases?

These are just examples off the top of my head. The point is, those subtle gotchas make writing extensions more difficult than it needs to be.

There may be valid reasons for having many versions of a function, but then they need to be clearly documented. Maybe calling PyList_Size is faster than PyObject_Length when you're sure you have a PyList. If so, it should be noted in the docs.

Also, maybe there are good reasons for some functions to behave differently with regard to references. While this difference in behavior is clearly documented, I believe a good naming convention for function names would allow someone to read code without having to refer to the docs every minute. Better yet, if possible, the behavior should be the same for every function.
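To make the naming idea concrete: in the current API, PyList_GetItem returns a borrowed reference while PySequence_GetItem returns a new one, and PyList_SetItem steals a reference while PyList_Append does not, yet nothing in the names says so. The sketch below shows one possible convention using suffixes. The suffixes, the function names and the toy object model are all assumptions made for illustration, not a proposal for the actual spelling.

```c
#include <assert.h>
#include <stddef.h>

/* Toy object model and hypothetical names, for illustration only. */
typedef struct Obj { int refcount; long value; } Obj;
typedef struct List { Obj *items[4]; size_t len; } List;

void Obj_IncRef(Obj *o) { o->refcount++; }
void Obj_DecRef(Obj *o) { o->refcount--; }   /* toy: never frees */

/* "Borrowed": the list keeps ownership; the caller must not release. */
Obj *List_GetItemBorrowed(const List *l, size_t i)
{
    assert(l != NULL);
    return i < l->len ? l->items[i] : NULL;
}

/* "New": the caller receives its own reference and must release it. */
Obj *List_GetItemNew(const List *l, size_t i)
{
    Obj *o = List_GetItemBorrowed(l, i);
    if (o != NULL)
        Obj_IncRef(o);
    return o;
}

/* "Steal": on success the list takes over the caller's reference. */
int List_AppendSteal(List *l, Obj *o)
{
    assert(l != NULL && o != NULL);
    if (l->len >= sizeof l->items / sizeof l->items[0])
        return -1;               /* toy list is full */
    l->items[l->len++] = o;
    return 0;
}

/* "Keep": the list takes a reference of its own; the caller keeps theirs. */
int List_AppendKeep(List *l, Obj *o)
{
    if (List_AppendSteal(l, o) != 0)
        return -1;
    Obj_IncRef(o);
    return 0;
}
```

With a convention like this, a reviewer can check reference handling at the call site: every result of a "New" function needs a matching release, and nothing passed to a "Steal" function may be released afterwards.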

What we gain

  • It becomes a lot easier to program against the C API. Fewer special cases lead to fewer memory problems, resulting in more stable extensions.

What we lose

  • We don't lose anything per se. Converting to the new API could be a real pain though.

Migration path

Migration cannot be easy. While we can still use the migration solutions proposed in steps 1 and 2, any call to a function whose behavior has changed would need to be converted by hand, to make sure that the surrounding code that relies on the old behavior is updated too.

Step 4 : Eliminating argument parsing and result formatting

One thing you often see when people talk about Python's speed is "write it in Python. Then rewrite the parts that need it (if any) in C".

Fair enough. Except that writing the C part is anything but simple. While nothing much can be done about the memory management aspect, argument parsing is an area that could easily improve.

What if we specified the signature of a method when we add it to the method table? What if the Python runtime took care of argument parsing and called a method that already has the right signature?

At the risk of repeating myself, this is exactly what Java's JNI does.

This could be done reasonably efficiently using libffi (which is part of ctypes, already folded into the standard library). This would certainly take care of some of the hurdles.

For the few cases where the callee needs to do its own parsing, because the regular PyArg_ParseTuple format is not descriptive enough, we can easily allow this.
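The sketch below illustrates the shape of this idea under heavy simplifications. It does not use libffi: instead of building a call from an arbitrary signature description, it assumes a small fixed set of signatures encoded as an enum. The Value type stands in for Python objects, and all names (PniMethodDef, pni_call and so on) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; a toy stand-in for Python values and the runtime. */
typedef enum { VAL_LONG, VAL_DOUBLE } ValueTag;
typedef struct Value {
    ValueTag tag;
    union { long l; double d; } u;
} Value;

/* Signatures the toy runtime knows how to unpack. */
typedef enum { SIG_LONG_LONG_RET_LONG, SIG_DOUBLE_RET_DOUBLE } PniSignature;

typedef void (*PniGenericFn)(void);

typedef struct PniMethodDef {
    const char *name;
    PniSignature sig;     /* declared once, when the method is registered */
    PniGenericFn fn;      /* cast back to its real type before the call */
} PniMethodDef;

/* Extension side: plain C functions, no argument parsing at all. */
static long ext_add(long a, long b) { return a + b; }
static double ext_half(double x) { return x / 2.0; }

static const PniMethodDef ext_methods[] = {
    { "add",  SIG_LONG_LONG_RET_LONG, (PniGenericFn)ext_add },
    { "half", SIG_DOUBLE_RET_DOUBLE,  (PniGenericFn)ext_half },
};

/* Runtime side: check and unpack the arguments, call, box the result.
   Returns 0 on success, -1 on an argument mismatch. */
int pni_call(const PniMethodDef *m, const Value *args, size_t nargs,
             Value *result)
{
    assert(m != NULL && result != NULL);
    switch (m->sig) {
    case SIG_LONG_LONG_RET_LONG:
        if (nargs != 2 || args[0].tag != VAL_LONG || args[1].tag != VAL_LONG)
            return -1;
        result->tag = VAL_LONG;
        result->u.l = ((long (*)(long, long))m->fn)(args[0].u.l, args[1].u.l);
        return 0;
    case SIG_DOUBLE_RET_DOUBLE:
        if (nargs != 1 || args[0].tag != VAL_DOUBLE)
            return -1;
        result->tag = VAL_DOUBLE;
        result->u.d = ((double (*)(double))m->fn)(args[0].u.d);
        return 0;
    }
    return -1;
}
```

The extension functions are ordinary C with native parameter types; all checking and conversion lives in one place on the runtime side. A method that wants to do its own parsing would simply register under a signature that hands it the raw argument tuple.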

What we gain

  • An easier way for casual authors to create extension methods.

What we lose

  • Since we keep the possibility of the old method signature around, we lose nothing.

Migration path

This step can be adopted as needed. Modules that are already stable need not use it. New development can take advantage of it if and when its author wants to.

Note that this functionality is not hard to make available as a kind of code generator. Making it part of the core though would certainly ensure wider acceptance.