Category Archives: Python

Compilers, Runtimes, and Users – Oh my!

I’ve been meaning to respond to this since last week, but was totally caught up with SciPy and a product launch. Better late than never.

This is a response to Alex Gaynor’s response to my previous post, “Does the Compiler Know Best?”

I plan to respond to some of the more specific points from Alex in the comment stream of his blog post itself. However, I am writing a new post here to reiterate the broader picture of the forest which I think is being missed for the trees. Specifically: I wrote the blog post in response to the disturbing rise in comments of the form “Why shouldn’t PyPy be the official implementation of Python?” I think part of Alex’s (and other PyPy devs’) confusion or consternation at the post is due to my failure to more clearly identify the intended audience. The post was meant as a response to those voices in the peanut gallery of the blogo/twitter-sphere, and was not meant as a harangue against the PyPy community.

However, I do stand firmly by my core point: a huge part of Python’s success is due to how well its runtime plays with existing codebases and libraries, and it would be utterly foolish to abandon that trump card to join in the “who has the fastest VM” race.

Alex writes: “This is wrong. Completely and utterly. PyPy is in fact not a change to Python at all; PyPy faithfully implements the Python language as described by the Python language reference, and as verified by the test suite.”

There is certainly no argument that PyPy is a faithful implementation of the Python official language standard. It passes the test suites! However, do the test suites reflect the use cases out in the wild? Or do those not matter? On the basis of the many years of consulting I have done in applying Python to solving real problems at companies with dozens, hundreds, or even thousands of Python programmers, my perspective is that CPython-as-a-runtime is almost as important as Python-as-a-language. They have co-evolved and symbiotically contributed to each other’s success.

It is not my place, nor is it my intent, to tell people what they can and cannot implement. My one goal is to voice a dissenting opinion when I feel the ecosystem is presented with proposals which I think are either short-sighted or damaging. My previous post was in response to the clamor from those who have only been exposed to Python-the-language, and who only have visibility to those surface layer use cases.

He continues: “Second, he writes, ‘What is the core precept of PyPy? It’s that “the compiler knows best”.’ This too, is wrong. First, PyPy’s central thesis is, ‘any task repeatedly performed manually will be done incorrectly’; this is why we have things like automatic insertion of the garbage collector, in preference to CPython’s ‘reference counting everywhere’, and automatically generating the just-in-time compiler from the interpreter…”

Interestingly enough, CPython’s reference-counting based garbage collection scheme is sometimes cited as one of the things that makes it integrate more nicely with external code. (It might not exactly be easy to track all the INCREFs and DECREFs, but the end result is more stable and easier to deal with.) And there is no problem with auto-generating the JIT compiler, except that (as far as I know) there is not a well-defined API into it, so that external code can interoperate with the compiler and the code it’s running.

It would appear that there is a one way funnel from a user’s Python source and the RPython interpreter through the PyPy machinery, to emit a JITting compiler at the other end. This is fine if the center and the bulk of all the action is in the Python source. However, for a huge number of Python users, this is simply not the case. For those users, having an opaque “Voila! Here’s some fast machine code!” compiler pipeline is not nearly as useful as a pipeline whose individual components they can control.

And that is the primary difference between a compiler and an interpreter. The interpreter has well-defined state that can be inspected and modified as it processes a program. A compiler has a single-minded goal of producing optimized code for a target language. Of course, the PyPy project has a compiler and an interpreter, but the generated runtime is not nearly as easy to integrate and embed as CPython. I will say it again: CPython-the-runtime is almost as important as Python-the-language in contributing to the success of the Python ecosystem.
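The “well-defined state that can be inspected” claim is concrete: CPython exposes its own interpreter state as ordinary objects. A small sketch using the standard `inspect` module (the function names here are illustrative):

```python
import inspect

# Every active call in CPython is a frame object carrying its locals, its
# code object, and a link to its caller -- all inspectable at runtime.
def inner():
    frame = inspect.currentframe().f_back      # the caller's live frame
    return frame.f_code.co_name, dict(frame.f_locals)

def outer():
    secret = 42                                # visible to inner() without being passed
    return inner()

name, caller_locals = outer()
assert name == "outer"
assert caller_locals["secret"] == 42
```

Debuggers, profilers, and embedding hosts are all built on exactly this kind of transparent access to the running system.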

Alex also writes: “Second, the PyPy developers would never argue that the compiler knows best, … having an intelligent compiler does not prohibit giving the user more control, in fact it’s a necessity! There are no pure-python hints that you can give to CPython to improve performance, but these can easily be added with PyPy’s JIT.”

Quick quiz: How long has Python had static typing?
Answer: Since 1995, when Jim Hugunin & others wrote Numeric.

Numeric/Numarray/Numpy have been the longest-lived and most popular static typing system for Python, even though they are generally only used by people who wanted to statically type millions or billions of memory locations all at once. CPython made it easy to extend the interpreter so that variables and objects in the runtime were like icebergs, with a huge amount of sub-surface mass. The perspective of CPython-the-runtime has been one of “extensible & embeddable”.
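The iceberg metaphor is easy to demonstrate, assuming NumPy is installed: one small Python object at the surface fronts a large, statically typed buffer below it.

```python
import numpy as np

# A NumPy array statically types a million memory locations at once: a single
# dtype governs every element, stored in one contiguous C buffer.
a = np.zeros(1_000_000, dtype=np.float64)

assert a.dtype == np.float64
assert a.itemsize == 8                 # 8 bytes per element...
assert a.nbytes == 8_000_000           # ...and no per-element PyObject boxes

# Contrast with a Python list of floats, where every element is a full,
# individually allocated, dynamically typed PyObject.
```

C, Fortran, and GPU code can consume that buffer directly, which is the whole point of the sub-surface mass.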

Does it matter that these extensions were “impure”? Did that hurt or help the Python ecosystem?

At the end of the day, it comes down to what users want. In the last 5 years, since Python has successfully regained mindshare in the web development space that was lost to RoR, there have been a large number of relative newcomers to the ecosystem whose needs are much more focused on latency and I/O throughput, and for whom extension modules are an afterthought. For these users, there is a single-minded focus on raw VM performance; the closest they will ever get to an extension module is maybe a DB driver or some XML/text processing library or an async library. It’s understandable that such people do not understand what all the fuss is about, and why they might vocally push for PyPy to replace CPython as the reference implementation.

The users I have been exposed to in my experiences with Python are a much different crowd. I don’t think they are any fewer in number; however, they are usually working at large companies, banks, the military, or government, and generally do not tweet and blog about their technology. I have sat in numerous corporate IT discussions where Python has been stacked up against Java, .Net, and the like – and in all of those discussions, I can assure you that the extensibility of the CPython interpreter (and, by extension, the availability of libraries like NumPy) has been a major point in our favor. In these environments, Python does not exist in a vacuum, nor is it even at the top of the food chain. Its adoption is usually due to how well it plays with others, both as an extensible and as an embeddable VM.

The reason I felt I needed to write a “dissenting” blog post is that I know for every blog or Reddit/HN comment, there are dozens or hundreds of other people watching the discussions and debates who are not able (or willing) to comment, but who do have internal corporate discussions about technology direction. If Python-the-community were to deprecate CPython-the-runtime, those internal discussions would head south very quickly.

My post was directed specifically at people who make comments about what Python should or should not do, without having good insight into what its user base (both individual and corporate) looks like. The subtext of my post is that “there are voices in the Python ecosystem who understand traditional business computing needs”, and the audience of that subtext is all the lurkers and watchers on the sidelines who are constantly evaluating or defending their choice of language and technologies.


Does the compiler know best?

Ted Dziuba recently blogged about Python3’s Marketing Problem. I chimed in on the comment thread, but there was a deeper point that I felt is missed in the discussions about the GIL and PyPy and performance. Lately I’ve seen more and more people expressing sentiments along the lines of:

I’m of the same mind, but think that instead of offering a GIL fix, the goodie should have been switching over to PyPy. That would have sold even more people on it than GIL removal, I think.

I know it is an unpopular opinion, but somebody’s got to say it: PyPy is an even more drastic change to the Python language than Python3. It’s not even a silver bullet for performance. I believe that its core principles are, in fact, antithetical to the very things that have brought Python its current success. This is not to say that it’s not an interesting project. But I really, really feel that there needs to be a visible counter to the meme that “PyPy is the future of Python performance”.

What is the core precept of PyPy? It’s that “the compiler knows best”. Whether it’s JIT hotspot optimization, or using STM to manage concurrency, the application writer, in principle, should not have to be bothered with mundane details like how the computer actually executes instructions, or which instructions it’s executing, or how memory is accessed. The compiler knows best.

Conversely, one of the core strengths of Python has been that it talks to everybody, because its inner workings are so simple. Not only is it used heavily by folks of all stripes to integrate legacy libraries, but it’s also very popular as an embedded scripting system in a great number of applications. It is starting to dominate on the backend and the front-end in the computer graphics industry, and hedge funds are starting to converge on it as the best language to layer on top of their low-level finance libraries.
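That “talks to everybody” quality does not even require writing an extension module. A minimal sketch using the standard `ctypes` module to call straight into libc, assuming a Unix-like system where `find_library` can locate it:

```python
import ctypes
import ctypes.util

# CPython can load an arbitrary C shared library at runtime and call into it
# directly -- no wrapper code, no recompilation.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

assert libc.strlen(b"hello") == 5
```

This kind of casual, two-way traffic with native code is precisely what an opaque compiled runtime makes harder.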

If you doubt that transparency is a major feature, you simply have to look at the amount of hand-wringing that JVM folks do about “being hit by the GC” to understand that there, but by the grace of Guido, go we. If we have to give up ease of embedding and interoperability, and visibility into what the running system is doing, for a little improvement in performance, then the cost is too steep.

It’s understandable that those who see Python as merely a runtime for some web app request handlers will have a singular fixation with “automagically” getting more performance (JIT) and concurrency (STM) from their runtime. I never thought I’d say this, but… for those things, just fucking use Node.js. Build a Python-to-JS cross compiler and use a runtime that was designed to be concurrent, sandboxed, lightweight, and has the full force of Google, Mozilla, Apple, and MSFT behind optimizing its performance across all hardware types. (It would not surprise me one bit if V8+NaCl finally became what the CLR/DLR could have been.) Armin and the PyPy team are incredibly talented, and I think Nick is probably right when he says that nobody has more insight and experience with optimizing Python execution than Armin.

But even Armin has essentially conceded that optimizing Python really requires optimization at a lower level, which is why PyPy is a meta-tracing JIT. However, PyPy has made the irreversible architectural decision that that level should be merely an opaque implementation detail; the compiler knows best.

An alternative view is that language runtimes should be layered, but always transparent.

Given the recent massive increase of commercial investment in LLVM, and the existence of tools in that ecosystem like DragonEgg, syntax really ceases to be a lock-in feature of a language. (Yes, I know that sounds counter-intuitive.) Instead, what matters more is a runtime’s ability to play nicely with others, and of course its stable of libraries which idiomatically use that runtime. Python could be that runtime. Its standard library could become the equivalent of a dynamic language libc.

Python gained popularity in its first decade because it was a non-write-only Perl, and it worked well with C. It exploded in popularity in its second decade because it was more portable than Java, and because the AMD-Intel rivalry led to spectacular improvements in CPU performance, so that an interpreted language was fast enough for most things. For Python to emerge from its third decade as the dynamic language of choice, its core developers and the wider developer community/family will have to make wise, pragmatic choices about what the core strengths of Python are, and what things are best left to others.

Viewed in this light, stressing Unicode over mere performance is a very justifiable decision that will yield far-reaching, long-term returns for the language. (FWIW, this is also why I keep trolling Guido about better DSL support in Python; “playing nicely with others” in a post-LLVM world means syntax interop, as well.)

The good news is that the Python core developers have been consistently great at making pragmatic choices. One new challenge is that the blogosphere/twittersphere has a logic unto itself, and can lead to very distracting, low signal-to-noise ratio firestorms over nothing. (fib(), anyone?) Will Python survive the noise- and gossip-mill of the modern software bazaar? Only time will tell…


Contortions (and eventual success) with pydistutils.cfg

I finally upgraded to Snow Leopard (OS X 10.6) this past weekend, and the first order of business was to get Python configured the way I wanted. I had previously been using a very custom install based on the old “Intel Mac Python 2.5” notes that Robert Kern wrote up for ETS developers/users, and I had resolved to be more intentional and organized about how I managed the installation of Python packages on my new system.

So, I first installed EPD, the Enthought Python Distribution as a base. Then I created a ~/.pydistutils.cfg file with the contents as outlined in the Python docs on Installing Python Packages:

[install]
install-base=$HOME/Library/Python2.6
install-purelib=site-packages
install-platlib=plat-mac
install-scripts=scripts
install-data=data

I then tried to install Mercurial using the one-liner:


$ easy_install mercurial

And I was promptly greeted with the error:


error: install-base or install-platbase supplied, but installation scheme is incomplete

WTF?

Google turned up nothing of substance, save for a link to an old subversion commit of distutils/commands/install.py. Taking this as a sign, I opened up my local copy of the file and a brief code read revealed the source of the problem: I was missing the install-headers option. So, I added the line:

    install-headers=Include

And was greeted by a different error:

install_dir site-packages/
TEST FAILED: site-packages/ does NOT support .pth files
error: bad install directory or PYTHONPATH

You are attempting to install a package to a directory that is not
on PYTHONPATH and which Python does not read ".pth" files from.  The
installation directory you specified (via --install-dir, --prefix, or
the distutils default setting) was:

    site-packages/

and your PYTHONPATH environment variable currently contains:

    '/Users/pwang/Library/Python2.6/site-packages:/Users/pwang/Library/Python2.6/plat-mac:'

Well, this was most disheartening. I was, after all, following the Python docs, which seem to imply that install-purelib would be appended to install-base. The above error message suggests that this was not the case, so I went back to the distutils source, and more code reading and tracing seemed to confirm this. So, I added an explicit $base to all of the config lines in my pydistutils.cfg, with a final result that looked like this:

[install]
install-base=$HOME/Library/Python2.6
install-purelib=$base/site-packages
install-platlib=$base/plat-mac
install-headers=$base/Include
install-scripts=$base/scripts
install-data=$base/data

This, finally, seemed to work. easy_install mercurial worked great, and everything installed into the proper locations. One thing to note was that the $base variable in pydistutils.cfg needs to be lower case.
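To illustrate the case-sensitivity, here is a rough sketch of distutils-style `$base` substitution (this uses the standard `string.Template` as an analogy, not distutils’ actual implementation):

```python
import string

# Distutils-style config expansion: $base is replaced with the value of
# install-base before the paths are used. Substitution is case-sensitive.
def expand(value, base):
    return string.Template(value).substitute(base=base)

home = "/Users/pwang/Library/Python2.6"
assert expand("$base/site-packages", home) == home + "/site-packages"

# "$BASE/site-packages" would raise KeyError here -- mirroring why the
# variable must be lower case in pydistutils.cfg.
```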

Hopefully this entry will turn up the next time someone searches for “install-base or install-platbase supplied, but installation scheme is incomplete” and they are spared having to dig through the distutils source.