Category Archives: Computing

OS X tarballs: a saga of extended attributes

Are you a Mac user? Creating a tarball? Beware!

Today I ran into a problem where a tar file that I created had a bunch of ._ files:

$ tar tvf ../bokeh-0.2.tgz
drwxr-xr-x 0 pwang staff 0 Oct 23 15:12 bokeh-0.2/
-rw-r--r-- 0 pwang staff 226 Oct 23 15:04 bokeh-0.2/._.gitattributes
-rw-r--r-- 0 pwang staff 31 Oct 23 15:04 bokeh-0.2/.gitattributes
-rw-r--r-- 0 pwang staff 226 Oct 23 15:04 bokeh-0.2/._.gitignore
-rw-r--r-- 0 pwang staff 1047 Oct 23 15:04 bokeh-0.2/.gitignore
-rwxr-xr-x 0 pwang staff 226 Oct 23 15:04 bokeh-0.2/._bokeh
drwxr-xr-x 0 pwang staff 0 Oct 23 15:04 bokeh-0.2/bokeh/
-rwxr-xr-x 0 pwang staff 226 Oct 23 15:04 bokeh-0.2/._bokeh-server
-rwxr-xr-x 0 pwang staff 1681 Oct 23 15:04 bokeh-0.2/bokeh-server
-rw-r--r-- 0 pwang staff 226 Oct 23 15:04 bokeh-0.2/._CHANGELOG
-rw-r--r-- 0 pwang staff 1388 Oct 23 15:04 bokeh-0.2/CHANGELOG
...

When I looked in the source directory that I was tarring up, I saw no such files. What gives?

It turns out that OS X’s version of tar automatically creates these ._ versions for each file that has extended attributes, because it does not want that information to be lost. Extended attributes are a feature of the HFS+ filesystem that OS X uses. If you ever do an “ls -l” and see an “@” symbol next to the permissions on a file, that means the file has extended attributes:

$ ls -l
total 128
-rw-r--r--@ 1 pwang staff 1388 Oct 23 15:04 CHANGELOG
-rw-r--r--@ 1 pwang staff 2587 Oct 23 15:04 QUICKSTART.md

To list these attributes, you do “ls -@”:

$ ls -@l
total 128
-rw-r--r--@ 1 pwang staff 1388 Oct 23 15:04 CHANGELOG
	com.apple.quarantine	71
-rw-r--r--@ 1 pwang staff 2587 Oct 23 15:04 QUICKSTART.md
	com.apple.quarantine	71

To strip these off, you use the xattr command:

$ xattr -d com.apple.quarantine CHANGELOG
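
To double-check, run xattr with just the filename; it lists the attribute names, so after the delete it should print nothing:

$ xattr CHANGELOG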

To strip the attributes off of all files in a directory tree, use find:

$ find . -xattr -exec xattr -d com.apple.quarantine {} \;
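
Better still, you can keep OS X’s tar from packing up the AppleDouble ._ entries in the first place. Here is a sketch, assuming OS X 10.5 or later, where bsdtar honors the COPYFILE_DISABLE environment variable (on 10.4 the spelling was COPY_EXTENDED_ATTRIBUTES_DISABLE, so check your man page):

$ COPYFILE_DISABLE=1 tar czf bokeh-0.2.tgz bokeh-0.2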

Now you can create a tarball that won’t be offensive to your friends!

Compilers, Runtimes, and Users – Oh my!

I’ve been meaning to respond to this since last week, but was totally caught up with SciPy and a product launch. Better late than never.

This is a response to Alex Gaynor’s response to my previous post, “Does the Compiler Know Best?”

I plan to respond to some of the more specific points from Alex in the comment stream of his blog post itself. However, I am writing a new post here to reiterate the broader picture of the forest which I think is being missed for the trees. Specifically: I wrote the blog post in response to the disturbing rise in comments of the form “Why shouldn’t PyPy be the official implementation of Python?” I think part of Alex’s (and other PyPy devs’) confusion or consternation at the post is due to my failure to more clearly identify the intended audience. The post was meant as a response to those voices in the peanut gallery of the blogo/twitter-sphere, and was not meant as a harangue against the PyPy community.

However, I do stand firmly by my core point: a huge part of Python’s success is due to how well its runtime plays with existing codebases and libraries, and it would be utterly foolish to abandon that trump card to join in the “who has the fastest VM” race.

Alex responds: “This is wrong. Complete and utterly. PyPy is in fact not a change to Python at all, PyPy faithfully implements the Python language as described by the Python language reference, and as verified by the test suite.”

There is certainly no argument that PyPy is a faithful implementation of the official Python language standard. It passes the test suites! However, do the test suites reflect the use cases out in the wild? Or do those not matter? On the basis of the many years of consulting I have done applying Python to real problems at companies with dozens, hundreds, or even thousands of Python programmers, my perspective is that CPython-as-a-runtime is almost as important as Python-as-a-language. The two have co-evolved and symbiotically contributed to each other’s success.

It is not my place, nor is it my intent, to tell people what they can and cannot implement. My one goal is to voice a dissenting opinion when I feel the ecosystem is presented with proposals which I think are either short-sighted or damaging. My previous post was in response to the clamor from those who have only been exposed to Python-the-language, and who only have visibility to those surface layer use cases.

He continues (the “he” in this quote being me): “Second, he writes, ‘What is the core precept of PyPy? It’s that “the compiler knows best”.’ This too, is wrong. First, PyPy’s central thesis is, ‘any task repeatedly performed manually will be done incorrectly’, this is why we have things like automatic insertion of the garbage collector, in preference to CPython’s ‘reference counting everywhere’, and automatically generating the just in time compiler from the interpreter…”

Interestingly enough, CPython’s reference-counting based garbage collection scheme is sometimes cited as one of the things that makes it integrate more nicely with external code. (It might not exactly be easy to track all the INCREFs and DECREFs, but the end result is more stable and easier to deal with.) And there is no problem with auto-generating the JIT compiler, except that (as far as I know) there is not a well-defined API into it, so that external code can interoperate with the compiler and the code it’s running.
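
To make that determinism concrete, here is a minimal, CPython-specific sketch (note that sys.getrefcount reports one extra reference, for the temporary created by the call itself):

import sys

x = []
print(sys.getrefcount(x))  # 2: the name x, plus the call's temporary reference

y = x                      # a new alias bumps the count immediately
print(sys.getrefcount(x))  # 3

del y                      # dropping the alias decrements it immediately
print(sys.getrefcount(x))  # 2: no waiting for a collector pause

This is exactly the scheme that extension modules participate in via the Py_INCREF/Py_DECREF calls mentioned above, which is what makes object lifetimes predictable at the C boundary.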

It would appear that there is a one-way funnel from a user’s Python source and the RPython interpreter, through the PyPy machinery, to a JITting compiler emitted at the other end. This is fine if the center and the bulk of all the action is in the Python source. However, for a huge number of Python users, this is simply not the case. For those users, an opaque “Voila! Here’s some fast machine code!” compiler pipeline is not nearly as useful as a pipeline whose individual components they can control.

And that is the primary difference between a compiler and an interpreter. The interpreter has well-defined state that can be inspected and modified as it processes a program. A compiler has a single-minded goal of producing optimized code for a target language. Of course, the PyPy project has a compiler and an interpreter, but the generated runtime is not nearly as easy to integrate and embed as CPython. I will say it again: CPython-the-runtime is almost as important as Python-the-language in contributing to the success of the Python ecosystem.

He also writes: “Second, the PyPy developers would never argue that the compiler knows best, … having an intelligent compiler does not prohibit giving the user more control, in fact it’s a necessity! There are no pure-python hints that you can give to CPython to improve performance, but these can easily be added with PyPy’s JIT.”

Quick quiz: How long has Python had static typing?
Answer: Since 1995, when Jim Hugunin & others wrote Numeric.

The Numeric/Numarray/Numpy lineage has been the longest-lived and most popular static typing system for Python, even though it is generally only used by people who want to statically type millions or billions of memory locations all at once. CPython made it easy to extend the interpreter so that variables and objects in the runtime were like icebergs, with a huge amount of sub-surface mass. The perspective of CPython-the-runtime has been one of “extensible & embeddable”.
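
To make that concrete, here is a toy sketch (assuming NumPy is installed) of statically typing a million memory locations at once:

import numpy as np

# One dtype declaration types a million memory locations in one shot: each
# element is an unboxed 8-byte C double in one contiguous buffer.
a = np.zeros(1000000, dtype=np.float64)

# Arithmetic dispatches to compiled C loops over that typed buffer.
b = a + 1.0
print(b.dtype, b.nbytes)   # float64 8000000

The dtype is the iceberg’s sub-surface mass: the Python-level object is a thin handle over typed memory that external C and Fortran code can operate on directly.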

Does it matter that these extensions were “impure”? Did that hurt or help the Python ecosystem?

At the end of the day, it comes down to what users want. In the last 5 years, since Python has successfully regained the mindshare in the web development space that was lost to RoR, there have been a large number of relative newcomers to the ecosystem whose needs are much more focused on latency and I/O throughput, and for whom extension modules are an afterthought. For these users, there is a single-minded focus on raw VM performance; the closest they will ever get to an extension module is maybe a DB driver, some XML/text processing library, or an async library. It’s understandable that such people do not understand what all the fuss is about, and why they might vocally push for PyPy to replace CPython as the reference implementation.

The Python users I’ve been exposed to are a much different crowd. I don’t think they are any fewer in number; however, they usually work at large companies, banks, or military and government institutions, and generally do not tweet and blog about their technology. I have sat in numerous corporate IT discussions where Python has been stacked up against Java, .Net, and the like – and in all of those discussions, I can assure you that the extensibility of the CPython interpreter (and, by extension, the availability of libraries like NumPy) has been a major point in our favor. In these environments, Python does not exist in a vacuum, nor is it even at the top of the food chain. Its adoption is usually due to how well it plays with others, both as an extensible and as an embeddable VM.

The reason I felt I needed to write a “dissenting” blog post is that I know that for every blog, Reddit, or HN comment, there are dozens or hundreds of other people watching the discussions and debate who are not able (or willing) to comment, but who have internal corporate discussions about technology direction. If Python-the-community were to deprecate CPython-the-runtime, those internal discussions would head south very quickly.

My post was directed specifically at people who make comments about what Python should or should not do, without having good insight into what its user base (both individual and corporate) looks like. The subtext of my post is that “there are voices in the Python ecosystem who understand traditional business computing needs”, and the audience for that subtext is all the lurkers and watchers on the sidelines who are constantly evaluating or defending their choice of language and technologies.

Does the compiler know best?

Ted Dziuba recently blogged about Python3’s Marketing Problem. I chimed in on the comment thread, but there was a deeper point that I felt was missed in the discussions about the GIL and PyPy and performance. Lately I’ve seen more and more people expressing sentiments along the lines of:

I’m of the same mind, but think that instead of offering a GIL fix, the goodie should have been switching over to PyPy. That would have sold even more people on it than GIL removal, I think.

I know it is an unpopular opinion, but somebody’s got to say it: PyPy is an even more drastic change to the Python language than Python3. It’s not even a silver bullet for performance. I believe that its core principles are, in fact, antithetical to the very things that have brought Python its current success. This is not to say that it’s not an interesting project. But I really, really feel that there needs to be a visible counter to the meme that “PyPy is the future of Python performance”.

What is the core precept of PyPy? It’s that “the compiler knows best”. Whether it’s JIT hotspot optimization, or using STM to manage concurrency, the application writer, in principle, should not have to be bothered with mundane details like how the computer actually executes instructions, or which instructions it’s executing, or how memory is accessed. The compiler knows best.

Conversely, one of the core strengths of Python has been that it talks to everybody, because its inner workings are so simple. Not only is it used heavily by folks of all stripes to integrate legacy libraries, but it’s also very popular as an embedded scripting system in a great number of applications. It is starting to dominate on the backend and the front-end in the computer graphics industry, and hedge funds are starting to converge on it as the best language to layer on top of their low-level finance libraries.

If you doubt that transparency is a major feature, you simply have to look at the amount of hand-wringing that JVM folks do about “being hit by the GC” to understand that there, but by the grace of Guido, go we. If we have to give up ease of embedding and interoperability, and visibility into what the running system is doing, for a little improvement in performance, then the cost is too steep.

It’s understandable that those who see Python as merely a runtime for some web app request handlers will have a singular fixation with “automagically” getting more performance (JIT) and concurrency (STM) from their runtime. I never thought I’d say this, but… for those things, just fucking use Node.js. Build a Python-to-JS cross compiler and use a runtime that was designed to be concurrent, sandboxed, lightweight, and has the full force of Google, Mozilla, Apple, and MSFT behind optimizing its performance across all hardware types. (It would not surprise me one bit if V8+NaCl finally became what the CLR/DLR could have been.) Armin and the PyPy team are incredibly talented, and I think Nick is probably right when he says that nobody has more insight and experience with optimizing Python execution than Armin.

But even Armin has essentially conceded that optimizing Python really requires optimization at a lower level, which is why PyPy is a meta-tracing JIT. However, PyPy has made the irreversible architectural decision that that level should be merely an opaque implementation detail; the compiler knows best.

An alternative view is that language runtimes should be layered, but always transparent.

Given the recent massive increase of commercial investment in LLVM, and the existence of tools in that ecosystem like DragonEgg, syntax really ceases to be a lock-in feature of a language. (Yes, I know that sounds counter-intuitive.) Instead, what matters more is a runtime’s ability to play nicely with others, and of course its stable of libraries which idiomatically use that runtime. Python could be that runtime. Its standard library could become the equivalent of a dynamic language libc.

Python gained popularity in its first decade because it was a non-write-only Perl, and it worked well with C. It exploded in popularity in its second decade because it was more portable than Java, and because the AMD-Intel rivalry led to spectacular improvements in CPU performance, so that an interpreted language was fast enough for most things. For Python to emerge from its third decade as the dynamic language of choice, its core developers and the wider developer community/family will have to make wise, pragmatic choices about what the core strengths of Python are, and what things are best left to others.

Viewed in this light, stressing Unicode over mere performance is a very justifiable decision that will yield far-reaching, long-term returns for the language. (FWIW, this is also why I keep trolling Guido about better DSL support in Python; “playing nicely with others” in a post-LLVM world means syntax interop as well.)

The good news is that the Python core developers have been consistently great at making pragmatic choices. One new challenge is that the blogosphere/twittersphere has a logic unto itself, and can lead to very distracting, low signal-to-noise-ratio firestorms over nothing. (fib(), anyone?) Will Python survive the noise- and gossip-mill of the modern software bazaar? Only time will tell…

A Sketch of the Future of (Mobile) Computing

I saw two interesting tech news items this morning:

Google Begins Testing Its Augmented Reality Glasses
and
Motorola is Turning Android into a Desktop OS

In the future, whatever data/apps/preferences are not stored in the cloud and streamed to your device will be encapsulated in a small digital token that you keep with you and plug into any available local hardware. The idea of lugging around a laptop, or even a phone, as a physical container for your data will be utterly outdated. Consider: your iPhone contains about 32GB of storage. You can, today, go into a Best Buy and get a 32GB microSDHC card the size of your pinky nail. Your phone’s SIM card is about the same size.

So, the only thing that distinguishes your phone from any other phone in the world can physically fit on something the size of a fingernail. The only challenges are software ones: apps would need to recognize a larger set of hardware than they currently do, but Apple, Google, and Microsoft all have their own strategic initiatives to tackle that challenge.

So instead of plugging a phone into a pad (like the Motorola thing), you will plug a tiny data crystal into any computing device and have your data, your apps, your contacts, your photos, etc. all right there.

The augmented reality glasses are a way for you to always have some of your data available, even when you are away from larger computing devices. As with my Looxcie, they will lifestream your experiences to your data crystal over Bluetooth, and they will have a minimal voice-activated dialer/phone interface that uses whatever local network is available. If you are on a mobile cellular network, it will use the subscriber information off your data crystal to connect to the cellular provider; if you are on WiFi, it will use the standard internet. (FaceTime & iMessage already do this.)

The crux is that data has traditionally been confined/jailed in physical devices, but storage has gotten so cheap and bandwidth has become so pervasive that this no longer makes sense. So the real challenge is to deliver a software development platform (and ecosystem) that allows developers to target multiple devices, contexts, and usage environments. Apple wants to do this by unifying iOS and desktop, and having storage use iCloud. Google wants to use Android as the underlying mobile OS, but both Google and Microsoft are betting on HTML and the web as the application environment.

Javascript Refresher Cheat Sheet

Last week I dove back into Javascript after being away from it for a while, and I wrote up a “Javascript Refresher Cheat Sheet” on the basis of some of the books and web resources I read.

Watson is winning at buzzing, not Jeopardy

It’s been inspiring to watch IBM’s Watson kicking butt on Jeopardy, since I am a scientific programmer and understand the difficulty of the problem the Watson team is attempting to solve. However, I can’t help but notice that Watson seems to be having much better luck nailing the buzzer, compared to its human counterparts.

Years ago, a family friend ended up on Jeopardy, and after the experience, she commented that she had underestimated the importance of finessing the buzzer. A Google search for “jeopardy buzzer” turns up quite a few pages, including this very informative page entitled “How to Win on the Buzzer”, by Michael Dupee, a former Jeopardy contestant.

It’s not clear to me how Watson is notified that Alex Trebek has finished reading the clue, but no matter how it is done, the computer has a clear advantage. If, for instance, the computer is simply sent a signal wired into the same system that the off-stage assistant uses to enable the human contestants’ buzzers, then it can respond with almost zero latency as soon as it gets the signal. No human can compete with that: contestants who rely on the pin light to notify them of buzzer activation will always be late, and those who try to “time” the assistant and guess when he feels Trebek has finished reading the clue will never have the microsecond accuracy that Watson has.

Alternatively, if a direct signal is not sent, but Watson is instead equipped with an audio sensor to process Trebek’s voice as he reads the clue, it is still easy to write a simple optimization routine that quickly learns when the assistant activates the buzzers. Watson can devote a minuscule fraction of a single processor to this task and still be orders of magnitude more accurate at timing than its human competitors.
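
To be clear about how little machinery this would take, here is a toy sketch of such a routine (entirely hypothetical; every name is invented for illustration): an estimator that learns the lag between the end of the clue audio and buzzer activation, using an exponential moving average.

class BuzzerTimer:
    """Toy estimator of the delay between end-of-clue and buzzer-enable."""

    def __init__(self, alpha=0.2):
        self.delay = 0.0    # running estimate of the delay, in seconds
        self.alpha = alpha  # smoothing factor for the moving average

    def observe(self, speech_end, buzzer_live):
        # Blend each observed clue's delay into the running estimate.
        sample = buzzer_live - speech_end
        self.delay = (1 - self.alpha) * self.delay + self.alpha * sample

    def next_buzz_time(self, speech_end):
        # Anticipate when the buzzer will go live after the next clue.
        return speech_end + self.delay

A handful of observed clues would converge this to a precision no human motor response could match.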

The point is that if one were to replace Watson with a human being every bit as knowledgeable and capable as Watson, that human would not fare nearly as well in competition, simply because his or her motor response cannot beat that of a specialized robot. So, while Watson’s ability to understand and solve open-ended Jeopardy clues is certainly impressive, the reason he is trouncing the humans seems to have more to do with robotics than with reasoning.

Mail.app freezing and Growl

Today I finally got fed up with Mail.app freezing randomly when checking email. I noticed that it seemed related to GrowlMail, and a quick Google search seemed to indicate that I’m not the only one.

I am using Jim Mitchell’s instructions on removing GrowlMail for the time being.

Is it bad that when I see:

/Library/Receipts/growlmailPreflight.pkg

I think “She packed my bags last night, pre-flight”?

In defense of BSD licenses

Zed Shaw just blogged about why he uses the GPL instead of a BSD-style license. His arguments for the GPL are interesting, but I feel that a counterpoint is needed, since at Enthought (where I work) we try to BSD-license as much of our code as possible.

“It’s the Author’s Right”

First off, I’d like to encourage everyone in the Python community to be respectful. It is never appropriate to get angry at another person for their choice of license, and I am disappointed that Zed feels slighted by the community reaction to his work. (If you don’t like the license on a package, it is sometimes appropriate to ask the author if they’d consider changing it, but obviously beggars can’t be choosers.) Hopefully it’s just a few bad apples (“ungrateful turds”, as he calls them). That being said, “it’s the author’s right” is not really a reason for the GPL, just an admonishment.

“I Don’t Want To Be Ignored Again”/”You Have To Tell Your Boss You’re Using My Gear”

The fact that Zed wrote Mongrel and got no recognition is possibly an indictment of several things: the RoR buzzstorm, the Rails community, the “OMG Ruby is the new Java for Web 2.0” technorati, maybe even venture capitalists. But it is not an indictment of the BSD license.

Numpy and Scipy are two very successful projects, and they are BSD licensed. They have healthy communities, and many people make a living off of consulting work or commercial projects built on those tools. There are also countless companies whose staff make use of them internally, and occasionally give back to the projects. I would consider both Numpy and Scipy to be healthy open-source projects with plenty of mature collaboration between commercial, academic, and hobbyist users. If either project were suddenly to become GPL, it would lose its commercial viability and its community would undoubtedly suffer.

SAGE, an open-source Pythonic replacement for Maple/Mathematica/Magma/Matlab, is another very successful project, but they staunchly use the GPL. Their reasoning is much like Zed’s, because the symbolic math software community has been burned in the past by people profiting from proprietary extensions of BSD code without attribution or contribution. However, the GPL means that people cannot realistically use SAGE in a commercial tool, either as a platform/runtime, or as an embedded component. The SAGE authors have, presumably, weighed the trade-offs and decided it’s ultimately more valuable to be protected than to have the contributions of that segment of developers.

The style of license both depends on and defines the type of community that develops around a project. If you feel that the potential audience of your project consists of the sort of people that are liable to use your code without attribution and without becoming part of the user community, then by all means protect yourself with the GPL. If your code is the outgrowth of a community that already has a healthy number of commercial users, then there’s usually no downside risk of using BSD, while you get the upside of reaching a larger audience. Based on what I’ve seen, the Python community has a pretty healthy mix of commercial developers, and so as a whole I don’t think people there get burned by choosing the BSD license.

“If It’s Good, Pay For It”

Here is where we are in agreement, but there are numerous ways to approach this. Phil Thompson uses a dual licensing scheme with PyQt, wherein commercial developers have to pay for it. Travis Oliphant implemented an interesting “world price”/community fixed-fee scheme to fund the development of Numpy: he wrote a big pile of documentation (the Numpy Book) and sold it until a certain total dollar amount had been reached, at which point the book became freely available. At Enthought, we earn consulting contracts based on our BSD-licensed Enthought Tool Suite and our involvement with the Numpy and Scipy projects.

BSD/LGPL does not imply that you will not make money, and GPL does not ensure that you do. The only way to ensure that you *do* make money is to explicitly dual license.
(Edited: as some have pointed out, dual licensing basically requires the use of the GPL with a commercial license, as BSD does not prohibit commercial use.)

“Finally, Value”

I think there is a very good discussion to be had about how to commercialize the success of open source. Talented coders need to be compensated so they can afford to continue to innovate. Users should be free to use code however they wish, with no limitations on their freedoms, because code is ultimately a form of expression. But we need an interaction model that allows the expression of values and economic preferences without grounding certain values to zero, which is what traditional OSS licenses tend to do. As the practice of software development matures, we simply have to find a better economic model than the traditional Stallman-Gates bifurcation.

However, I think that choosing GPL or BSD is orthogonal to whether or not you feel your work is valuable; it is merely a way to define the kind of community you want around your project. If the community is filled with selfish, short-term opportunists, then protect your code and yourself with the GPL. If the community has a large, healthy contingent of commercial developers, then you’re only hurting yourself if you shut them out.

I recognize that in the scientific Python community, we’ve been extremely lucky to have developed the user base that we have, but I think that has largely been possible *because* we use the BSD license.
