
Resetting PHP 6


By Jonathan Corbet
March 24, 2010
Rightly or wrongly, many in our community see Perl 6 as the definitive example of vaporware. But what about PHP 6? This release was first discussed by the PHP core developers back in 2005. There have been books on the shelves purporting to cover PHP 6 since at least 2008. But, in March 2010, the PHP 6 release is not out - in fact, it is not even close to out. Recent events suggest that PHP 6 will not be released before 2011 - if, indeed, it is released at all.

PHP 6 was, as befits a major release, meant to bring some serious changes to the language. To begin with, the safe_mode feature, which is the whipping boy for PHP security - or the lack thereof - was to be consigned to an unloved oblivion; the "register_globals" feature was to go as well. The proposed traits feature would bring "horizontal reuse" to the language; think of traits as a PHPish answer to multiple inheritance or Java's interfaces. A new 64-bit integer type was planned. PHP was slated to gain a goto keyword (though the plan was to avoid the scary goto name and add target labels to break instead). Some basic static typing features were under consideration. There was even talk of adding namespaces to the language and making function and class names case-sensitive.

The really big change in PHP 6, though, was the shift to Unicode throughout. Anybody who is running a web site which does not use Unicode is almost certainly wishing that things were otherwise - trust your editor on this one. It is possible to support Unicode to an extent even if the language in use is not aware of Unicode, but it is a painful and error-prone affair; proper Unicode support requires a language which understands Unicode strings. The PHP 6 plan was to support Unicode all the way:

PHP6 will have Unicode support everywhere; in the engine, in extensions, in the API. It's going to be native and complete; no hacks, no external libraries, no language bias. English is just another language, it's not the primary language.

Unicode, however, appears to be the rock upon which the PHP 6 ship ran aground. Despite claims back in 2006 that the development process was "going pretty well," it seems that few people are happy with the state of Unicode support in PHP. Memory usage is high, performance is poor, and broken scripts are common. The project has been struggling for some time to find a solution to this problem.

From your editor's reading of the discussion, the fatal mistake would appear to be the decision to use the two-byte UTF-16 encoding for all strings within PHP. According to PHP creator Rasmus Lerdorf, this decision was made to ease compatibility with the International Components for Unicode (ICU) library:

Well, the obvious original reason is that ICU uses UTF-16 internally and the logic was that we would be going in and out of ICU to do all the various Unicode operations many more times than we would be interfacing with external things like MySQL or files on disk. You generally only read or write a string once from an external source, but you may perform multiple Unicode operations on that same string so avoiding a conversion for each operation seems logical.

But a lot of strings simply pass through PHP programs; in the end, the conversion turned out to be more expensive and less convenient than had been hoped. Johannes Schlüter describes the problem this way:

By using UTF-16 as default encoding we'd have to convert the script code and all data passed from or to the script (request data, database results, output, ...) from another encoding, usually UTF-8, to UTF-16 or back. The need for conversion doesn't only require CPU time and more memory (a UTF-16 string takes double memory of a UTF-8 string in many cases) but makes the implementation rather complex as we always have to figure out which encoding was the right one for a given situation. From the userspace point of view the implementation brought some backwards compatibility breaks which would require manual review of the code.

These all are pains for a very small gain for many users where many would be happy about a tighter integration of some mbstring-like functionality. This all led to a situation for many contributors not willing to use "trunk" as their main development tree but either develop using the stable 5.2/5.3 trees or refuse to do development at all.
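
The size half of that claim is easy to see in isolation. As a rough illustration (not from the PHP discussion itself), this little script uses the mbstring extension to compare byte counts for a markup-heavy string and a Japanese one; real web text usually behaves like the first case:

    <?php
    // strlen() counts bytes in PHP, so it shows the raw storage cost.
    // ASCII-heavy text doubles in UTF-16; purely CJK text shrinks.
    $samples = [
        'markup'   => '<a href="/index.php">home</a>',
        'japanese' => '吾輩は猫である。名前はまだ無い。',
    ];
    foreach ($samples as $name => $utf8) {
        $utf16 = mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');
        printf("%-8s UTF-8: %2d bytes   UTF-16: %2d bytes\n",
               $name, strlen($utf8), strlen($utf16));
    }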

The end result of all this is that PHP 6 development eventually stalled. The Unicode problems made a release impossible while blocking other features from showing up in any PHP release at all. Eventually some work was backported to 5.3, but that is always a problematic solution; it brings back memories of the 2.5 kernel development series.

Developer frustration, it seems, grew for some time. Last November, Kalle Sommer Nielsen tried to kickstart the process, saying:

I've been thinking for a while what we should do about PHP6 and its future, because right now it seems like there isn't much future in it.

Things came to a head on March 11, when Jani Taskinen, fed up with being unable to push things forward, (1) committed some disruptive changes to the stable 5.3 branch, and (2) created a new PHP_5_4 branch which looked like it was meant to be a new development tree. That is when Rasmus stepped in:

The real decision is not whether to have a version 5.4 or not, it is all about solving the Unicode problem. The current effort has obviously stalled. We need to figure out how to get development back on track in a way that people can get on board. We knew the Unicode effort was hugely ambitious the way we approached it. There are other ways.

So I think Lukas and others are right, let's move the PHP 6 trunk to a branch since we are still going to need a bunch of code from it and move development to trunk and start exploring lighter and more approachable ways to attack Unicode.

And that is where it stands. The whole development series which was meant to be PHP 6 has been pushed aside to a branch, and development is starting anew based on the 5.3 release. Anything of value in the old PHP 6 branch can be cherry-picked from there as need be, but the process of deciding what goes into the next release is beginning from scratch, and one assumes that proposals will be looked at closely. There are no timelines or plans for the next release at this point; as Rasmus explains, that's not what the project needs now:

We don't need timelines right now. What we need is some hacking time and to bring some fun back into PHP development. It hasn't been fun for quite a while. Once we have a body of new interesting stuff, we can start pondering releases...

So timing and features for the next PHP release are completely unknown at this point. Even the name is unknown; Jani's 5.4 branch has been renamed to THE_5_4_THAT_ISNT_5_4. There has been some concern about all of those PHP 6 books out there; it has been suggested that a release which doesn't conform to expectations for PHP 6 should be called something else - PHP7, even. There's little sympathy for the authors and publishers of those books, but those who bought them may merit a little more care. But that will be a discussion for another day. Meanwhile, the PHP hackers are refocusing on getting things done and having some fun too.




Resetting PHP 6

Posted Mar 24, 2010 16:20 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

"From your editor's reading of the discussion, the fatal mistake would appear to be the decision to use the two-byte UTF-16 encoding for all strings within PHP. According to PHP creator Rasmus Lerdorf, this decision was made to ease compatibility with the International Components for Unicode (ICU) library"

Stupid. UTF-16 is the worst of both worlds: it's a variable-length encoding AND it requires converting almost all incoming/outgoing data.

They should have used UTF-32 or UTF-8; UTF-32 would probably be preferable.

UTF8 vs UTF32

Posted Mar 24, 2010 16:52 UTC (Wed) by dlang (guest, #313) [Link]

which is better depends on what you expect to do with the strings.

if what you are doing with the strings is mostly storing, matching, concatenating and outputting them, UTF8 is better due to its smaller size. In these cases the program doesn't really care what the data is; it's just a string of bytes to work with, and the fact that it happens to be human-readable is a side effect.

If you are searching the string, or parsing a string that contains variable-length items, there is still no real penalty to the variable-length encoding of UTF8, as you have to start at the beginning of the string and walk it anyway.

If you have text with fixed-size items in it that you need to access by character position, then UTF32 is better, as every character is a fixed size; but there is a significant size penalty in manipulating and copying these strings around, so unless you are doing this a lot, you may find you are better off just walking the string. Remember that on modern CPUs a smaller cache footprint usually translates directly to higher performance. Walking a small string that fits into the cache a dozen times will be faster than accessing a larger string that doesn't fit in cache once (completely ignoring the executable code that gets pushed out of the cache by the larger string).

UTF32 carries a large memory size penalty, plus the need to convert when reading/writing.

what it is good for is if you are doing a lot of string manipulation by position (i.e. you know that the field you want starts at position 50)
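
To make the walk concrete, here is what finding the Nth character costs in UTF-8. utf8_offset() is an invented helper, not an existing API, and invalid input is not checked:

    <?php
    // Byte offset of the $n-th code point (0-based) in a UTF-8 string.
    // Continuation bytes match 10xxxxxx, so any byte that does NOT match
    // that pattern starts a new code point. O(n) in the string length.
    function utf8_offset(string $s, int $n): int {
        $len = strlen($s);
        for ($i = 0; $i < $len; $i++) {
            if ((ord($s[$i]) & 0xC0) !== 0x80 && $n-- === 0) {
                return $i;   // start of the wanted code point
            }
        }
        return $len;         // $n is past the end of the string
    }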

UTF8 vs UTF32

Posted Mar 24, 2010 17:57 UTC (Wed) by foom (subscriber, #14868) [Link]

> what it is good for is if you are doing a lot of string manipulation by position (i.e. you know that
> the field you want starts at position 50)

I've never actually heard of a good reason for anyone to want that. Remember that in Unicode, what the user thinks of as a character might in fact be a composition of many Unicode codepoints (e.g. base character + accent). Accessing the Nth codepoint is an almost entirely useless thing to optimize your storage for.

And yet, most programming languages have made this mistake, because they treat strings as an array of codepoints, rather than as a unique data structure. All you really want is forward-and-backward iteration, plus random access given an iterator.
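
A minimal sketch of such an iterator in PHP terms (PHP 8 syntax; the class is invented for illustration, and stray continuation bytes or truncated sequences are not handled):

    <?php
    // Walk a UTF-8 string code point by code point, never exposing an
    // integer character index. key() is a byte offset, not a position.
    final class CodePoints implements Iterator {
        private int $pos = 0;
        public function __construct(private string $s) {}
        public function rewind(): void { $this->pos = 0; }
        public function valid(): bool  { return $this->pos < strlen($this->s); }
        public function key(): int     { return $this->pos; }
        public function current(): string {
            $b = ord($this->s[$this->pos]);                 // lead byte
            $len = $b < 0x80 ? 1 : ($b < 0xE0 ? 2 : ($b < 0xF0 ? 3 : 4));
            return substr($this->s, $this->pos, $len);
        }
        public function next(): void { $this->pos += strlen($this->current()); }
    }

    foreach (new CodePoints("héllo") as $offset => $cp) {
        echo "$offset: $cp\n";   // prints 0: h, 1: é, 3: l, 4: l, 5: o
    }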

UTF8 vs UTF32

Posted Mar 24, 2010 19:49 UTC (Wed) by flewellyn (subscriber, #5047) [Link]

>I've never actually heard of a good reason for anyone to want that.

If you have to translate from an old, COBOL-style fixed-width data format into something modern and/or at least reasonably sane, that's a very good reason to want just that.

UTF8 vs UTF32

Posted Mar 24, 2010 20:44 UTC (Wed) by foom (subscriber, #14868) [Link]

I'm gonna bet the old COBOL stuff is not actually outputting Unicode. :)

UTF8 vs UTF32

Posted Mar 24, 2010 20:45 UTC (Wed) by flewellyn (subscriber, #5047) [Link]

Quite. But that IS a case where you want to do string manipulation by position.

UTF8 vs UTF32

Posted Mar 25, 2010 14:14 UTC (Thu) by dgm (subscriber, #49227) [Link]

Fine then. If you limit yourself to the ASCII subset, UTF-8 offers you that. Is that enough for COBOL?

Iterators vs indices

Posted Mar 25, 2010 23:08 UTC (Thu) by butlerm (subscriber, #13312) [Link]

> I've never actually heard of a good reason for anyone to want that.

Most languages (and SQL in particular) work exclusively using string indexes. You cannot use an iterator in a functional context. SQL doesn't have string iterators and never will. Iterators are a procedural thing.

Numeric indices are the only language-independent option available for specifying and extracting parts of strings. That is why they are universal. I wouldn't use a language that didn't support them, in large part due to the difficulty of translating idioms from languages that do.

As a consequence, anything that needs to be done to efficiently present a string of any reasonable size as a linear array of _characters_ (or at least code points) is the language's problem, not the programmer's. That is the approach SQL takes, and from an application programmer's point of view it works very well.

That is not to say that an iterator interface shouldn't be provided (in procedural languages) as well, but of the two an index based interface is more fundamental.

Iterators vs indices

Posted Mar 26, 2010 13:49 UTC (Fri) by foom (subscriber, #14868) [Link]

I think you must be using a funny definition of functional. There is absolutely nothing that prevents an iterator-based API from working in a functional language. And of course most functional languages have many such APIs.

Let's take a traditional example: singly-linked lists are a quite common data structure in functional (or mostly-functional) languages like Haskell, Scheme, etc. Yet, you don't index them by position (that is of course available if you need it, but it's O(n) time, so you don't normally want to use it). Instead, you use an iterator, which in this case is a pointer to the current element.

If anyone suggested that the primary access method for a singly linked list should be by integer position, they'd be rightly told that's insane -- iterating over the list would take O(n^2)!

Now, maybe your real point was simply that existing languages already have a poorly-designed Unicode string API that they have to keep compatibility with -- and that API doesn't include iterators. So, they therefore have constraints they need to preserve, such as O(1) access by character index, because existing programs require it.

I won't argue with that, but I still assert it's not actually a useful feature for a Unicode string API, in the absence of the API-compatibility requirement.

Iterators vs indices

Posted Mar 26, 2010 22:09 UTC (Fri) by spitzak (guest, #4593) [Link]

If you really can't get away from the integer index, a solution is to have a string format that stores the most recent index computed and where it was in the string. Then when asked for a new index it will move from that previous position to the new one if the previous position is less than 2x the new one.

For the vast majority of cases where each integer starting from zero is used to get the "character" this would put the implementation back to O(1). And it would allow more complex accessors, such as "what error is here".
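
In PHP-ish terms the trick is just two cached integers next to the bytes (PHP 8 syntax; the names are invented, and only the forward direction is optimized for brevity):

    <?php
    // Cache the byte offset of the last code point index we resolved, so
    // the common pattern s[0], s[1], s[2], ... is O(1) amortized rather
    // than O(n) per access. Going backwards restarts from the beginning.
    final class CachedIndexString {
        private int $lastChar = 0;   // last code-point index resolved
        private int $lastByte = 0;   // its byte offset
        public function __construct(private string $s) {}

        public function charAt(int $n): string {
            if ($n < $this->lastChar) {
                $this->lastChar = $this->lastByte = 0;
            }
            while ($this->lastChar < $n) {
                $this->lastByte += $this->cpLen($this->lastByte);
                $this->lastChar++;
            }
            return substr($this->s, $this->lastByte, $this->cpLen($this->lastByte));
        }

        private function cpLen(int $i): int {   // bytes in the code point at $i
            $b = ord($this->s[$i]);
            return $b < 0x80 ? 1 : ($b < 0xE0 ? 2 : ($b < 0xF0 ? 3 : 4));
        }
    }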

Iterators vs indices

Posted Mar 30, 2010 6:57 UTC (Tue) by njs (subscriber, #40338) [Link]

Another option is to store a string as a tree structure, where the leaves are some reasonable-sized chunks of bytes (to amortize storage overhead), and the tree nodes are annotated with the number of characters/bytes/code points/lines/whatever that occur underneath them. This allows random O(log n) access by character/byte/... offset. (You can maintain several different sorts of counts, and get fast access for all of them in the same data structure.) You also get cheap random insertion/deletion, which is an important operation for some tasks (e.g., editor buffers!) but horrendously slow for arrays.

For some reason nobody does this, though.
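
For what it's worth, a toy PHP version of the idea, tracking a single count (code points) and leaving out balancing, insertion, and the multiple-count annotations:

    <?php
    // Leaves hold UTF-8 chunks; each branch caches the code-point count
    // of its left subtree, so charAt() is O(log n) in the number of
    // chunks, with a linear scan only inside the final chunk.
    abstract class Rope {
        abstract public function count(): int;           // code points below
        abstract public function charAt(int $n): string;
    }
    final class Leaf extends Rope {
        public function __construct(private string $chunk) {}
        public function count(): int { return mb_strlen($this->chunk, 'UTF-8'); }
        public function charAt(int $n): string {
            return mb_substr($this->chunk, $n, 1, 'UTF-8');
        }
    }
    final class Branch extends Rope {
        private int $leftCount;
        public function __construct(private Rope $l, private Rope $r) {
            $this->leftCount = $l->count();              // the cached annotation
        }
        public function count(): int { return $this->leftCount + $this->r->count(); }
        public function charAt(int $n): string {
            return $n < $this->leftCount
                ? $this->l->charAt($n)
                : $this->r->charAt($n - $this->leftCount);
        }
    }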

Iterators vs indices

Posted Mar 30, 2010 7:11 UTC (Tue) by dlang (guest, #313) [Link]

the biggest reason nobody stores strings that way is the overhead. it requires many pointers, which end up making UTF-32 look compact by comparison.

besides, as noted earlier in this thread, most uses of strings really don't care how they break apart; they are almost always used as-is (or at most with one step of parsing, usually on whitespace, on input). as such, anything more than the most compact representation ends up costing significantly more in memory size (and therefore cache space) than you gain with any string manipulation that you do

Google Wave actually stores strings the way you are suggesting, or did when I saw the presentation on it last year, but I think that doing so will keep it from being used for anything beyond trivial uses.

Iterators vs indices

Posted Mar 30, 2010 7:46 UTC (Tue) by njs (subscriber, #40338) [Link]

> the biggest reason nobody stores strings that way is the overhead. it requires many pointers which end up making UTF-32 look compact by comparison.

The memory overhead is certainly not as high as UTF-32 (at least for strings where UTF-8 has lower overhead than UTF-32 to start with) -- you need something like 3*log_2(n) words of overhead, but n is the number of "chunks", not bytes, and a reasonable chunk size is in the hundreds of bytes, at least. Within a chunk you revert to linear behavior, but that's not so bad; IIUC on modern CPUs linear time is not much worse than constant time when it comes to accessing short arrays.

Most strings are short, and with proper tuning they'd probably fit into one chunk anyway, so the overhead is nearly nil.

But you're right, there is some overhead -- not that this stops people from using scripting languages -- and a lot of tricky implementation, and simple solutions are often good enough.

I don't understand what you mean about Google Wave, though. A) Isn't it mostly a protocol? Where do string storage APIs come in? B) It's exactly the non-trivial uses -- where you have large, mutable strings -- that arrays and linear-time iteration don't scale to.

Iterators vs indices

Posted Mar 31, 2010 2:05 UTC (Wed) by dlang (guest, #313) [Link]

I understood the poster to mean using pointers for individual characters (how else can you do inserts at any point in the string without having to know how it's structured?)

google wave uses the jabber protocol, but in its documents it doesn't store words, it stores the letters individually, grouped together so that they can be changed individually (or so it was explained by the google rep giving the presentation I was at)

Iterators vs indices

Posted Mar 31, 2010 4:30 UTC (Wed) by njs (subscriber, #40338) [Link]

> I understood the poster to mean using pointers for individual characters (how else can you do inserts at any point in the string without having to know how it's structured?)

I'm afraid I don't understand at all. I *am* that poster, and the data structure I described can do O(log n) inserts without pointers to individual characters. Perhaps I am just explaining badly?

Iterators vs indices

Posted Mar 30, 2010 8:12 UTC (Tue) by nix (subscriber, #2304) [Link]

Didn't the GNU C++ ext/rope work in exactly this way?

Iterators vs indices

Posted Mar 30, 2010 17:06 UTC (Tue) by njs (subscriber, #40338) [Link]

No, interestingly -- they are more complicated and less like a conventional tree structure than one would think: http://www.sgi.com/tech/stl/ropeimpl.html

The most important difference is that ropes are happy -- indeed, delighted -- to store very long strings inside a single tree node when they have the chance, because their goal is just to amortize mutation operations, not to provide efficient access by semi-arbitrary index rules.

UTF-16

Posted Mar 24, 2010 17:16 UTC (Wed) by clugstj (subscriber, #4020) [Link]

What were the Java people smoking when they picked UTF-16?

UTF-16

Posted Mar 24, 2010 17:47 UTC (Wed) by Nahor (subscriber, #51583) [Link]

When Java was invented, Unicode was 16 bits only, UTF-16 didn't exist and UCS-2 was the encoding of choice. So it all made sense at the time.

Shortly after, Unicode was extended to 32 bits (Unicode 2.0).
Java became UTF-16 only with Java 5.0 (JDK/JRE 1.5) and UTF-8 was not much of an option anymore if they wanted to stay compatible with older Java code.

It's the same thing with the Unicode support in Windows by the way (except that they are still UCS-2 AFAIK)
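
The surrogate mechanism that lifted UTF-16 past the old 16-bit limit is simple arithmetic; a quick illustration in PHP (U+1D11E, the musical G clef, is a stock example of a character outside the BMP):

    <?php
    // A code point above U+FFFF is split into two 16-bit code units.
    // UCS-2 simply has no way to represent such a character.
    $cp = 0x1D11E;
    $v  = $cp - 0x10000;           // leaves a 20-bit value
    $hi = 0xD800 | ($v >> 10);     // high surrogate: top 10 bits
    $lo = 0xDC00 | ($v & 0x3FF);   // low surrogate: bottom 10 bits
    printf("U+%05X -> 0x%04X 0x%04X\n", $cp, $hi, $lo);
    // prints: U+1D11E -> 0xD834 0xDD1E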

UTF-16

Posted Mar 24, 2010 20:35 UTC (Wed) by wahern (subscriber, #37304) [Link]

Unicode was never 16 bits, nor is it now 32 bits. It was never just that simple.

Nor was UCS-2 fixed-width per se. It was fixed-width for the BMP, and only with several caveats which made it largely useless for anything but European languages and perhaps a handful of some Asian languages. (Disregarding, even, American languages and others around the world.)

Sun's and Microsoft's decision to wed themselves to, effectively, this restricted UCS-2 model (even if nominally they advertise UTF-16 now) was short-sighted, plain and simple. Not only short-sighted, but stupid. It merely replaced one problem--a multitude of character set schemes--with another--a fixed scheme that still only supported the most basic semantics of a handful of scripts from technologically advanced nations. All the work was retrospective, not at all prospective.

None of UTF-8, UTF-16, or UTF-32 makes functionally manipulating Unicode text any easier than the others. Combining characters are just the tip of the iceberg. Old notions such as word splitting are radically different. Security issues abound regarding normalization modes. All the hoops programmers jump through to "support" Unicode a la Win32 or Java don't actually provide much if any multilingual benefit beyond Latin and Cyrillic scripts, with superficial support for some others. (In other words, if you think your application is truly multilingual, you're sheltered and deluded; but the fact remains that billions of people either can't use your application, or don't expect it to work for them well anyhow--many other aspects of modern computers are biased toward European script.)

To properly support Unicode--including the South and Southeast Asian languages which have billions of readers--applications need to drop all their assumptions about string manipulation. Even simple operations such as concatenation have caveats.

What languages should do is to leave intact existing notions of "strings", which are irreconcilably burdened by decades of ingrained habit, and create new syntax, new APIs, and new libraries which are thoroughly Unicode oriented.

And none of this even considers display, which poses a slew of other issues.

UTF-16

Posted Mar 24, 2010 20:38 UTC (Wed) by flewellyn (subscriber, #5047) [Link]

>What languages should do is to leave intact existing notions of "strings", which are irreconcilably burdened by decades of ingrained habit, and create new syntax, new APIs, and new libraries which are thoroughly Unicode oriented.

So, Unicode text would have to be represented by some other data structure?

Intriguing. What would you suggest? Obviously a simple array would not cut it, that's what we have now and that's what you're arguing against. So, what would we use instead? Linked lists of entities? Trees of some kind?

UTF-16

Posted Mar 24, 2010 21:53 UTC (Wed) by elanthis (guest, #6227) [Link]

Don't be facetious. That a string is internally represented as an array of bytes or codepoints is an entirely different thing than its client API being purely iterator-based. The problem he was talking about was that the client API to strings in most languages exposes them as an array of characters even though internally a string _isn't_ an array of characters; it's an array of bytes or codepoints. The accessors for strings really don't make much sense as an array, either, because an array is something indexable by offset, which makes no sense: what exactly is the offset supposed to represent? Bytes? Codepoints? Characters? You can provide a firm answer to this question, of course, but the answer is only going to be the one the user wants some of the time. A purely iterator-based approach would allow the client to ask for a byte iterator, a codepoint iterator, or even a character iterator, and get exactly the behaviour they expect/need.

UTF-16

Posted Mar 24, 2010 21:56 UTC (Wed) by flewellyn (subscriber, #5047) [Link]

I was not being facetious. The idea of representing strings using a different data structure was actually something I was thinking was an interesting idea.

But, you're right, there's no need for internal and external representations to match. At least, on some levels. At some level you do have to be able to get at the array of bytes directly.

UTF-16

Posted Mar 24, 2010 21:07 UTC (Wed) by ikm (subscriber, #493) [Link]

> Nor was UCS-2 fixed-width per se. It was fixed-width for the BMP

Would you elaborate? To the best of my knowledge, UCS-2 has always been a fixed-width BMP representation. Even its name says so.

> useless for anything but European languages and perhaps a handful of some Asian languages

Again, what? Here's a list of scripts BMP supports: http://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Bas.... That's the majority of all scripts in Unicode. Pretty much no one really needs other planes.

> billions of people either can't use your application, or don't expect it to work for them well

Billions? Don't you think you exaggerate a lot? Unicode has a lot of quirks, but it works for the majority of people just fine in most scenarios. In your interpretation, though, it feels just like the opposite: Unicode is impossible and never works, half the world just can't use it at all. That's not true.

UTF-16

Posted Mar 25, 2010 5:20 UTC (Thu) by wahern (subscriber, #37304) [Link]

> Would you elaborate? To the best of my knowledge, UCS-2 has always been a fixed-width BMP representation. Even its name says so.

The problem here is conflating low-level codepoints with textual semantics. There are more dimensions than just bare codepoints and combining codepoints (where w/ the BMP during the heyday of UCS-2 you could always find, I think, a single codepoint alternative for any combining pair).

Take Indic scripts for example. You could have multiple codepoints which, while not technically combining characters, require that certain rules be followed, together with other semantic forms collectively called graphemes and grapheme clusters. If you split a "string" between these graphemes and stitch them back together ad hoc, you may end up w/ a nonsense segment that might not even display properly. In this sense, the fixed width of the codepoints is illusory when you're attempting to logically manipulate the text. Unicode does more than define codepoints; it also defines a slew of semantic devices intended to abstract text manipulation, and these are at a much higher level than slicing and dicing an array of codepoints. (As noted elsethread, these can be provided as iterators and special string operators.)
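
The gap between code points and what the user sees is easy to demonstrate in today's PHP (assuming the mbstring and intl extensions); three different answers to "how long is this string?":

    <?php
    // "é" built from U+0065 (e) plus U+0301 (combining acute accent):
    $s = "e\u{0301}";
    var_dump(strlen($s));             // int(3) -- bytes
    var_dump(mb_strlen($s, 'UTF-8')); // int(2) -- code points
    var_dump(grapheme_strlen($s));    // int(1) -- user-perceived characters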

It's been a few years since I've worked with Chinese or Japanese scripts, but there are similar issues. Though because supporting those scripts is a far more common exercise for American and European programmers, there are common tricks employed--lots of if's and then's littering legacy code--to do the right things in common cases to silence the Q/A department fielding calls from sales reps in Asia.

> Again, what? Here's a list of scripts BMP supports: http://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Bas.... That's the majority of all scripts in Unicode. Pretty much no one really needs other planes.

"Majority" and "pretty much". That's the retrospective problem that still afflicts the technical community today. Almost all classic Chinese text (Confucius, etc.) use characters beyond the BMP. What happens when somebody wants to create a web application around these texts in their traditional script? With an imagination one could imagine all manner of new requirements just around the corner that will continue to erode the analysis that the BMP is "good enough". For example the phenomenon may reverse of simplifying scripts in various regions, begun in part because of perceived complexity viz-a-viz the limitations of modern computing hardware and/or inherited [racist] notions of cultural suitability. Mao's Simplified Chinese project may turn out to be ahistorical, like so many other modernist projects to fix thousands of years of cultural development around the world.

Of course, the whole idea that the BMP is "good enough" is nonsensical from the get-go. In order to intelligently handle graphemes and grapheme clusters you have to throw out the notion of fixed-width anything, period.

> Billions? Don't you think you exaggerate a lot? Unicode has a lot of quirks, but it works for the majority of people just fine in most scenarios.

I don't think I exaggerate. First of all, as far as I know, Unicode is sufficient. But I've never actually seen an open source application--other than ICU--that does anything more effectively with Unicode than use wchar_t or similar concessions. (Pango and other libraries can handle many text flow issues, but the real problems today lie in document processing.)

I think it's a fair observation that the parsing and display of most non-European scripts exacts more of a burden than for European scripts. For example (and this is more about display than parsing) I'm sure it rarely if ever crosses the mind of a Chinese or Japanese script reader that most of the text they read online will be displayed horizontally rather than vertically. But if you go into a Chinese restaurant the signage and native menus will be vertical. Why can't computers easily replicate the clearly preferable mode? (Even if neither is wrong per se.) I think the answer is that programmers have this ingrained belief that what works for their source code editor works for everything else. Same models of text manipulation, same APIs. It's an unjustifiable intransigence. And we haven't been able to move beyond it because the solutions so far tried simply attempt to reconcile historical programming practice w/ a handful of convenient Unicode concepts. Thus this obsession with codepoints, when what should really be driving syntax and library development isn't these low-level concerns but the question of how to simplify the task of manipulating graphemes and higher-level script elements.

UTF-16

Posted Mar 25, 2010 11:04 UTC (Thu) by tetromino (subscriber, #33846) [Link]

> You could have multiple codepoints which, while not technically combining characters, require that certain rules be followed, together with other semantic forms collectively called graphemes and grapheme clusters. If you split a "string" between these graphemes and stitch them back together ad hoc, you may end up w/ a nonsense segment that might not even display properly.

I am not sure if I understand you. Would you mind giving a specific example of what you are talking about? (You will need to select HTML format to use Unicode on lwn.net.)

> Almost all classic Chinese texts (Confucius, etc.) use characters beyond the BMP.

25 centuries of linguistic evolution separate us from Confucius. Suppose you can display all the ancient characters properly; how much would that really help a modern Chinese speaker understand the meaning of the text? Does knowing the Latin alphabet help a modern French speaker understand text written in Classical Latin?

> But if you go into a Chinese restaurant the signage and native menus will be vertical. Why can't computers easily replicate the clearly preferable mode?

Are you seriously claiming that top-to-bottom is "the clearly preferable" writing mode for modern Chinese speakers because that's what you saw being used in a restaurant menu?

UTF-16

Posted Mar 26, 2010 19:31 UTC (Fri) by spacehunt (guest, #1037) [Link]

I'm a native Cantonese speaker in Hong Kong; hopefully my observations will serve as a useful reference...

> 25 centuries of linguistic evolution separate us from Confucius. Suppose you can display all the ancient characters properly; how much would that really help a modern Chinese speaker understand the meaning of the text? Does knowing the Latin alphabet help a modern French speaker understand text written in Classical Latin?

A lot of Chinese characters in modern usage are outside of the BMP:
http://www.mail-archive.com/linux-utf8@nl.linux.org/msg00...

> Are you seriously claiming that top-to-bottom is "the clearly preferable" writing mode for modern Chinese speakers because that's what you saw being used in a restaurant menu?

It may not be "clearly preferable", but it certainly is still widely used at least in Hong Kong, Taiwan and Japan. Just go to any bookstore or newspaper stand in these three places and see for yourself.

UTF-16

Posted Mar 31, 2010 4:35 UTC (Wed) by j16sdiz (subscriber, #57302) [Link]

> > Are you seriously claiming that top-to-bottom is "the clearly preferable" writing mode for modern Chinese speakers because that's what you saw being used in a restaurant menu?

> It may not be "clearly preferable", but it certainly is still widely used at least in Hong Kong, Taiwan and Japan. Just go to any bookstore or newspaper stand in these three places and see for yourself.

As a Chinese living in Hong Kong I can tell you this: most of the Chinese characters are in the BMP. Some of those outside the BMP are used in Hong Kong, but they are not as important as you think -- most of them can be replaced with something in the BMP (and that's how we did things before the HKSCS standard).

And yes, you can have Confucius in the BMP. (Just like how you have the KJV bible in latin1 -- replace those long-S with th, and stuff like that.)

UTF-16

Posted Mar 25, 2010 11:41 UTC (Thu) by ikm (subscriber, #493) [Link]

You: BMP [..is..] largely useless for anything but European languages and perhaps a handful of some Asian languages

Me: Here's a list of scripts BMP supports. That's the majority of all scripts in Unicode.

You: "Majority" and "pretty much". Almost all classic Chinese text (Confucius, etc.) use characters beyond the BMP.

So, BMP without classic Chinese is largely useless? Nice. You know what, enough of this nonsense. Your position basically boils down to "if you can't support all the languages in the world, both extinct and in existence, 100.0% correct, and all the features of Unicode 5.0, too, your effort is largely useless". But it's not; the world isn't black and white.

UTF-16

Posted Mar 25, 2010 12:48 UTC (Thu) by nim-nim (subscriber, #34454) [Link]

Unicode and i18n is a long history of "but this thing people write in real life is hard, can't I simplify it to make my latin-oriented code simpler?"

And a few years later the pressure has mounted enough you do need to process the real thing, not a simplified model, and you need to do the work you didn't want to do in the first place *and* handle all the weird cases your previous shortcuts generated.

The "good enough" i18n school has been a major waste of development so far. It has proven again and again to be shortsighted

UTF-16

Posted Mar 25, 2010 13:04 UTC (Thu) by ikm (subscriber, #493) [Link]

> you do need to process the real thing, not a simplified model

You see, no, I don't. I have other stuff in my life than doing proper support for some weird stuff no one will ever actually see or use in my program.

UTF-16

Posted Mar 25, 2010 16:48 UTC (Thu) by JoeF (guest, #4486) [Link]

> You see, no, I don't. I have other stuff in my life than doing proper support for some weird stuff no one will ever actually see or use in my program.

And what exactly makes you the final authority on using UTF?
While you may have no need to represent ancient Chinese characters, others may.
Just because you don't need it doesn't mean that others won't have use for it.
Your argument smacks of "640K should be enough for anyone" (misattributed to BillG).

UTF-16

Posted Mar 25, 2010 17:06 UTC (Thu) by ikm (subscriber, #493) [Link]

Oh, no no, I was only referring to my own private decisions. The original post stated the necessity for each and every one, and I disagreed. Others decide for themselves, of course.

p.s. And btw, you *can* represent ancient Chinese with UTF... The original post was probably referring to some much more esoteric stuff.

UTF-16

Posted Mar 25, 2010 15:13 UTC (Thu) by marcH (subscriber, #57642) [Link]

> ... because of perceived complexity viz-a-viz the limitations of modern computing hardware and/or inherited [racist] notions of cultural suitability.

The invention of alphabets was a major breakthrough - because they are inherently simpler than logographies. It's not just about computers: compare how much time a child typically needs to learn one versus the other.

> I think the answer is because programmers have this ingrained belief that what works for their source code editor works for everything else. Same models of text manipulation, same APIs.

Of course yes, what did you expect? This problem will be naturally solved when countries with complicated writing systems stop waiting for the western world to solve problems only they have.

> It's an unjustifiable intransigence.

Yeah, software developers are racists since most of them do not bother about foreign languages...

UTF-16

Posted Mar 25, 2010 16:17 UTC (Thu) by tialaramex (subscriber, #21167) [Link]

The Han characters aren't logograms*, but you're right that alphabetic writing systems are better, and it isn't even hard to find native Chinese linguists who agree. Some believe that the future of Chinese as a spoken language (or several languages, depending on your politics) depends now on accepting that it will be written in an existing alphabet - probably the Latin alphabet - and the Han characters will in a few generations become a historical curiosity, like runes. For what it's worth, your example of teaching children is very apt; I'm told that Chinese schools have begun using the Latin alphabet to teach (young) children in some places already.

* Nobody uses logograms. In a logographic system you have a 1:1 correspondence between graphemes and words. Invent a new word, and you need a new grapheme. Given how readily humans (everywhere) invent new words, this is quickly overwhelming. So, as with the ancient Egyptian system, the Chinese system is clearly influenced by logographic ideas, but it is not a logographic system: a native writer of Chinese can write down words of Chinese they have never seen, based on hearing them and inferring the correct "spelling", just as you might in English.

UTF-16

Posted Mar 25, 2010 19:24 UTC (Thu) by atai (subscriber, #10977) [Link]

As a Chinese, I can tell you that the Chinese characters are not going anywhere. The Chinese characters will stay and be used for Chinese writings, for the next 2000 years just as in the previous 2000 years.

The ideas that China is backwards because of the language and written characters should now go bankrupt.

UTF-16

Posted Mar 25, 2010 22:26 UTC (Thu) by nix (subscriber, #2304) [Link]

Well, after the singular invention of alphabetic writing systems by some
nameless Phoenicians, Mesopotamians and Egyptians 2500-odd years ago,
*everyone* else was backwards. It's an awesome piece of technology. (btw,
the Chinese characters have had numerous major revisions, simplifications
and complexifications over the last two millennia, the most recent being
the traditional/simplified split: any claim that the characters are
unchanged is laughable. They have certainly changed much more than the
Roman alphabet.)

UTF-16

Posted Mar 25, 2010 23:21 UTC (Thu) by atai (subscriber, #10977) [Link]

I don't know if alphabetic writing is forwards or backwards.

But if you say Chinese characters changed more than the Latin alphabet, then you are clearly wrong; the "traditional" Chinese characters have certainly stayed mostly the same since 105 BC. (What happened in Korea, Japan or Vietnam does not apply, because those are not Chinese.)

I can read Chinese writings from the 1st Century; can you use today's English spellings or words to read English writings from the 13th Century?

UTF-16

Posted Mar 26, 2010 11:16 UTC (Fri) by mpr22 (subscriber, #60784) [Link]

> I can read Chinese writings from the 1st Century; can you use today's English spellings or words to read English writings from the 13th Century?

13th Century English (i.e. what linguists call "Middle English") should be readable-for-meaning by an educated speaker of Modern English with a few marginal glosses. Reading-for-sound is almost as easy (95% of it is covered by "Don't silence the silent-in-Modern-English consonants. Pronounce the vowels like Latin / Italian / Spanish instead of like Modern English").

My understanding is that the Greek of 2000 years ago is similarly readable to fluent Modern Greek users. (The phonological issues are a bit trickier in that case.)

In both cases - and, I'm sure, in the case of classical Chinese - it would take more than just knowing the words and grammar to receive the full meaning of the text. Metaphors and cultural assumptions are tricky things.

McLuhan

Posted Apr 15, 2010 9:27 UTC (Thu) by qu1j0t3 (guest, #25786) [Link]

Anyone who wants to explore the topic of comparative alphabets further may find McLuhan's works, such as The Gutenberg Galaxy, rewarding.

UTF-16

Posted Mar 25, 2010 16:21 UTC (Thu) by paulj (subscriber, #341) [Link]

As a data-point, I believe children in China are first taught pinyin (i.e. roman-alphabet encoding of Mandarin), and learn hanzi logography building on their knowledge of pinyin.

UTF-16

Posted Mar 25, 2010 19:33 UTC (Thu) by atai (subscriber, #10977) [Link]

But pinyin is not a writing system for Chinese. It helps with teaching pronunciation.

UTF-16

Posted Mar 26, 2010 2:49 UTC (Fri) by paulj (subscriber, #341) [Link]

I have a (mainland chinese) chinese dictionary here, intended for kids,
and it is indexed by pinyin. From what I have seen of (mainland) chinese,
pinyin appears to be their primary way of writing chinese (i.e. most writing
these days is done electronically, and pinyin is used as the input
encoding).

UTF-16

Posted Mar 26, 2010 15:37 UTC (Fri) by chuckles (guest, #41964) [Link]

I'm in China right now learning Mandarin so I can comment on this. Children learn pinyin at the same time as the characters. The pinyin is printed over the characters and is used to help with pronunciation. While dictionaries targeted towards little children and foreigners are indexed by pinyin, normal dictionaries used by adults are not. Dictionaries used by adults are indexed by the radicals.
While pinyin is nice, there are no tone markers. So you have a 1 in 5 chance (4 tones plus neutral) of getting it right.
You are correct that pinyin is the input system on computers, cell phones, everything electronic, in mainland China. Taiwan has its own system. Also, the Chinese are a very proud people; characters aren't going anywhere for a LONG time.

UTF-16

Posted Mar 26, 2010 21:24 UTC (Fri) by paulj (subscriber, #341) [Link]

Yes, I gather formal pinyin has accents to differentiate the tones, but on a
computer you just enter the roman chars and the computer gives you an
appropriate list of glyphs to pick (with arrow key or number).

And yes they are. Shame there's much misunderstanding (in both directions)
though. Anyway, OT.. ;)

UTF-16

Posted Mar 25, 2010 20:23 UTC (Thu) by khc (guest, #45209) [Link]

I was raised in Hong Kong and not in mainland China, but I do have relatives in China. I've never heard that kids learn pinyin before the characters.

UTF-16

Posted Mar 26, 2010 2:44 UTC (Fri) by paulj (subscriber, #341) [Link]

This is what someone who was raised in China has told me.

UTF-16

Posted Mar 27, 2010 22:39 UTC (Sat) by man_ls (guest, #15091) [Link]

China has 1,325,639,982 inhabitants, according to Google. That is more than the whole of Europe, Russia, US, Canada and Australia combined. Even if there is a central government, we can assume a certain cultural diversity.

UTF-16

Posted Mar 28, 2010 4:22 UTC (Sun) by paulj (subscriber, #341) [Link]

Good point. :)

This was a Han chinese person from north-eastern China, i.e. someone from
the dominant cultural group in China, from the more developed part of China.
I don't know how representative their education was, but I suspect there's
at least some standardisation and uniformity.

UTF-16

Posted Dec 27, 2010 2:01 UTC (Mon) by dvdeug (subscriber, #10998) [Link]

You're writing in a language with one of the most screwed up orthographies in existence. Convince English speakers to use a reasonable orthography, and then you can start complaining about the rest of the world.

Not only that, some of these scripts you're not supporting are wonders. Just because Arabic is always written in cursive and thus needs complex script support doesn't mean that it's not an alphabet that's perfectly suited to its language, and one that is in fact easier for children to learn than the English alphabet is for English speakers.

Supporting Chinese or Arabic is like any other feature. You can refuse to support it, but if your program is important, patches or forks are going to float around to fix that. Since Debian and other distributions are committed to supporting those languages, the version of the program that ends up in the distributions will be the forked version. If there is no fork, they may just not include it. That's the cost you'll have to pay for ignoring the features they want.

UTF-16

Posted Mar 25, 2010 2:27 UTC (Thu) by Nahor (subscriber, #51583) [Link]

> Unicode was never 16-bits

http://en.wikipedia.org/wiki/Unicode#History:
Unicode could be roughly described as "wide-body ASCII" that has been stretched to 16 bits [...]
In 1996, a surrogate character mechanism was implemented in Unicode 2.0, so that Unicode was no longer restricted to 16 bits.

> nor is it now 32-bits

Indeed it's less (http://en.wikipedia.org/wiki/Unicode#Architecture_and_ter...):
Unicode defines a codespace of 1,114,112 code points in the range 0 to 10FFFF

> Nor was UCS-2 fixed-width per se

http://en.wikipedia.org/wiki/Universal_Character_Set:
UCS-2, uses a single code value [...] between 0 and 65,535 for each character, and allows exactly two bytes to represent that value.

> [...]

Unicode/UTF-16/UCS-2/... may not be perfect, but it's still better than what we had before. At least now we have a universal way of displaying foreign alphabets.
Byte arrays to represent a string may not be ideal, but they are not worse than before. Features like word splitting may not be easy, but they never were. And not all applications need such features. A lot of them just want to be able to display Asian characters on an English OS.

UTF-16

Posted Mar 24, 2010 17:49 UTC (Wed) by jrn (subscriber, #64214) [Link]

I can guess a few reasons:

- For many human languages, UTF-16 is more compact than UTF-8.
- UTF-16 is hard to confuse with ISO 8859 and other old encodings.
- Java 1.0 preceded Unicode 2.0. There was no UCS-4 back then.

HTH, Jonathan

UTF-16

Posted Mar 24, 2010 18:25 UTC (Wed) by tialaramex (subscriber, #21167) [Link]

"For many human languages, UTF-16 is more compact than UTF-8."

Care to cite any real evidence for actual text that's routinely processed by computers? It ought to be easy to instrument a system to measure this, but when the claim is made nobody seems to have collected such evidence (or, cynically, they have and it does not support their thesis).

See, people tend to make this claim by looking at a single character, say U+4E56, and they say: this is just one UTF-16 code unit, 2 bytes, but it is 3 bytes in UTF-8, so therefore using UTF-8 costs 50% extra overhead.

But wait a minute, why is the computer processing this character? Is it, perhaps, as part of a larger document? Each ASCII character used in the document costs 100% more in UTF-16 than UTF-8. It is common for documents to include U+0020 (the space character, even some languages which did not use spacing traditionally tend to introduce it when they're computerised) and line separators at least.

And then there's non-human formatting. Sure, maybe the document is written in Chinese, but if it's an HTML document, or a TeX document, or a man page, or a Postscript file, or... then it will be full of English text or latin character abbreviations created by English-using programmers.

So, I don't think the position is so overwhelmingly in favour of UTF-8 that existing, working systems should urgently migrate, but I would definitely recommend against using UTF-16 in new systems.

UTF-16

Posted Mar 25, 2010 0:04 UTC (Thu) by tetromino (subscriber, #33846) [Link]

> Care to cite any real evidence for actual text that's routinely processed by computers?

I selected 20 random articles in the Japanese Wikipedia (using http://ja.wikipedia.org/wiki/特別:おまかせ表示) and compared the size of their source code (wikitext) in UTF-8 and UTF-16. For 6 of the articles, UTF-8 was more compact; for the remaining 14, UTF-16 was better. The UTF-16 wikitext size ranged from +32% to -16% in size relative to UTF-8, depending on how much of the article's source consisted of wiki syntax, numbers, English words, filenames in the Latin alphabet, etc.

On average, UTF-16 was 2.3% more compact than UTF-8. And concatenating all the articles together, the UTF-16 version would be 3.2% more compact.

So as long as you know that your system's users will be mostly Japanese, it seems that migrating from UTF-8 to UTF-16 for string storage would be a small win.
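
The measurement is easy to reproduce on any corpus you care about; a rough PHP sketch (the filename is a placeholder):

    <?php
    $utf8  = file_get_contents('corpus.txt');   // any UTF-8 text
    $utf16 = mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');
    printf("UTF-8: %d bytes, UTF-16: %d bytes (%+.1f%% vs UTF-8)\n",
           strlen($utf8), strlen($utf16),
           100 * (strlen($utf16) - strlen($utf8)) / strlen($utf8));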

UTF-16

Posted Mar 25, 2010 14:17 UTC (Thu) by Simetrical (guest, #53439) [Link]

Unless there's surrounding markup of any kind. If you looked at the HTML
instead of the wikitext, UTF-8 would win overwhelmingly.

In general: UTF-8 is at most 50% larger than UTF-16, while UTF-16 is at most
100% larger than UTF-8; and the latter case is *much* more likely than the
former in practice. There's no reason to use UTF-16 in any general-purpose
app -- UTF-8 is clearly superior overall, even if you can come up with
special cases where UTF-16 is somewhat better.

UTF-16

Posted Mar 25, 2010 15:46 UTC (Thu) by liljencrantz (guest, #28458) [Link]

Are you saying wikitext is not markup?

UTF-16

Posted Mar 25, 2010 18:05 UTC (Thu) by Simetrical (guest, #53439) [Link]

Yes, sorry, of course wikitext is markup. But it's (by design) very lightweight markup that only accounts for a small fraction of the text in most cases. If you're using something like HTML, let alone a programming language, UTF-8 is a clear win for East Asian text. tetromino's data suggests that even in a case with a huge advantage for UTF-16 (CJK with only light markup), it's still only a few percent smaller. That's just not worth it, when UTF-8 is *much* smaller in so many real-world cases.

UTF-16

Posted Mar 25, 2010 16:03 UTC (Thu) by tialaramex (subscriber, #21167) [Link]

Thank you for actually stepping up and measuring!

Resetting PHP 6

Posted Mar 24, 2010 18:09 UTC (Wed) by wingo (guest, #26929) [Link]

There are many possible string representations.

https://trac.ccs.neu.edu/trac/larceny/wiki/StringRepresen...

Guile internally uses latin-1 when possible, and utf-32 otherwise.

Resetting PHP 6

Posted Mar 26, 2010 4:11 UTC (Fri) by spitzak (guest, #4593) [Link]

No, UTF-8 is preferable.

The truly unavoidable technical reason is that only UTF-8 can safely encode UTF-8 errors. Lossless transmission of data is a requirement for safe and bug-free computing.

Other reasons:

1. Much faster due to no need to translate on input/output

2. Able to use existing APIs to name files and parse text, rather than having to make an all-new API that takes "wide characters".

3. Often enormously simpler as error detection can be deferred until the string is interpreted.

4. If errors are preserved until display, they can be replaced with more user-friendly replacements (such as the ISO-8859-1 character for each byte). This is not safe if errors must be replaced as part of data processing.

5. High-speed byte-based search algorithms work. Tables used by these would go up in size by a factor of 256^3 or more if they were rewritten to use 16-bit units.

6. For almost all real text files UTF-8 is shorter than UTF-16. This is not a big deal but some people think it is important.
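
Point 4 can be sketched in a few lines of PHP. utf8_for_display() is an invented name, and a real implementation would be more careful about validation, but it shows the idea of keeping the raw bytes and mapping stray ones to ISO-8859-1 only at display time:

    <?php
    // Pass valid UTF-8 sequences through unchanged; render each invalid
    // byte as its ISO-8859-1 character instead of destroying it.
    function utf8_for_display(string $s): string {
        $out = '';
        for ($i = 0, $n = strlen($s); $i < $n; ) {
            // try the longest valid sequence starting at byte $i
            for ($len = min(4, $n - $i); $len >= 1; $len--) {
                $chunk = substr($s, $i, $len);
                if (mb_check_encoding($chunk, 'UTF-8')) {
                    $out .= $chunk;
                    $i   += $len;
                    continue 2;
                }
            }
            // invalid byte: show its ISO-8859-1 interpretation
            $out .= mb_convert_encoding($s[$i], 'UTF-8', 'ISO-8859-1');
            $i++;
        }
        return $out;
    }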

Resetting PHP 6

Posted Mar 26, 2010 12:29 UTC (Fri) by ringerc (subscriber, #3071) [Link]

> 1. Much faster due to no need to translate on input/output

... if the surrounding systems to which I/O is done (the file system, other library APIs, network hosts, etc) are in fact using a utf-8 encoding themselves. Alas, even on many modern systems non-utf-8 encodings are very common.

> 2. Able to use existing APIs to name files and parse text, rather than having to make an all-new API that takes "wide characters".

Not safely. The use of existing APIs with new encodings is a HUGE source of bugs in software. I've wasted vast amounts of time tracking down and fixing cases where software fails to do external->internal encoding conversion on input, fails to do internal->external encoding conversion on output, converts already-converted data (mangling it horribly by re-interpreting it as being in the wrong encoding), etc. Using utf-8 with existing encoding-agnostic APIs is a blight on software engineering. Any API should take a properly typed argument that's specified to ONLY hold text of a known encoding - possibly a single fixed encoding like utf-8, or possibly a bytes+encoding tuple structure. If it takes a raw "byte string" it should take a second argument specifying what encoding that data is in.

The fact that POSIX file systems and APIs don't care about "text" with known encoding, only "strings of bytes", is an incredible PITA. Ever had the fun of backing up a network share used by multiple hosts each of which like to use different text encodings? Ever then had to find and restore a single file within that share without knowing what encoding it was in and thus what the byte sequence of the file name was, only the "text" of the file name? ARGH.

"wide" APIs are painful, but they're more than worth it in the bugs and data corruption they prevent.

That's not to say that UTF-16 is better than UTF-8 or vice versa. Rather, "single known encoding enforced" is better than "it's just some bytes".

Resetting PHP 6

Posted Mar 26, 2010 14:41 UTC (Fri) by marcH (subscriber, #57642) [Link]

Yes: UTF-8 is a brilliant backward-compatibility hack that allows software developers to offload their homework to someone else later down the road. It's a truly admirable hack.

http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

Resetting PHP 6

Posted Mar 24, 2010 16:27 UTC (Wed) by amk (subscriber, #19) [Link]

"Unicode: everyone wants it, until they get it." -- Barry Warsaw of the Python developers, written when Python 1.6/2.0's Unicode support was being built.

Resetting PHP 6

Posted Mar 24, 2010 19:29 UTC (Wed) by niner (subscriber, #26151) [Link]

But Python's Unicode support is painful anyway, so those guys may not be
an adequate source of opinions...

Resetting PHP 6

Posted Mar 24, 2010 22:12 UTC (Wed) by HelloWorld (guest, #56129) [Link]

What's wrong with Python 3's Unicode support?

Resetting PHP 6

Posted Mar 26, 2010 3:51 UTC (Fri) by spitzak (guest, #4593) [Link]

Python 3 is making the EXACT SAME STUPID MISTAKE. It is going to be a disaster and the developers are too blinded to realize it.

There will be the annoying overhead of converting every bit of data on input and output. But far more important will be the fact that errors in the UTF-8 will either be lost or will cause exceptions to be thrown, producing a whole universe of ugly bugs and DOS attacks. This is going to suck bad!

Strings should be UTF-8 and string[n] should return the n'th byte in the string. That is the TRUTH and Microsoft and Python and PHP and Java and everybody else is WRONG.

But how do I get the N'th character???? You are probably sputtering this nonsense question right now, right? You need to ask yourself: where did "N" come from? I can guarantee you it came from an iterative process that looked at every character between some other point and this new point. The proper interface to look at "characters" is ITERATORS. They can move by one in each direction in O(1) time. And different iterators can return composed or decomposed characters, and if the byte is an error they can clearly return that error and also return suggested replacement values.

Unfortunately Unicode and UTF, and perhaps some kind of politically correct rule that we can only have equality and world peace if some people don't get the "better" shorter encodings, seem to turn quite intelligent programmers into complete morons. Or more like idiot savants: they are dangerously talented enough to write these horrible things and foist them on everybody.

Resetting PHP 6

Posted Mar 27, 2010 0:52 UTC (Sat) by jra (subscriber, #55261) [Link]

Hear hear. I merged in the original wide character support for Samba, done by the Japanese. Eventually we moved to a utf8-based solution (coded by tridge, naturally :-) with iterators for manipulating the strings. It's the only thing that makes sense.

Jeremy.

Resetting PHP 6

Posted Mar 31, 2010 16:50 UTC (Wed) by anton (subscriber, #25547) [Link]

Strings should be UTF-8 and string[n] should return the n'th byte in the string. That is the TRUTH and Microsoft and Python and PHP and Java and everybody else is WRONG.
I guess Forth does not belong to "everybody else", then, because we are going in the direction you suggest. The ideas are probably best explained in an early paper, but if you want to know where this went, look at the current (frozen) proposal.

Resetting PHP 6

Posted Mar 31, 2010 17:49 UTC (Wed) by spitzak (guest, #4593) [Link]

I strongly agree with Forth's solution. The PostScript paper describes exactly how easy it was to use UTF-8 if you stop panicking about "characters" and realize that they are just like words: nobody worries that you can't find the ends of words in O(1) time. The listing of the number of lines changed should be very instructive. I hope everybody who says I am wrong reads the paper.

Forth's solution appears to have an iterator return an object that they call an "xchar", which is a Unicode code point. I believe such an object could easily be extended to return "UTF-8 encoding error" as a distinct value. You could also provide different iterators that return composed or decomposed characters, or that automatically convert UTF-8 errors to their CP1252 equivalents, which (though unsafe) would remove any need to "identify the character encoding", since it would reliably recognize UTF-8, ISO-8859-1, and CP1252 automatically, even if variations are pasted together.

Resetting PHP 6

Posted Mar 24, 2010 16:44 UTC (Wed) by cdamian (subscriber, #1271) [Link]

I am glad about the decision; any decision is better than no movement at all.

I also think choosing UTF-16 was wrong: most of the code, HTML, and databases out in the real world are UTF-8, and choosing anything else is just silly.

I was at OSCON in 2000 when Perl 6 was announced; at that time I was working in London on a large Perl project, and I wasn't convinced it was a good idea. These kinds of rewrites and revolutions usually take too much time and destroy your current user base if you are not careful. And while the rewrite happens, the whole world keeps on moving. And if you make your users change all their code, they might as well change to a new language or system.

Some other projects trying the impossible:

- Python3 (not a rewrite, but adoption is still slow at the moment)
- typo3 5
- Doctrine 2 (small enough that it might work)
- Symfony 2 (hopefully with some migration path in the future)

Change is good, but too much change not always so.

Resetting PHP 6

Posted Mar 24, 2010 17:40 UTC (Wed) by drag (guest, #31333) [Link]

Adoption of new Python releases has always been slow. I'm mostly using Python 2.5, since that is what Debian unstable uses by default right now. Python 2.6 was first released in 2008.

Of course Python 2.6 and Python 3 are available. I just try to write things so as to minimize the effort it takes to port them to newer versions, which for me is acceptable since none of it is really very complex. Larger projects are going to have larger problems, of course.

The big difference between Perl 6, PHP 6, and Python 3 is that Python 3 is out right now, available, has a bunch of transition tools, code is somewhat backwards compatible with 2.6, and it's had a couple of stabilizing releases.

Also, it's not just about the Unicode support... Strings in Python 2.x were heavily overloaded and used for _everything_ (the major alternative being the array module, which is a wrapper around C arrays, and ends up being slower for most things than native Python data types). Being forced to use data encoded into ASCII strings for everything has grown quite painful, especially since every year less and less of your data is actually stored in ASCII strings! It's all UTF-8 or binary. Having _all_ strings be Unicode while introducing the bytes data type is a godsend for a lot of things I need Python for. It keeps things simple, clean, and fast.

Backwards compatibility

Posted Mar 27, 2010 22:53 UTC (Sat) by man_ls (guest, #15091) [Link]

The big difference between Perl 6, PHP 6, and Python 3 is that Python 3 is out right now, available, has a bunch of transition tools, code is somewhat backwards compatible with 2.6, and it's had a couple of stabilizing releases.
But "somewhat backwards compatible" is not good enough. For any non-trivial application you still need to test everything again, and probably do some coding + testing + deploying. In business settings that translates to money and pain; in volunteer projects, just pain.

Even when backwards compatibility is a requirement, as for Java (where the rare breakages are clearly signaled and known by everyone), testing time for new versions has to be allocated. With Python, migrations are a showstopper for most people unless the new version provides great advantages (which for me it doesn't). For the developers of the language itself and the runtime, the supposed benefits of not having to be backwards compatible are probably offset by having to support two or three versions indefinitely.

just kill it

Posted Mar 24, 2010 17:08 UTC (Wed) by b7j0c (subscriber, #27559) [Link]

actually "fixing" php would require creating a language syntax and a runtime incompatible with php5, in which case rasmus et al might as well cede the future to a better-designed general purpose scripting language like python or even javascript

as it stands, the php world is already fracturing. facebook, the most prominent user of php, is moving to their own c++ based hiphop toolchain...which effectively means they have forked php and can make language-level changes if they want. i presume the abysmal memory bloat and performance of the stock php runtime have induced this change. i doubt facebook devs even care what rasmus does at this point.

but let's not leave the language syntax out of this. php's enthusiastically juvenile syntax is only appropriate for the most novice coders. everyone else with any experience rapidly hits the wall with the language. i don't even want to know how the php team would fix this; their current language syntax decisions indicate they have no business designing languages.

rasmus, it's time to admit that php has reached the end of its effective life. put php5 into support mode and encourage the use of better languages with better runtimes.

just kill it

Posted Mar 24, 2010 17:45 UTC (Wed) by clump (subscriber, #27801) [Link]

rasmus, it's time to admit that php has reached the end of its effective life. put php5 into support mode and encourage the use of better languages with better runtimes.
Seems a little harsh. Can you point to any examples where what you suggest has happened?

just kill it

Posted Mar 24, 2010 18:31 UTC (Wed) by b7j0c (subscriber, #27559) [Link]

i'm not sure what you mean by examples. my basic point is that fixing php would mean effectively scrapping it. considering the debacle over basic language decisions like namespaces, it's clear that php5 cannot be "patched"

just kill it

Posted Mar 24, 2010 18:48 UTC (Wed) by ikm (subscriber, #493) [Link]

I think PHP began as a simple preprocessing language. In C/C++, when you have an .h file you want to include into each and every .c file, you use the preprocessor directive "#include". What if you want the same thing for HTML? Say, to give all pages the same header or footer? That's right: put an "include" or "require" statement inside your .html, rename it to have the .php extension - and you're done. That's the actual originating use of PHP, I presume. Of course it continued on from there, but I think it never really evolved into an actual programming language -- rather, it stayed a preprocessing one.

So basically, when you have a lot of HTML and only need simple structuring (e.g. making all pages use a single header or footer), you use PHP. If you're doing something much more complex, you'd probably be better off with some other language. Therefore I think PHP has its niche, and it isn't going anywhere.

just kill it

Posted Mar 25, 2010 4:25 UTC (Thu) by BenHutchings (subscriber, #37955) [Link]

SSI already provided 'include' functionality. People wanted more features and PHP provided them, sadly without much of an overall design.

just kill it

Posted Mar 24, 2010 19:07 UTC (Wed) by jwarnica (subscriber, #27492) [Link]

There are lots of things wrong with "PHP", but syntax is hardly one of them.

PHP syntax is far more expressive than, say, Java's. And far less annoying than the sigil hell of Perl.

It sucks that the core libraries seem to have random parameter ordering. It sucks that there are a lot of brain-dead PHP apps out there. It sucks that there are a lot of brain-dead PHP coders out there. But syntax? That is the least of PHP's problems.

just kill it

Posted Mar 24, 2010 22:07 UTC (Wed) by elanthis (guest, #6227) [Link]

The PHP syntax sucks, and I say that as someone who has worked with it
professionally since 2000. Saying that some other languages are worse is
irrelevant; by that kind of reasoning, PHP is perfect in every way because
some other language out there has surely done it all even worse.

PHP 5.3 namespace syntax? The fact that it _just_ finally got real
closures? Unknown identifiers are treated as strings? Function calls
can't be used in any general expression (e.g., foo()->bar() does not
work, but $tmp = foo(); $tmp->bar() does)? No pass-by-name function
parameters (for no logical reason, just "it's not the PHP way")? Type
hinting for objects and arrays only? Inconsistent special operator
names?
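The bareword complaint, at least, is easy to demonstrate. A minimal
sketch of the PHP 5.x behavior (this fallback was deprecated in PHP 7.2
and removed in PHP 8):

    <?php
    error_reporting(E_ALL);

    $color = RED;        // RED was never defined as a constant...
    var_dump($color);    // ...yet prints string(3) "RED", with only a notice

    define('RED', '#ff0000');
    var_dump(RED);       // string(7) "#ff0000" -- what was probably intended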

Granted, you're right, the syntax is not PHP's biggest problem. The entire
implementation is its biggest problem. The compiler is crap, buggy,
unpredictable, and can't deal with any kind of failure (no way to catch and
gracefully fail on a parse error when including another file, for example).
The language is slow. The C API is hideous. Many of those syntax warts
are all but forced by the internal implementation, which itself grew out of
a lack of abstraction between the crappy original syntax and the runtime
engine.

If I were to do a PHP 6, it would be just a cleanup of the internals, all
deprecated APIs removed, and something very clean and easy to build upon so
that 6.1, 6.2, 6.3, etc. can start delivering at a higher quality.

That is basically what PHP 5 ended up being, and look how much life that
breathed into PHP.

just kill it

Posted Mar 24, 2010 19:19 UTC (Wed) by robla (subscriber, #424) [Link]

"...their current language syntax decisions indicate they have no business designing languages.

In your world, who should decide who gets to design programming languages?

just kill it

Posted Mar 25, 2010 5:11 UTC (Thu) by b7j0c (subscriber, #27559) [Link]

anyone else

I guess I'm just not so cavalier....

Posted Mar 25, 2010 5:42 UTC (Thu) by robla (subscriber, #424) [Link]

...about trash-talking other people's hard work. PHP allowed a lot of
people to start programming who might not otherwise have ever gotten
slurped in. The apps created by those new developers (e.g. WordPress,
MediaWiki, Drupal) are running some of the
highest-traffic websites in the world. While those applications could have
been written by "real" programmers in "good" programming languages, the
fact of the matter is that the "real" programmers just didn't have the time
or inclination to write those apps. So, now we have a lot of great
applications that may not be so pretty under the hood, but they often still
do the job better than any other application out there, and may not have
otherwise existed had PHP not been around.

Want to see something kinda funny? Check out http://www.haskell.org/ .
I doubt you're going to find too many people out there who would lump
Haskell in with the "bad" languages. In fact, it could very well be what
a lot of us are writing production apps in a decade from now.
However, guess what they're using to host haskell.org...:
http://www.haskell.org/haskellwiki/Special:Version

I guess I'm just not so cavalier....

Posted Mar 25, 2010 14:22 UTC (Thu) by Simetrical (guest, #53439) [Link]

Please don't cite MediaWiki as an example of a great app written in PHP.
Pretty much all of us MediaWiki developers hate the language passionately
and wish we were using something else. (Although, not all of us agree on
what that something else should be.)

PHP is *not* easier to learn than, say, Python. That's just not true IMO.
And it's definitely not true that MediaWiki wouldn't have existed if not for
PHP. phase2 was written in Perl, IIRC, and it was a couple of people's
decision to pick PHP for phase3 -- it would have been written either way.

I guess I'm just not so cavalier....

Posted Mar 25, 2010 18:21 UTC (Thu) by robla (subscriber, #424) [Link]

Sorry, I'd forgotten about the UseModWiki days, though I'm going to bet that the anti-PHP crowd here doesn't really have a higher opinion of Perl. At any rate, Wordpress and Drupal still qualify, and there's a ton of other really useful software that falls into that category.

The thing that PHP has historically had going for it is mod_php, which was for a very long time way better than mod_perl and mod_python. It had the added benefit of being turned on by default in many contexts (e.g. cheap web hosts). That sort of availability made web programming a lot more accessible to a lot more people. That's not really a triumph of language design so much as interpreter design, but I do find it peculiar that Perl and Python couldn't beat PHP in this area, given the long headstarts they had.

Speaking of interpreter design, I think Python's Global Interpreter Lock bears every bit as much scrutiny as any of PHP's deficiencies. While I'm not interested in starting a PHP vs. Python flamewar (I happen to be programming primarily in Python these days), I think this just goes to show that there are always tradeoffs in picking a language.

I guess I'm just not so cavalier....

Posted Mar 25, 2010 19:58 UTC (Thu) by Simetrical (guest, #53439) [Link]

If PHP hadn't existed, web hosts would be using something else instead. Probably something based on Unix permissions instead of things like open_basedir and max_timeout that try to enforce permissions or resource limits in userspace, thereby prohibiting perfectly sane things like shelling out to other programs.

If web hosts used something else, web apps would be written in something else. It's that simple. Wordpress and Drupal are web apps that happen to be written in PHP, not consequences of PHP's existence. I'd bet that they're written in PHP because that's how you reach the largest audience, because that's what webhosts use.

The Python GIL is a nonissue if you're running single-threaded code. Does PHP support multithreaded execution at *all*?

I guess I'm just not so cavalier....

Posted Mar 25, 2010 21:18 UTC (Thu) by foom (subscriber, #14868) [Link]

> The Python GIL is a nonissue if you're running single-threaded code.

Not really...Python doesn't properly support multiple distinct interpreters within a process -- you
can do it, but they aren't properly isolated from each other. One important way they aren't isolated:
they all share the same GIL. So, you can't even properly run multiple single-threaded python
interpreters within a multithreaded process. It works, but only one thread can actually run at a time,
across all interpreters.

So of course that means you can't run python (efficiently) within a threaded apache.

I guess I'm just not so cavalier....

Posted Mar 25, 2010 22:31 UTC (Thu) by Simetrical (guest, #53439) [Link]

As far as I know, many/most PHP modules don't work at all with a threaded
Apache, and it's generally advised that mod_php users stick to prefork or
FastCGI. So this isn't a big advantage for PHP.

I guess I'm just not so cavalier....

Posted Mar 25, 2010 23:29 UTC (Thu) by JoeF (guest, #4486) [Link]

Does PHP support multithreaded execution at *all*?

Yes, it does.
But a large part of the third-party modules are not thread-safe, so unless you limit yourself to what you can run (and test the hell out of things), you are better off not running a multithreaded build of Apache.

I guess I'm just not so cavalier....

Posted Mar 26, 2010 13:51 UTC (Fri) by foom (subscriber, #14868) [Link]

I've kinda wondered how exaggerated this problem is. I mean, the default on Windows is threaded --
do most modules blow up by default on Windows? That seems like a problem that their authors
would want to fix.

I guess I'm just not so cavalier....

Posted Mar 25, 2010 17:12 UTC (Thu) by b7j0c (subscriber, #27559) [Link]

you're getting offtopic

is it possible to write a great app in php? yes

have some people done it? yes

do most people who have done it want to consign php to the dustbin of history? YES

just kill it

Posted Mar 25, 2010 14:25 UTC (Thu) by Simetrical (guest, #53439) [Link]

Wikipedia is one of the other biggest users of PHP, and we're probably going
to move to HipHop when it matures somewhat. Domas Mituzas, volunteer
performance engineer for Wikipedia for the last several years, also happens
to be a DBA at Facebook. Since MediaWiki is used so widely, though, we'd
still have to be compatible with stock PHP.

(This is just a guess, though, not any official statement -- I have nothing
to do with Wikipedia systems administration, only MediaWiki development.)

Resetting PHP 6

Posted Mar 24, 2010 18:35 UTC (Wed) by ikm (subscriber, #493) [Link]

UCS-4 (some call it UTF-32) allows random access to individual code points, but that access isn't really needed most of the time, and the waste is great. UTF-16 has none of the advantages of UTF-8, but all of its disadvantages. It therefore seems logical to operate almost solely on UTF-8. For that, the language should have UTF-8 string iterators, store a string's logical length, and so on.

The problem is that, to make sure no programmer errors slip through, the language would have to exclude any support for direct 8-bit manipulation of such strings. You may not, e.g., cut them at arbitrary 8-bit boundaries, and shouldn't even know their 8-bit sizes. The string would then feel like a UCS-4 string -- only without random access. This feels quite limiting, but I think it would still be the right approach. If an 8-bit string is needed, there should be ways to convert/project -- but the distinction must be stark. If, on the other hand, direct random access to UCS-4 data is required, the string could temporarily convert itself to UCS-4 under the hood, and later shrink back to UTF-8.

This would look like the right approach to me.
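For illustration, here is roughly what such an opaque string could look like in PHP itself, built on the real mbstring extension. The class and its API are invented, and the sketch assumes PHP 7.4+ (for mb_str_split and typed properties):

    <?php
    // Sketch of an "opaque UTF-8 string": callers see characters, lengths,
    // and iterators, never raw bytes; validity is checked once, at the border.
    final class Utf8String implements IteratorAggregate {
        private string $bytes;
        private int $length;          // cached logical (character) length

        public function __construct(string $bytes) {
            if (!mb_check_encoding($bytes, 'UTF-8')) {
                throw new InvalidArgumentException('not valid UTF-8');
            }
            $this->bytes = $bytes;
            $this->length = mb_strlen($bytes, 'UTF-8');
        }

        public function length(): int { return $this->length; }

        // Iteration instead of byte indexing: one character at a time.
        public function getIterator(): Generator {
            yield from mb_str_split($this->bytes, 1, 'UTF-8');
        }

        // The explicit projection back to bytes -- the "stark distinction".
        public function toBytes(): string { return $this->bytes; }
    }

    $s = new Utf8String("доброе утро");
    echo $s->length(), "\n";                  // 11 characters, not 21 bytes
    foreach ($s as $ch) { echo $ch, "\n"; }

Note that there is deliberately no $s[7]: anything positional has to go through the iterator or an explicit conversion.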

Resetting PHP 6

Posted Mar 24, 2010 19:26 UTC (Wed) by mrshiny (subscriber, #4266) [Link]

Just want to point out that a useful language will always have ways for programmers to screw up character encodings. In Java a char is distinct from a byte, and yet people do someString.getBytes("UTF-8") (to get the bytes of the UTF-8 encoding) and then proceed to treat each byte as if it represents a letter. Since you can't take away the ability to write character data into an arbitrary encoding, you can't take away this particular failure mode. Character encodings should be taught in school as an object lesson in the consequences of data storage decisions.
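For comparison, a small sketch of the PHP version of the same trap, assuming the mbstring extension is loaded (with the default locale, the byte-oriented functions only handle ASCII):

    <?php
    $word = "na\u{EF}ve";                     // "naïve": 5 characters, 6 bytes in UTF-8

    echo strlen($word), "\n";                 // 6 -- counts bytes
    echo mb_strlen($word, 'UTF-8'), "\n";     // 5 -- counts characters

    echo strtoupper($word), "\n";             // "NAïVE" -- byte-wise, skips the multibyte ï
    echo mb_strtoupper($word, 'UTF-8'), "\n"; // "NAÏVE"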

Resetting PHP 6

Posted Mar 26, 2010 4:02 UTC (Fri) by spitzak (guest, #4593) [Link]

You are seriously overestimating the damage of "cutting a string at an arbitrary byte".

First of all, the primary thing that happens in real programs is that the halves of the string get pasted back together, such as when fixed-size blocks are copied from one file to another. That does not destroy UTF-8 at all.

Second, why is breaking a "character" really such a disaster? Why are we not worried about breaking "words"? If I split an English word in half I will probably get two non-words. How can I possibly safely use a computer language that allows such things? Why, it seems hard to believe that word processors could ever be written when the computer allows this horrible ability! /sarcasm

Worrying about "breaking characters" is actually stupid, and is being used as an excuse to defend the bone-headed decision to use "wide characters".

Resetting PHP 6

Posted Mar 26, 2010 9:51 UTC (Fri) by ikm (subscriber, #493) [Link]

> First of all, the primary thing that happens in real programs is that the halves of the string get pasted back together

No, your example doesn't count -- that isn't string splitting; the resulting strings are intact there. The primary thing that happens in real programs is that they try to shorten a string, e.g. turn "A very long string" into something like "A very lo..." to squeeze it into a fixed space of 12 characters, or do similar transformations. Those transformations can't be done correctly on raw 8-bit UTF-8 strings.
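In PHP terms, the difference looks like this (a sketch assuming the mbstring extension; mb_check_encoding() just validates the bytes):

    <?php
    $title = "Очень длинная строка";   // Russian: every letter is 2 bytes in UTF-8

    $bad  = substr($title, 0, 12) . '...';              // 12 *bytes*: cuts the
                                                        // 7th letter in half
    $good = mb_substr($title, 0, 12, 'UTF-8') . '...';  // 12 *characters*

    var_dump(mb_check_encoding($bad, 'UTF-8'));   // bool(false): invalid sequence
    var_dump(mb_check_encoding($good, 'UTF-8'));  // bool(true)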

> why is breaking a "character" really such a disaster? Why are we not worried about breaking "words"?

Because you're breaking the underlying encoding of the characters, not the characters themselves. The resulting bitstream would be an invalid UTF-8 sequence. Parts of English words you split would still be rendered just fine, but damaged, invalid UTF-8 will result either in no display at all or in program/library barf. You can safely combine valid UTF-8 sequences, but you can't arbitrarily cut them and expect the result to be valid.

> Worrying about "breaking characters" is actually stupid, and is being used as an excuse to defend the bone-headed decision to use "wide characters".

As a Russian, I actually know how important this is. I've seen enough non-UTF-8-aware programs and observed enough of their horrendous problems to understand the importance of wide characters. What makes you so bold in your statements? You seem to know nothing about the topic.

Two for two!!

Posted Mar 24, 2010 19:36 UTC (Wed) by dskoll (subscriber, #1630) [Link]

We develop a commercial piece of software using (primarily) Perl and PHP. It seems we've successfully jinxed both of them! :-)

/me mutters something about "should've stuck to C...."

Resetting PHP 6

Posted Mar 25, 2010 5:54 UTC (Thu) by branden (guest, #7029) [Link]

Unless it can rationalize its BS licensing, PHP should die. I've been boycotting the language since version 4 came out (with the -- OOOH! -- "Zend Engine") and see no reason to stop.

Resetting PHP 6

Posted Mar 25, 2010 10:51 UTC (Thu) by djzort (guest, #57189) [Link]

so is perl 6 likely to be released first?

Resetting PHP 6

Posted Mar 25, 2010 16:58 UTC (Thu) by JoeF (guest, #4486) [Link]

Only after Duke Nukem Forever is released ;-)

Resetting PHP 6

Posted Mar 26, 2010 9:52 UTC (Fri) by ikm (subscriber, #493) [Link]

It was officially cancelled.

Resetting PHP 6

Posted Mar 27, 2010 12:13 UTC (Sat) by HelloWorld (guest, #56129) [Link]

It wasn't. On
http://www.shacknews.com/onearticle.x/61747
it says:
"we've never said that Duke Nukem Forever has ceased development,"

Resetting PHP 6

Posted Mar 25, 2010 17:15 UTC (Thu) by chromatic (guest, #26207) [Link]

Perl 6 exists and is available today.

Rakudo (one of several Perl 6 implementations) had its 27th release last week. Rakudo also shipped for the first time in Fedora 12.

Resetting PHP 6

Posted Mar 26, 2010 13:20 UTC (Fri) by Darkmere (subscriber, #53695) [Link]

I'll believe it when I can go to perl.org and see "current version" being something other than 5.xx.x, perhaps even perl 6.0.0.

Until then, Perl is at 5.x.

Rakudo is something different: a Perl-like language, perhaps a stepping stone for future Perl technology. But it isn't Perl 6.0 to this member of the audience. It is Rakudo. Not Perl.

Resetting PHP 6

Posted Mar 26, 2010 19:05 UTC (Fri) by chromatic (guest, #26207) [Link]

Aren't you making an ontological argument (Perl 6 doesn't exist, because it hasn't been released, because the text on a website says that Perl 5.10.1 is the current version of Perl) based on a definitional fallacy (you will believe that Rakudo is a Perl 6 implementation when the text on a specific website changes)?

Perl.com didn't mention Perl 5.10.1 for several months. Which has precedence, perl.org or perl.com? Which has precedence with regard to Perl 6, perl.org or perl6.org?

I can understand that you don't want to download or use a Perl 6 implementation such as Rakudo until it meets certain criteria, and I can understand that a big shiny Download Now button is such a criterion for certain classes of users, but I don't understand how an HTML change to add a download button somehow flips the switch from "The software does not exist as its developers claim it does" to "Oh, now it really exists," at least for a project which isn't itself solely a download button.

Resetting PHP 6

Posted Mar 27, 2010 20:06 UTC (Sat) by bronson (subscriber, #4806) [Link]

There's a difference between "available for use as an experiment" and "available for use as Perl." If perl.org doesn't link to Perl6 from its home page, then one would guess that Perl6 isn't available for general use.

And one would be right.

No need to get all insulty with big shiny download buttons.

If I had mod points, I'd give you one.

Posted Apr 15, 2010 10:45 UTC (Thu) by qu1j0t3 (guest, #25786) [Link]

Well said.

Resetting PHP 6

Posted Mar 31, 2010 8:48 UTC (Wed) by roerd (guest, #64880) [Link]

> Rakudo is something different, a Perl-like language, perhaps a steppingstone for future Perl technology. But it isn't Perl 6.0 to this member of the audience. It is Rakudo. Not Perl.

By that definition there will never be a Perl 6.0, because Perl 6 is a specification, not an implementation. Though of course you're right that at this time Rakudo can't be an implementation of Perl 6.0, because the specification is still a moving target.

Resetting PHP 6

Posted Mar 31, 2010 9:51 UTC (Wed) by Darkmere (subscriber, #53695) [Link]

Indeed, and this makes me quite sad. Because really, it feels as if Perl has slipped off the map and into la-la-land. Not in the Duke Nukem Forever style, but by setting the system up so that you cannot deliver Perl 6, because it's some immaterial beast that has yet to be able to exist.

Resetting PHP 6

Posted Mar 25, 2010 17:16 UTC (Thu) by b7j0c (subscriber, #27559) [Link]

well the development version of perl6 can be used right now

python3 and perl6 are both a ways off from full adoption by their own communities, but people are using these tools right now, and the "stable" versions of each tool (python2.x and perl5.x) are also active.

you can't compare php6 to these tools. php6 has essentially lost five years of effort

Resetting PHP 6

Posted Mar 27, 2010 0:37 UTC (Sat) by cmccabe (guest, #60281) [Link]

ln -s /usr/bin/ruby /usr/bin/php6

Problem solved; who's up for lunch?

Resetting PHP 6

Posted Jun 15, 2011 7:16 UTC (Wed) by nivas (guest, #75700) [Link]

Hi, when will it be released?


Copyright © 2010, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds