unprintable characters in a javascript produced msgbox
emrefan - 02 Jul 2008 08:42 GMT I am wondering a bit about what I should see in a message box (or in a webpage, for that matter) when I include an unprintable ASCII character, say ASCII 255, in there. I experimented a bit on my PC running Traditional Chinese Windows 98SE and found that the following javascript code produced a message that seemed to have ASCII represented as "y".
alert( 'the following char is ASCII FF: \xff. So what does it look like to you?' );
I had this line in the <HEAD> section of the relevant HTML file where I put that javascript code:
<meta http-equiv='Content-Type' content='text/html; charset=Big5- HKSCS'>
But even if I try to figure that into the picture, I still can't see why it should come out as "y".
Can anybody please enlighten this thick mind?
Thomas 'PointedEars' Lahn - 02 Jul 2008 09:29 GMT > I am wondering a bit about what I should see in a message box (or in a > webpage, for that matter) when I include an unprintable ASCII character, > say ASCII 255, in there. The (7-bit US-)ASCII character set ranges from code points 0 (0x00) to 127 (0x7F). Everything else is _not_ part of (US-)ASCII code:
<http://en.wikipedia.org/wiki/ASCII>
> I experimented a bit on my PC running Traditional Chinese Windows 98SE > and found that the following javascript code produced a message that > seemed to have ASCII represented as "y". You are getting the LATIN SMALL LETTER Y WITH DIAERESIS character ("ÿ"; note that there are two dots in the ascent) because this is the character at code point U+00FF in the Unicode character set as defined in the Unicode Standard, versions 2.1 and later (a conforming implementation of ECMAScript Edition 3 must implement the latter), and at code point 255 (0xFF) of several other character sets, most notably ISO/IEC 8859-1 and Windows-1252:
<http://en.wikipedia.org/wiki/ISO/IEC_8859-1#Related_character_maps> <http://unicode.org/> <http://www.ecmascript.org/>
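As a minimal sketch of the point above, assuming a current, Unicode-aware ECMAScript engine (Node.js or a modern browser, not the 2008-era engines discussed in this thread), the following shows that `\xff` and `\u00ff` denote the same one-character string. The variable names are illustrative only.

```javascript
// Two spellings of the same string value in a Unicode-aware engine.
const fromHexEscape = '\xff';
const fromUnicodeEscape = '\u00ff';

// Length in UTF-16 code units and the numeric value of the first unit.
console.log(fromHexEscape.length);
console.log(fromHexEscape.charCodeAt(0));

// Both escapes, and String.fromCharCode, yield the same value.
console.log(fromHexEscape === fromUnicodeEscape);
console.log(fromHexEscape === String.fromCharCode(0xff));
```

This only demonstrates the string value inside the engine; how a given user agent renders that value in a message box is a separate question.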
> alert( 'the following char is ASCII FF: \xff. So what does it look like > to you?' ); Should be window.alert(...) so as to rely less on the UA's scope chain.
> I had this line in the <HEAD> section of the relevant HTML file where I > put that javascript code: [quoted text clipped - 3 lines] > But even if I try to figure that into the picture, I still can't see why > it should come out as "y". The display behavior for the code point 0xFF of the *proposed* character encoding Big5-HKSCS (which uses the Big5 Character Set with Hong Kong Supplementary Character Set), even if written properly, is undefined:
<http://en.wikipedia.org/wiki/Big5#HKSCS> <http://www.iana.org/assignments/charset-reg/>
You should also check the HTTP response message's headers for a `Content-Type' header that says differently, for it takes precedence then:
<http://www.w3.org/TR/1999/REC-html401-19991224/charset.html#h-5.2.2>
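The precedence rule described above can be sketched roughly as follows. This is a simplified illustration, not how any particular browser is implemented: `charsetOf` and `effectiveCharset` are hypothetical helpers, and real user agents apply further steps that are omitted here.

```javascript
// Hypothetical helper: pull the charset parameter out of a Content-Type value.
function charsetOf(contentType) {
  if (!contentType) return null;
  const match = /;\s*charset\s*=\s*"?([^\s;"]+)"?/i.exec(contentType);
  return match ? match[1].toLowerCase() : null;
}

// Hypothetical helper: the HTTP header's charset wins; the META
// declaration is only consulted when the header names no charset.
function effectiveCharset(httpContentType, metaContent) {
  return charsetOf(httpContentType) || charsetOf(metaContent);
}

const fromBoth = effectiveCharset(
  'text/html; charset=ISO-8859-1',
  'text/html; charset=Big5-HKSCS'
);
const fromMetaOnly = effectiveCharset(
  'text/html',
  'text/html; charset=Big5-HKSCS'
);

console.log(fromBoth);     // header value is used
console.log(fromMetaOnly); // falls back to the META value
```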
HTH
PointedEars
 Signature Anyone who slaps a 'this page is best viewed with Browser X' label on a Web page appears to be yearning for the bad old days, before the Web, when you had very little chance of reading a document written on another computer, another word processor, or another network. -- Tim Berners-Lee
Bart Van der Donck - 02 Jul 2008 09:37 GMT > I am wondering a bit about what I should see in a message box (or in a > webpage, for that matter) ... Character encoding in message boxes or web pages are two totally different things.
> ... when I include an unprintable ASCII character, say ASCII 255, > in there. Code points above 127 are not ASCII anymore. And why would it be unprintable ?
> I experimented a bit on my PC running Traditional Chinese Windows > 98SE and found that the following javascript code produced a > message that seemed to have ASCII represented as "y". Google Groups probably replaced your "y-umlaut" by "y".
> alert( 'the following char is ASCII FF: \xff. So what does it > look like to you?' ); This always looks the same for everyone, namely a y with an umlaut on. No other display is possible here.
> I had this line in the <HEAD> section of the relevant HTML file where > I put that javascript code: > > <meta http-equiv='Content-Type' content='text/html; charset=Big5- > HKSCS'> That line does not affect javascript's internal code point table (like eg. \xff). It defines which character set must be used on the web page. For displaying y-umlaut on a web page, you probably want:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
If you want both ISO-8859-1 and Chinese on a same page, I would definitely go for UTF-8.
> But even if I try to figure that into the picture, I still can't see > why it should come out as "y". Because you get what you define :-) If you say ISO-8859-1, then the browser ties code point 255 to y-umlaut. If you say ISO-8859-2, then you get an upper dot, etc. http://en.wikipedia.org/wiki/ISO_8859-1 http://en.wikipedia.org/wiki/ISO_8859-2
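To illustrate how the same byte value maps to different characters under different encodings, here is a small sketch using the `TextDecoder` API. It assumes a runtime that supports legacy single-byte encodings (modern browsers, or a Node.js build with full ICU). Note that the WHATWG Encoding Standard treats the label `iso-8859-1` as Windows-1252, which also places ÿ at 0xFF.

```javascript
// The single byte 0xFF, decoded under two different character encodings.
const byteFF = new Uint8Array([0xff]);

const asLatin1 = new TextDecoder('iso-8859-1').decode(byteFF);
const asLatin2 = new TextDecoder('iso-8859-2').decode(byteFF);

// Print each decoded character with its Unicode code point in hex.
console.log(asLatin1, asLatin1.codePointAt(0).toString(16));
console.log(asLatin2, asLatin2.codePointAt(0).toString(16));
```

This concerns decoding bytes of a document; it says nothing by itself about how an engine interprets the `\xff` escape in script source.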
Hope this helps,
-- Bart
Thomas 'PointedEars' Lahn - 02 Jul 2008 09:48 GMT >> I am wondering a bit about what I should see in a message box (or in a >> webpage, for that matter) ... > > Character encoding in message boxes or web pages are two totally > different things. Not true.
>> alert( 'the following char is ASCII FF: \xff. So what does it >> look like to you?' ); > > This always looks the same for everyone, namely a y with an umlaut on. > No other display is possible here. You are mistaken. The \x string literal escape sequence may or may not specify a Unicode character, depending on the ECMAScript implementation.
>> I had this line in the <HEAD> section of the relevant HTML file where >> I put that javascript code: [quoted text clipped - 4 lines] > That line does not affect javascript's internal code point table (like > eg. \xff). It could affect it if there was no corresponding HTTP header present that says otherwise. There is no "javascript", BTW.
> It defines which character set must be used on the web page. Unless a corresponding HTTP header is present that says otherwise. There are no "web pages", BTW.
PointedEars
 Signature var bugRiddenCrashPronePieceOfJunk = ( navigator.userAgent.indexOf('MSIE 5') != -1 && navigator.userAgent.indexOf('Mac') != -1 ) // Plone, register_function.js:16
Bart Van der Donck - 02 Jul 2008 10:28 GMT >> Character encoding in message boxes or web pages are two totally >> different things. > > Not true. It is true, because the character encoding is done at a different level. Message boxes -like in this example- are actually much easier. There can only be one possible representation. But when trying to write y-umlaut in a web page, you have a bunch of possibilities, on the top of my head, at least 10 - for which of course some are more preferred than others.
>>> alert( 'the following char is ASCII FF: \xff. So what does it >>> look like to you?' ); [quoted text clipped - 4 lines] > You are mistaken. The \x string literal escape sequence may or may not > specify a Unicode character, depending on the ECMAScript implementation. But I was only saying that alert('\xff') always shows y-umlaut in any browser. y-umlaut is the character that is tied to code point 255 in any ECMAScript implementation.
>>> <meta http-equiv='Content-Type' content='text/html; charset=Big5- >>> HKSCS'> [quoted text clipped - 4 lines] > It could affect it if there was no corresponding HTTP header present that > says otherwise. Untrue. The display of \x.. (and \u....) can never be influenced by any HTTP-header. The notation is ASCII-safe, and is passed to the javascript engine to tie it to a fixed character. I think you're mixing up the character set of a web page with javascript's consistent internal code point table.
> There is no "javascript", BTW. Is that so.
>> It defines which character set must be used on the web page. > > Unless a corresponding HTTP header is present that says otherwise. That is far from sure, and could easily vary from browser to browser. Anyway - it would be unwise to specify a charset on the web page that contradicts the HTTP header (coder's fault, not browser's fault).
> There are no "web pages", BTW. Is that so :-)
-- Bart
Thomas 'PointedEars' Lahn - 02 Jul 2008 10:49 GMT >>> Character encoding in message boxes or web pages are two totally >>> different things. [quoted text clipped - 3 lines] > Message boxes -like in this example- are actually much easier. There can > only be one possible representation. You are mistaken. It depends on the user agent which characters are supported in a message box. However, it has been observed that message boxes use the character set of their document, regardless of the encoding that the ECMAScript implementation supports. We have discussed this here before.
> But when trying to write y-umlaut in a web page, you have a bunch of > possibilities, on the top of my head, at least 10 - for which of course > some are more preferred than others. I don't think the OP wanted to write "y-umlaut" at all.
>>>> alert( 'the following char is ASCII FF: \xff. So what does it look >>>> like to you?' ); [quoted text clipped - 6 lines] > But I was only saying that alert('\xff') always shows y-umlaut in any > browser. But you are dead wrong.
> y-umlaut is the character that is tied to code point 255 in any > ECMAScript implementation. However, there are implementations that do not support Unicode.
>>>> <meta http-equiv='Content-Type' content='text/html; charset=Big5- >>>> HKSCS'> [quoted text clipped - 5 lines] > Untrue. The display of \x.. (and \u....) can never be influenced by any > HTTP-header. \x definitely can. Obviously, \u cannot.
> The notation is ASCII-safe, \x cannot be ASCII-safe as if it allows characters to be represented that are outside the range of the ASCII character set.
>> There is no "javascript", BTW. > > Is that so. Yes, there are different ECMAScript implementations (some of which don't even deserve that designation), and versions thereof.
>>> It defines which character set must be used on the web page. >> Unless a corresponding HTTP header is present that says otherwise. > > That is far from sure, and could easily vary from browser to browser. It has been observed that user agents honor the Specification in that regard. This was the reason why AddDefaultCharset was disabled in newer Apache versions.
> Anyway - it would be unwise to specify a charset on the web page that > contradicts the HTTP header (coder's fault, not browser's fault). Nowadays, no argument there.
PointedEars
 Signature Use any version of Microsoft Frontpage to create your site. (This won't prevent people from viewing your source, but no one will want to steal it.) -- from <http://www.vortex-webdesign.com/help/hidesource.htm>
Bart Van der Donck - 02 Jul 2008 11:29 GMT >>>> Character encoding in message boxes or web pages are two totally >>>> different things. [quoted text clipped - 9 lines] > that the ECMAScript implementation supports. We have discussed this here > before. That is not the point here. It is clear that the original poster was talking about alert('\xff') versus the encoding of y-umlaut in an HTML- document. In that regard the representation of \xff has nothing to do with the representation of y-umlaut outside javascript.
[...]
>> But I was only saying that alert('\xff') always shows y-umlaut in any >> browser. > > But you are dead wrong. Well, let's see then. Could you show a case where alert('\xff') does not show y-umlaut ?
>> y-umlaut is the character that is tied to code point 255 in any >> ECMAScript implementation. > > However, there are implementations that do not support Unicode. Irrelevant. y-umlaut does not need Unicode at all.
>> The display of \x.. (and \u....) can never be influenced by any >> HTTP-header. > > \x definitely can. Obviously, \u cannot. Let's see. Could you show an example where \x.. is displayed differently depending on a varying HTTP-header ?
>> The notation is ASCII-safe, > > \x cannot be ASCII-safe as if it allows characters to be represented that > are outside the range of the ASCII character set. That's why I said the *notation* is ASCII-safe. What is *represented* by that notation, is a different job; that is decided by the javascript engine.
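The distinction between the notation and what it represents can be sketched as below, assuming a modern engine with template literals and `String.raw`. The helper `isAscii` is introduced here purely for illustration.

```javascript
// String.raw keeps the escape sequence exactly as typed in the source.
const sourceText = String.raw`\xff`; // four characters: \ x f f
const value = '\xff';                // one character after evaluation

// Illustrative helper: true when every code unit is in the 0..127 range.
const isAscii = (s) => [...s].every((ch) => ch.charCodeAt(0) < 128);

console.log(sourceText.length, isAscii(sourceText));
console.log(value.length, isAscii(value));
```

The source text of the escape consists only of ASCII characters, while the resulting string value lies outside the ASCII range.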
>>> There is no "javascript", BTW. >> Is that so. > > Yes, there are different ECMAScript implementations (some of which don't > even deserve that designation), and versions thereof. That's like saying that cars don't exist, but only implementations of fuel engines.
-- Bart
Thomas 'PointedEars' Lahn - 02 Jul 2008 12:36 GMT >>>>> Character encoding in message boxes or web pages are two totally >>>>> different things. [quoted text clipped - 12 lines] > document. In that regard the representation of \xff has nothing to do > with the representation of y-umlaut outside javascript. Yes, it has.
> [...] >>> But I was only saying that alert('\xff') always shows y-umlaut in any [quoted text clipped - 3 lines] > Well, let's see then. Could you show a case where alert('\xff') does > not show y-umlaut ? Wasting my time supporting your logical fallacy? I don't think so.
Ask someone living in Bosnia, Croatia, Czech Republic, Hungary, Poland, Romania, Serbia, Slovakia, Slovenia, Malta, Estonia, Latvia, Lithuania, Greenland, Bulgaria, Belarus, Russia, Macedonia, Greece, Israel, or any other country where the character set designed for their main language does not have "y-umlaut", as you put it (you really don't know what an umlaut is), at decimal code point 255 (*except* with Unicode support), instead.
>>> y-umlaut is the character that is tied to code point 255 in any >>> ECMAScript implementation. >> However, there are implementations that do not support Unicode. > > Irrelevant. Not at all.
> y-umlaut does not need Unicode at all. True, it is also contained in ISO-8859-1. However, as ASCII does not provide this character, if the \x string escape sequence is used and Unicode support is not present, the locale encoding (or the encoding of the document/file) must be used to determine which character to display for decimal code points beyond 127. (If Unicode is not supported, "\uhhhh" is interpreted as "uhhhh".)
>>> The notation is ASCII-safe, >> \x cannot be ASCII-safe as if it allows characters to be represented that >> are outside the range of the ASCII character set. > > That's why I said the *notation* is ASCII-safe. It would seem whether that is true depends on how one defines "ASCII-safe".
> What is *represented* by that notation, is a different job; that is > decided by the javascript engine. See?
>>>> There is no "javascript", BTW. >>> Is that so. [quoted text clipped - 3 lines] > That's like saying that cars don't exist, but only implementations of > fuel engines. As a matter of fact, there are JavaScript and JScript versions that are not fully ECMAScript-compliant, and therefore do not provide Unicode support.
PointedEars
 Signature Prototype.js was written by people who don't know javascript for people who don't know javascript. People who don't know javascript are not the best source of advice on designing systems that use javascript. -- Richard Cornford, cljs, <f806at$ail$1$8300dec7@news.demon.co.uk>
Bart Van der Donck - 02 Jul 2008 13:43 GMT >> Could you show a case where alert('\xff') does >> not show y-umlaut ? [quoted text clipped - 7 lines] > not have "y-umlaut", as you put it (you really don't know what an umlaut > is), at decimal code point 255 (*except* with Unicode support), instead. You are simply wrong; all of those will display y-umlaut with alert('\xff'). You keep talking about Unicode but it has nothing to do with it. As I said, just give me one example, and I'll be immediately convinced of your point. But there is no such example.
>>>> y-umlaut is the character that is tied to code point 255 in any >>>> ECMAScript implementation. [quoted text clipped - 11 lines] > document/file) must be used to determine which character to display for > decimal code points beyond 127. You just wrote the core of your misconception. In the (nowadays highly unlikely) case that Unicode support would not be present in the browser's script engine, the locale is NOT used as lookup-table for \x. It's always the internal lookup table of the script engine. It has nothing to do with the document or its encoding !
[...]
>> That's why I said the *notation* is ASCII-safe. > It would seem whether that is true depends on how one defines "ASCII-safe". You have the nasty habit of giving a silly twist to a position that you can no longer hold. ASCII-safe is code-point 0 to 127, as you perfectly know. There is no room for other interpretations.
>> What is *represented* by that notation, is a different job; that is >> decided by the javascript engine. > > See? See what then ?
>>>>> There is no "javascript", BTW. >>>> Is that so. [quoted text clipped - 4 lines] > As a matter of fact, there are JavaScript and JScript versions that are not > fully ECMAScript-compliant, and therefore do not provide Unicode support. I'm not going to reply to arguments like "there is no javascript", "you don't know what an umlaut is", "web pages don't exist", etc. I have made my point clear enough. You already conveniently snipped my question "Could you show an example where \x.. is displayed differently depending on a varying HTTP-header", which concerned one of your basic points.
-- Bart
Thomas 'PointedEars' Lahn - 02 Jul 2008 16:11 GMT >>> Could you show a case where alert('\xff') does >>> not show y-umlaut ? [quoted text clipped - 6 lines] >> not have "y-umlaut", as you put it (you really don't know what an umlaut >> is), at decimal code point 255 (*except* with Unicode support), instead. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> You are simply wrong; all of those will display y-umlaut with > alert('\xff'). No, they won't. ISO-8859-2, for example, does not have y with diaeresis at code point 255. Neither has ISO-8859-3 or any other ISO-8859-x family encoding but ISO-8859-11. And I am not even mentioning more exotic character sets and encodings.
> You keep talking about Unicode but it has nothing to do with it. You are mistaken, and I'm tired of explaining to you why. There is *nothing* in the ECMAScript Specification that specifies what should happen with \x escape sequences if Unicode support is not there, because ECMAScript Ed. 1 already introduced Unicode support. However, as we know that there are JavaScript and JScript versions that are not ECMAScript-compliant, and that therefore don't have Unicode support, or for which the operating system's API they are running on is not Unicode-compliant, it is locale/encoding-dependent what happens with \x80 to \xFF then.
> As I said, just give me one example, and I'll be immediately > convinced of your point. But there is no such example. As I indicated, you are trying to shift the burden of proof and I will not support that.
>>>>> y-umlaut is the character that is tied to code point 255 in any >>>>> ECMAScript implementation. [quoted text clipped - 13 lines] > browser's script engine, the locale is NOT used as lookup-table for > \x. It's always the internal lookup table of the script engine. There is no "internal lookup table of the script engine"; that is a fantasy of yours. window.alert(), especially, is a host object's method whose behavior is defined by the UA's API.
> It has nothing to do with the document or its encoding ! If that were so, it would be *you* who would have to prove *that*, not vice-versa.
PointedEars
 Signature Use any version of Microsoft Frontpage to create your site. (This won't prevent people from viewing your source, but no one will want to steal it.) -- from <http://www.vortex-webdesign.com/help/hidesource.htm>
Bart Van der Donck - 02 Jul 2008 18:10 GMT >> ... >> You are simply wrong; all of those will display y-umlaut with [quoted text clipped - 4 lines] > encoding but ISO-8859-11. And I am not even mentioning more exotic > character sets and encodings. The character set doesn't matter. \x always works with Latin-1 (=ISO 8859-1), regardless of the character set of the web page.
>> You keep talking about Unicode but it has nothing to do with it. > [quoted text clipped - 6 lines] > running on is not Unicode-compliant, it is locale/encoding-dependent what > happens with \x80 to \xFF then. What you write in this paragraph is correct, except your last conclusion (after your last comma). When the ECMAScript Specification says nothing about \x outside of Unicode, you should obviously look at the javascript docs themselves.
>> As I said, just give me one example, and I'll be immediately >> convinced of your point. But there is no such example. > > As I indicated, you are trying to shift the burden of proof and I will not > support that. Here is the proof.
(1) The documentation says:
| \xXX: The character with the Latin-1 encoding specified by the | two hexadecimal digits XX between 00 and FF.
http://developer.mozilla.org/en/docs/Core_JavaScript_1.5_Guide:Literals
(2) Verify in browser without Unicode support:
I have installed Netscape 2.0 from: http://netscape.1command.com/client_archive20x.php
Give it the following code:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-2">
</head>
<body>
<script type="text/javascript">
alert('\xff')
</script>
</body>
</html>
The outcome is expected and documented as in (1). Screenshot: http://www.dotinternet.be/temp/demo.jpg
I gave you proof from the docs (developer.mozilla.org) plus a demonstration with Netscape 2. Hopefully this will convince you now.
-- Bart
Bart Van der Donck - 02 Jul 2008 14:27 GMT > if the \x string escape sequence is used and Unicode support is > not present, the locale encoding (or the encoding of the > document/file) must be used to determine which character to > display for decimal code points beyond 127. I believe many of your arguments in this thread were based on this false assumption. A small investigation:
| \xXX: The character with the Latin-1 encoding specified by the | two hexadecimal digits XX between 00 and FF. For example, \xA9 | is the hexadecimal sequence for the copyright symbol. [*]
In your view, this \xA9 would then become Latin S with Caron [**] under a Latin-2 [***] locale. This is not true, as \x uses its own independent lookup-table, namely Latin-1 [*]. \x originates from the same time as other pre-Unicode instructions like escape() or unescape(), which also use Latin-1.
| It [ISO-8859-1] is less formally referred to as Latin-1. [****]
[*] http://developer.mozilla.org/en/docs/Core_JavaScript_1.5_Guide:Literals [**] http://en.wikipedia.org/wiki/%C5%A0 [***] http://en.wikipedia.org/wiki/ISO_8859-2 [****] http://en.wikipedia.org/wiki/ISO_8859-1
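As a small sketch of the Latin-1 behaviour described above, assuming a modern engine that still provides the legacy `escape()` and `unescape()` globals (as current browsers and Node.js do), the following shows `\xA9` yielding the copyright sign and `escape()` using the two-digit `%XX` form for code units up to 0xFF. The variable names are illustrative.

```javascript
// \xA9 is the copyright sign, the same value as \u00A9.
const copyright = '\xA9';
console.log(copyright === '\u00A9');

// escape() uses %XX for code units up to 0xFF ...
const escapedFF = escape('\xff');
console.log(escapedFF); // %FF

// ... and the %uXXXX form for anything above that, e.g. S with caron.
const escapedSCaron = escape('\u0160');
console.log(escapedSCaron); // %u0160

// unescape() maps %A9 back to the copyright sign.
console.log(unescape('%A9') === copyright);
```

This reflects today's standardized behaviour; it does not by itself show what pre-Unicode engines did with `\x80` to `\xFF`.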
-- Bart
Thomas 'PointedEars' Lahn - 02 Jul 2008 12:41 GMT >>>>> Character encoding in message boxes or web pages are two totally >>>>> different things. [quoted text clipped - 12 lines] > document. In that regard the representation of \xff has nothing to do > with the representation of y-umlaut outside javascript. Yes, it has.
> [...] >>> But I was only saying that alert('\xff') always shows y-umlaut in any [quoted text clipped - 3 lines] > Well, let's see then. Could you show a case where alert('\xff') does not > show y-umlaut ? Wasting my time supporting your logical fallacy? I don't think so.
Ask someone living in Bosnia, Croatia, Czech Republic, Hungary, Poland, Romania, Serbia, Slovakia, Slovenia, Malta, Estonia, Latvia, Lithuania, Greenland, Bulgaria, Belarus, Russia, Macedonia, Greece, Israel, or any other country where the character set designed for their main language does not have "y-umlaut", as you put it (you really don't know what an umlaut is), at decimal code point 255 (*except* with Unicode support), instead.
>>> y-umlaut is the character that is tied to code point 255 in any >>> ECMAScript implementation. >> However, there are implementations that do not support Unicode. > > Irrelevant. Not at all.
> y-umlaut does not need Unicode at all. True, it is also contained in ISO-8859-1. However, as ASCII does not provide this character, if the \x string escape sequence is used and Unicode support is not present, the locale encoding (or the encoding of the document/file) must be used to determine which character to display for decimal code points beyond 127. (If Unicode is not supported, "\uhhhh" is interpreted as "uhhhh" rather than a single character.)
>>> The notation is ASCII-safe, >> \x cannot be ASCII-safe as if it allows characters to be represented >> that are outside the range of the ASCII character set. > > That's why I said the *notation* is ASCII-safe. It would seem whether that is true depends on how one defines "ASCII-safe".
> What is *represented* by that notation, is a different job; that is > decided by the javascript engine. See?
>>>> There is no "javascript", BTW. >>> Is that so. [quoted text clipped - 3 lines] > That's like saying that cars don't exist, but only implementations of > fuel engines. As a matter of fact, there are JavaScript and JScript versions that are not fully ECMAScript-compliant, and therefore do not provide Unicode support.
PointedEars
 Signature Prototype.js was written by people who don't know javascript for people who don't know javascript. People who don't know javascript are not the best source of advice on designing systems that use javascript. -- Richard Cornford, cljs, <f806at$ail$1$8300dec7@news.demon.co.uk>