dusko + unicode   181

command line - What protocol/standard is used by terminals? - Unix & Linux Stack Exchange
Properly-written Unix programs don't emit these escape sequences directly. Instead, they use one of the libraries mentioned above, telling it to "move the cursor to position (1,1)" or whatever, and the **library** emits the necessary terminal control codes based on your TERM environment variable setting. This allows the program to work properly no matter what terminal type you run it on.
xterm  terminal  commandline  cli  shell  console  x11  xorg  unix  utf8  unicode  ansi  ascii 
3 days ago by dusko
Full list of X compose key combinations
A full list of X compose key combinations can be found online (http://cgit.freedesktop.org/xorg/lib/libX11/plain/nls/en_US.UTF-8/Compose.pre), or locally in /usr/share/X11/locale/en_US.UTF-

cgit.freedesktop.org/xorg/lib/libX11/plain/nls/en_US.UTF-8/Compose.pre
x11  xorg  xterm  keyboard  utf8  unicode  unix 
4 weeks ago by dusko
keyboard - How to type special characters in Linux? - Super User
X uses something called the compose key. By pressing Compose, some key, some key... in sequence, you can input characters. I have my compose key set to Menu; to type a © (copyright symbol), I would use Menu, o, c.

A full list of X compose key combinations can be found online (http://cgit.freedesktop.org/xorg/lib/libX11/plain/nls/en_US.UTF-8/Compose.pre), or locally in /usr/share/X11/locale/en_US.UTF-8/Compose.
xterm  terminal  x11  xorg  utf8  unicode  unix  keyboard 
4 weeks ago by dusko
How do I type arbitrary unicode characters in xterm? - Unix & Linux Stack Exchange
xterm doesn't implement a hexadecimal-input feature because all of the text editors which handle UTF-8 provide their own equivalents (emacs, vim and vile, of course, even nano).

This could be useful in a shell script, but is not often mentioned.

====

As Thomas Dickey explains (https://unix.stackexchange.com/a/280468), xterm has no built-in way to input characters by codepoint. (Presumably because that's pretty bad UX.)

Vim does, though (http://vimdoc.sourceforge.net/htmldoc/mbyte.html#utf-8-typing): in insert mode, press Ctrl+V then u then 4 hex digits (or Ctrl+V then U then 8 hex digits).

For a more convenient way to input characters, use Compose, digraphs (which are Vim's built-in compose facility) (http://vimdoc.sourceforge.net/htmldoc/digraph.html#digraphs), or an input method suited to the language you're writing in.
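
For scripting, the code-point-to-character mapping behind Vim's Ctrl+V u can be reproduced in a few lines. A minimal Python sketch; the helper name char_from_hex is made up for illustration and is not part of xterm or Vim:

```python
def char_from_hex(hex_digits: str) -> str:
    """Return the character whose Unicode code point is given in hexadecimal."""
    return chr(int(hex_digits, 16))


# 4 hex digits (like Ctrl+V u) or 8 hex digits (like Ctrl+V U) both work.
copyright_sign = char_from_hex("00A9")
grinning_face = char_from_hex("0001F600")

# Print the UTF-8 bytes rather than the characters, so the output is plain ASCII.
print(copyright_sign.encode("utf-8"))   # b'\xc2\xa9'
print(grinning_face.encode("utf-8"))    # b'\xf0\x9f\x98\x80'
```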
terminal  xterm  shell  script  vi  vim  unicode  utf8 
4 weeks ago by dusko
Control character - Wikipedia
In computing and telecommunication, a control character or non-printing character (NPC) is a code point (a number) in a character set, that does not represent a written symbol. They are used as in-band signaling to cause effects other than the addition of a symbol to the text. All other characters are mainly printing, printable, or graphic characters, except perhaps for the "space" character (see ASCII printable characters).

All entries in the ASCII table below code 32 (technically the C0 control code set) are of this kind, including CR and LF used to separate lines of text. The code 127 (DEL) is also a control character. Extended ASCII sets defined by ISO 8859 added the codes 128 through 159 as control characters; this was primarily done so that if the high bit was stripped it would not change a printing character to a C0 control code, but there have been some assignments here, in particular NEL. This second set is called the C1 set.
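
The ranges described above (C0 below 32, DEL at 127, C1 at 128 through 159) can be written down as a small classifier. A sketch only; the function name control_set is an assumption for illustration:

```python
def control_set(code_point: int):
    """Classify a code point as 'C0', 'DEL', 'C1', or None (not a control code)."""
    if 0x00 <= code_point <= 0x1F:
        return "C0"
    if code_point == 0x7F:
        return "DEL"
    if 0x80 <= code_point <= 0x9F:
        return "C1"
    return None


samples = [("LF", 0x0A), ("CR", 0x0D), ("space", 0x20), ("DEL", 0x7F), ("NEL", 0x85)]
for name, cp in samples:
    print(f"{name:5} U+{cp:04X} -> {control_set(cp)}")
```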
ascii  ansi  utf8  unicode  reference 
4 weeks ago by dusko
Regex Tutorial - Non-Printable Characters
Non-Printable Characters
You can use special character sequences to put non-printable characters in your regular expression. Use \t to match a tab character (ASCII 0x09), \r for carriage return (0x0D) and \n for line feed (0x0A). More exotic non-printables are \a (bell, 0x07), \e (escape, 0x1B), and \f (form feed, 0x0C). Remember that Windows text files use \r\n to terminate lines, while UNIX text files use \n.

In some flavors, \v matches the vertical tab (ASCII 0x0B). In other flavors, \v is a shorthand that matches any vertical whitespace character. That includes the vertical tab, form feed, and all line break characters. Perl 5.10, PCRE 7.2, PHP 5.2.4, R, Delphi XE, and later versions treat it as a shorthand. Earlier versions treated it as a needlessly escaped literal v. The JGsoft flavor originally matched only the vertical tab with \v. JGsoft V2 matches any vertical whitespace with \v.

-------------------

\t = tab (0x09)
\r = carriage return (0x0D)
\n = line feed (0x0A); \r moves back to the start of the line, \n moves to the next line

Remember that Windows text files use \r\n to terminate lines, while UNIX text files use \n.
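
A short Python sketch of the \r versus \n distinction, using the standard re module to split text that mixes Windows and Unix line endings. The names LINE_BREAK and split_lines are illustrative only:

```python
import re

# \r is carriage return (0x0D); \n is line feed (0x0A).
assert ord("\r") == 0x0D and ord("\n") == 0x0A

# Match a Windows line ending first, then a bare CR, then a Unix line ending.
LINE_BREAK = re.compile(r"\r\n|\r|\n")


def split_lines(text: str) -> list:
    """Split text on any of the three line-ending conventions."""
    return LINE_BREAK.split(text)


print(split_lines("one\r\ntwo\nthree"))   # ['one', 'two', 'three']
print(re.findall(r"\t", "a\tb\tc"))       # ['\t', '\t']
```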
regex  perl  programming  unix  unicode  reference  howto  tutorial 
4 weeks ago by dusko
Unicode Characters Table
Since its conception, ASCII has gone through many evolutions and, in the 1990s, evolved into a new code called Unicode™ that handles the alphabets and symbols of many nations.

The Unicode code space is divided into 17 planes (http://ascii-table.com/unicode-planes.php). Each plane contains 65,536 code points (16-bit) and consists of several charts.

Although the Unicode standard can handle up to 1,114,112 characters (http://ascii-table.com/unicode-characters.php), it currently assigns characters to more than 96,000 of those code points.

The first 256 characters table is identical to the ISO 8859-1 character set (the ANSI table (http://ascii-table.com/ansi-table.php) is identical to the ISO 8859-1 table, except in the range 80h-9Fh where we can find C1 control characters). The first 128 characters table is hence identical to the standard ASCII table (http://ascii-table.com/ascii.php).
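
The arithmetic behind these figures: 17 planes of 65,536 code points each give 1,114,112 code points, and a code point's plane is its value divided by 65,536. A small Python sketch; plane_of is a made-up helper name:

```python
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000  # 65,536


def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains the given code point."""
    if not 0 <= code_point < PLANES * CODE_POINTS_PER_PLANE:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16  # same as integer division by 65,536


print(PLANES * CODE_POINTS_PER_PLANE)          # 1114112
print(plane_of(0x0041), plane_of(0x1F600))     # 0 1
```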
unicode  utf8  ascii  ansi  reference 
4 weeks ago by dusko
How can I add more keyboards to X server - fbxkb home page
Q4. How can I add more keyboards to X server

You can use setxkbmap to do it. For example, to load 3 keyboards - English, German and Italian (us, de, it) - and to switch between them using both shifts, run this:

setxkbmap -option grp:switch,grp:shifts_toggle,grp_led:scroll us,de,it

Alternatively, you can edit /etc/X11/XF86Config and restart the X server, but any subsequent setxkbmap will overwrite those settings. Here is a quote from my XF86Config:

Section "InputDevice"
Identifier "Keyboard0"
Driver "kbd"
Option "XkbLayout" "us,ru(phonetic)"
Option "XkbOptions" "grp:shifts_toggle,grp_led:scroll"
EndSection

And same thing with setxkbmap

setxkbmap -option grp:switch,grp:shifts_toggle,grp_led:scroll 'us,ru(phonetic)'
x11  xorg  keyboard  unicode  utf8  linux  bsd  freebsd 
5 weeks ago by dusko
Unicode Text Converter
Convert plain text (letters, sometimes numbers, sometimes punctuation) to obscure characters from Unicode. The output is fully cut-n-pastable text.
unicode 
6 weeks ago by dusko
uni WebAssembly demo
uni queries the Unicode database from the commandline.

There are four commands: identify codepoints in a string, search for codepoints, print codepoints by class, block, or range, and emoji to find emojis.

It includes full support for Unicode 12.1 (May 2019) with full Emoji support (a surprisingly large number of emoji pickers don't deal with emoji sequences very well).
unicode  reference  cli  commandline  terminal  web  webbrowser 
6 weeks ago by dusko
uni - Query the Unicode database from the commandline, with good support for emojis
uni queries the Unicode database from the commandline.

There are four commands: identify codepoints in a string, search for codepoints, print codepoints by class, block, or range, and emoji to find emojis.

It includes full support for Unicode 12.1 (May 2019) with full Emoji support (a surprisingly large number of emoji pickers don't deal with emoji sequences very well).
unicode  reference  cli  commandline  terminal 
6 weeks ago by dusko
Free Unicode Character Detector for Text Messages
Identify Unicode characters that force text messages into Unicode format.
unicode  php 
8 weeks ago by dusko
Keyboard Layouts - ASCII Table
These are the keyboard layouts used in the world:
...
unicode  keyboard  reference 
december 2019 by dusko
Fonts on Unix
FreeType is the most popular font rasteriser library on Free Unixes, it’s small, efficient, highly customizable, portable, and under two free licenses, a BSD-like one and a GPL one.

FreeType has the widest range of supported font formats in the world.

Thus it’s used in a lot of places, such as the Android operating system and the PlayStation; Apple uses it alongside its AAT in iOS and macOS, and it’s used in the OpenJDK platform.

...

There’s yet another layer if you remember correctly and that is the font layout engine.

They are FriBidi, HarfBuzz, ICU, m17n, and SIL Graphite.

Those engines are mainly used to support internationalization, as in multiple different languages with different shaping and layout rules.

Let’s only discuss one of them, HarfBuzz.

It sits on top of FreeType as an OpenType Layout engine, OpenType being the de facto font format that supports complex text rendering on Free Unix.

It’s the library that actually understands the sophisticated features inside the font. It keeps track in a sort of state machine of the glyphs that need to be drawn, rearranged, reshaped, inserted, in different situations and contexts.

...

N.B.: TrueType and OpenType are mostly identical; some fonts with a ttf extension are actually OpenType fonts.
fonts  reference  linux  unix  freebsd  bsd  unicode  typography 
november 2019 by dusko
xterm unicode font - Unix & Linux Stack Exchange
xterm supports a single font (no font-sets, which are merged at runtime). None of the TrueType fonts covers enough of CJK to be interesting. The bitmap fonts used in xterm's default resource settings are good enough for most uses.
xterm  xorg  x11  unicode  terminal 
november 2019 by dusko
Banish the � with Unifont - Banish missing glyphs with Unifont
The GNU Unifont project is amazing. It contains every Unicode glyph in one single file! I am going to argue that you should bundle it with your apps, your operating systems, and - at a pinch - your websites.
fonts  unicode  free  opensource 
november 2019 by dusko
Unicode Text Converter
Convert plain text (letters, sometimes numbers, sometimes punctuation) to obscure characters from Unicode. The output is fully cut-n-pastable text.
unicode  utf8  software  web  tool 
november 2019 by dusko
CJK fonts - xterm(235) - UTF-8
$ xterm -fn "-gnu-unifont-medium-r-normal-*-iso10646-1" -u8

$ printf "abcdefgh\r\xE7\x89\xB9\xE5\x88\xA5XY\n"
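
The escaped bytes in the printf test are the UTF-8 encoding of two CJK ideographs. Presumably the test checks that the terminal draws each one two cells wide, so that after the \r they cover "abcd" and "XY" covers "ef". A Python sketch that decodes the bytes and estimates the cell width; cell_width is a simplified, hypothetical helper and not how xterm itself computes widths:

```python
import unicodedata

raw = b"\xE7\x89\xB9\xE5\x88\xA5"   # the bytes from the printf test
text = raw.decode("utf-8")           # two characters: U+7279 and U+5225


def cell_width(ch: str) -> int:
    """2 for East Asian Wide or Fullwidth characters, else 1 (a simplification)."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


for ch in text:
    # Print code point and name only, so the output stays ASCII.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} width={cell_width(ch)}")

print("total cells:", sum(cell_width(ch) for ch in text))
```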
xterm  terminal  shell  cli  utf8  unicode 
november 2019 by dusko
CJK Type Blog - CJK Fonts, Character Sets & Encodings. All CJK.
CJK Fonts, Character Sets & Encodings. All CJK. #AllOfTheTime.
fonts  unicode 
november 2019 by dusko
Han unification - Wikipedia
Han unification is an effort by the authors of Unicode and the Universal Character Set to map multiple character sets of the so-called CJK languages into a single set of unified characters. Han characters are a common feature of written Chinese (hanzi), Japanese (kanji), and Korean (hanja).
unicode  fonts 
november 2019 by dusko
Alan Wood’s Unicode resources
Unicode and multilingual support in HTML, fonts, Web browsers and other applications.
unicode  fonts  reference  html  web  tex  latex 
november 2019 by dusko
Whitespace character - Wikipedia
(Redirected from ␣ (https://en.wikipedia.org/w/index.php?title=%E2%90%A3&redirect=no) )

"Dot space" redirects here. For the animated film, see Dot in Space.
"␣" redirects here. It is not to be confused with ⌴ (https://en.wikipedia.org/wiki/%E2%8C%B4).
fonts  unicode  typography 
october 2019 by dusko
Non-breaking space - Wikipedia
"⍽" redirects here. It is not to be confused with ␣ (https://en.wikipedia.org/wiki/%E2%90%A3) or ⌴ (https://en.wikipedia.org/wiki/%E2%8C%B4).

In word processing and digital typesetting, a non-breaking space (" ") (also called no-break space, non-breakable space (NBSP), hard space, or fixed space) is a space character that prevents an automatic line break at its position. In some formats, including HTML, it also prevents consecutive whitespace characters from collapsing into a single space.

In HTML, the common non-breaking space, which is the same width as the ordinary space character, is encoded as &nbsp; or &#160;. In Unicode, it is encoded as U+00A0.

Non-breaking space characters with other widths also exist.
fonts  tex  latex  writing  web  html  unicode  typography 
october 2019 by dusko
input encodings - How to detect the range of possible character codes? - TeX - LaTeX Stack Exchange
I learned that there are different TeX engines.
E.g., there are TeX, eTeX, pdfTeX, pdfeTeX, LuaTeX, XeTeX.

If I got it right, TeX, eTeX and pdfTeX deal with 8bit encodings and therefore with these engines the range of possible character codes (numerical values for primitives like \endlinechar, \newlinechar, \char, \lccode, \uccode, \catcode etc) is 0-255.

If I got it right, LuaTeX and XeTeX deal with utf8-encoding.

What are the ranges of possible character codes with these engines?

Is there a method for (expandable and) reliably detecting what engine is in use and thus which range of character codes is available?
unicode  utf8  tex  latex 
october 2019 by dusko
Mojibake - Wikipedia
Mojibake (文字化け; IPA: [mod͡ʑibake]) is the garbled text that is the result of text being decoded using an unintended character encoding.[1] The result is a systematic replacement of symbols with completely unrelated ones, often from a different writing system.

This display may include the generic replacement character ("�") in places where the binary representation is considered invalid. A replacement can also involve multiple consecutive symbols, as viewed in one encoding, when the same binary code constitutes one symbol in the other encoding. This is either because of differing constant length encoding (as in Asian 16-bit encodings vs European 8-bit encodings), or the use of variable length encodings (notably UTF-8 and UTF-16).

Failed rendering of glyphs due to either missing fonts or missing glyphs in a font is a different issue that is not to be confused with mojibake. Symptoms of this failed rendering include blocks with the code point displayed in hexadecimal or using the generic replacement character. Importantly, these replacements are valid and are the result of correct error handling by the software.
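
Mojibake is easy to reproduce: encode text with one encoding and decode the bytes with another. A minimal Python sketch using UTF-8 and Latin-1 (ISO 8859-1), with non-ASCII characters written as escapes so the example itself cannot be garbled:

```python
original = "caf\u00e9"   # "cafe" with an acute accent on the e

# Decode UTF-8 bytes with the wrong (unintended) encoding.
garbled = original.encode("utf-8").decode("latin-1")
print(ascii(garbled))    # 'caf\xc3\xa9': one character became two

# Because Latin-1 maps every byte to a character, this case is reversible.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)   # True
```

The reverse trip only works here because no bytes were lost; once a decoder has substituted U+FFFD, the original bytes are gone.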
unicode  utf8  shell  x11  xorg  unix  fonts 
october 2019 by dusko
highlight-bad-chars -- Atom package to highlight bad characters such as No-break space
Atom package to highlight bad characters such as No-break space ( ) and the Greek question mark (;) in your source files.

With this package you'll easily notice invisible and easy-to-confuse characters, which can be the cause for incredibly annoying syntax errors in source code.

Save yourself the burden of debugging invisible bugs for hours!
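
The same check can be scripted outside an editor. A rough Python sketch that reports the two characters named above; the SUSPECT set and the find_suspects helper are assumptions for illustration and are not taken from the Atom package:

```python
import unicodedata

# U+00A0 NO-BREAK SPACE and U+037E GREEK QUESTION MARK (looks like ';').
SUSPECT = {"\u00A0", "\u037E"}


def find_suspects(source: str):
    """Yield (line, column, character name) for each suspect character."""
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in SUSPECT:
                yield line_no, col, unicodedata.name(ch)


sample = "x\u00A0= 1\u037E\ny = 2;"
for hit in find_suspects(sample):
    print(hit)
```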
unicode  utf8  fonts 
october 2019 by dusko
Font for representing Unicode non‐printable characters
> I need a good font for representing control characters. ... it seems they don’t exist in the ɢɴᴜ unifont
A: No, they do. GNU Unifont has complete coverage as seen in Glyph Mini.
...

Again, the code editor is a challenge, but it's not the font; it's the software you use to select the control character that's the problem... For example, the same ETX, when pasted into Sublime Text 3, is rendered as the sequence of characters <0x03> and you can copy-paste it as ETX. Atom shows it as invisible, but you can get a plugin to alert you to the presence of invisible characters: https://atom.io/packages/highlight-bad-chars.
unicode  utf8  fonts 
october 2019 by dusko
Unicode - How to get the characters right?
The world would be much simpler if only one character encoding existed. That would have been clear enough for everyone. Unfortunately the truth is different. There are a lot of different character encodings, each with its own charsets and numeral mappings. So it may be obvious that a character which is converted to bytes using character encoding X may not be the same character when it is converted back from bytes using character encoding Y. That would in turn lead to confusion among humans, because they wouldn't understand the way the computer represented their natural language. Humans would see completely different characters and thus not be able to understand the "language" which is also known as the "mojibake" (http://en.wikipedia.org/wiki/Mojibake). It can also happen that humans would not see any linguistic character at all, because the numeral representation of the character in question isn't covered by the numeral mapping of the character encoding used. It's simply unknown.

How such an unknown character is displayed differs per application which handles the character. In the webbrowser world, Firefox would display an unknown character as a black diamond with a question mark in it, while Internet Explorer would display it as an empty white square with a black border. Both represent the same Unicode character though: xFFFD (http://www.fileformat.info/info/unicode/char/fffd/index.htm), which is displayed in your webbrowser as "�".
unicode  utf8  fonts  java 
october 2019 by dusko
Avoid printing unicode replacement character in Java - Stack Overflow
One of the most likely scenarios is that you are trying to read ISO-8859 data using the UTF-8 character set. If you come across a sequence of characters that is not valid UTF-8, then it will be replaced with the � symbol.
...
> In java, why does Character.toString((char) 65533) print out this symbol: � ?
A: Because exactly this particular character IS associated with the particular codepoint (http://www.fileformat.info/info/unicode/char/fffd/index.htm). It does not display a random character as you seem to think.

> I have a java program which prints these characters all over the place. It's a big program. Any ideas on what I can do to avoid this?
A: Your problem lies somewhere else. It at least boils down to this: you should set every step which involves byte-char conversions (storing text in file/db, reading text from file/db, manipulating text, transferring text, displaying text, etcetera) to use UTF-8.
...
It's a "special" character. But since it even has a font representation I'd certainly call it a character, even if it is used as substitute. There are plenty of unused code points, let's not confuse things further.
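
The scenario from the first answer, ISO-8859 data read as UTF-8, can be shown in a few lines. This sketch is in Python rather than Java, so it illustrates the decoding behaviour only and not the Java API:

```python
REPLACEMENT = "\ufffd"   # code point 65533, the replacement character

latin1_bytes = "caf\u00e9".encode("iso-8859-1")   # b'caf\xe9'

# Wrong: 0xE9 on its own is not a valid UTF-8 sequence, so it is replaced.
wrong = latin1_bytes.decode("utf-8", errors="replace")

# Right: decode with the encoding the bytes were actually written in.
right = latin1_bytes.decode("iso-8859-1")

print(ord(REPLACEMENT))              # 65533
print(ascii(wrong), ascii(right))    # 'caf\ufffd' 'caf\xe9'
```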
unicode  utf8  java  programming  fonts 
october 2019 by dusko
Character encoding - PHP: How to encode U+FFFD in order to do a replace? - Stack Overflow
Well, "�" is a "real" character in itself.
...
But those are real text characters, if I see �, I know there was an error and can't know what the real text character behind it was meant to be unless I decode properly.
...
but judging from the surrounding text it doesn't look like there was literally meant to be the character � so it must have been some kind of error in the history of that file.
unicode  utf8  fonts  php  programming 
october 2019 by dusko
What Every Programmer Absolutely, Positively Needs to Know About Encodings and Character Sets to Work With Text
This article is about encodings and character sets. An article by Joel Spolsky entitled The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (http://www.joelonsoftware.com/articles/Unicode.html) is a nice introduction to the topic and I greatly enjoy reading it every once in a while. I hesitate to refer people to it who have trouble understanding encoding problems though since, while entertaining, it is pretty light on actual technical details. I hope this article can shed some more light on what exactly an encoding is and just why all your text screws up when you least need it. This article is aimed at developers (with a focus on PHP), but any computer user should be able to benefit from it.
unicode  utf8  ascii  programming  reference 
october 2019 by dusko
Unicode Lookup - Convert special characters
Unicode Lookup is an online reference tool to lookup Unicode and HTML special characters, by name and number, and convert between their decimal, hexadecimal, and octal bases.
unicode  utf8  html  reference 
october 2019 by dusko
How to enter arbitrary Unicode code points into Latex
The easy way

There are other approaches but this is by far the easiest solution.

Make sure XeTeX is installed. It is installed in all the major Latex distributions for all the major platforms (Windows, Mac OS X, Linux). XeTeX is a reengineered version of the TeX typesetting engine. The main difference is that XeTeX reads and understands (UTF-8 encoded) Unicode text. No hacks necessary to do that anymore. It also makes any fonts installed on your system automatically accessible to you in Latex, eliminating the need to run complicated scripts and what not.
tex  latex  utf8  unicode 
october 2019 by dusko
LaTeX Source Code Listings - List of characters and their escaped versions
The Wikibook on LaTeX has a ready-made list of characters and their escaped versions that will "cover most characters in latin languages" that you can copy into your document instead of writing the entire thing yourself.
tex  latex  reference  utf8  unicode 
october 2019 by dusko
latex - Difference between XeLaTeX and pdfLaTeX - Stack Overflow
PdfTeX and XeTeX and the equivalent commands for latex are two implementations for the same purpose, as you have pointed out already. The Wikipedia articles have more details on the history and development.

One of the main differences from an operational point of view is that XeTeX has better support for fonts -- in particular you can use system fonts instead of only TeX fonts. It also has better support for non-latin character encodings.

And UTF8 (so no babel nightmares), plus including nearly any of the common graphics formats w/o the need of any prior conversion.

So, xelatex can be used on the same files that pdflatex can render?

As far as I know yes.
tex  latex  fonts  utf8  unicode 
october 2019 by dusko