[lkb] the fine system and unicode

Discussion:

Stefan Müller

2006-02-12 12:14:19 UTC

Hi,

I am trying to set up everything for developing a Chinese grammar. I managed
to talk to emacs and to Prolog, but itsdb++ complains about unknown
words, which probably has to do with some unicode
interpretation/transmission problem. Before I start to examine the issue
further, I'd like to ask if there are any unicode related points that
have to be observed in the context of itsdb++ (like switches, program
versions, locale settings, or similar things).

Thanks and best wishes

Stefan

PS: I use:

export LC_ALL=de_DE.UTF-8

--
Stefan Müller

Universität Potsdam Tel: (+49) (+331) 977-2180

http://www.cl.uni-bremen.de/~stefan/

http://www.cl.uni-bremen.de/~stefan/Babel/Interaktiv/

Francis Bond

2006-02-12 12:22:23 UTC

G'day,

there are some notes for working set-ups for Greek and Japanese at:
http://wiki.delph-in.net/moin/LkbGrammarEncodingProposal

Things depend a little on the version of Emacs and ACL (the most
recent patched ACL works slightly differently from the old one).
Also, Ben is working on some new stuff, which may or may not work for
you.

--
Francis Bond <www.kecl.ntt.co.jp/icl/mtg/members/bond/>
NTT Communication Science Laboratories | Machine Translation Research Group

Francis Bond

2006-02-12 12:23:07 UTC

Sorry, the URL I meant was: http://wiki.delph-in.net/moin/LkbEmacs

Stefan Müller

2006-02-14 15:07:45 UTC

Hi everybody,

Thank you very much for the quick responses. The problem was the
common-lisp locale. The system runs now! =:-)

Best wishes

Stefan

Stephan Oepen

2006-02-12 15:08:01 UTC

hi stefan,

Post by Stefan Müller
I am trying to set up everything for developing a Chinese grammar. I managed
to talk to emacs and to Prolog, but itsdb++ complains about unknown
words, which probably has to do with some unicode
interpretation/transmission problem. Before I start to examine the issue
further, I'd like to ask if there are any unicode related points that
have to be observed in the context of itsdb++ (like switches, program
versions, locale settings, or similar things).

getting UniCode to work in [incr tsdb()] is not much of a problem. you
should make sure that

(a) your [incr tsdb()] data files (skeletons or ASCII import files)
are all coded in UTF-8.

(b) the Lisp universe running [incr tsdb()] uses a UTF-8 locale; try
evaluating excl:*locale* to check, and then maybe use the -locale
command line option to the underlying Lisp image (ACL appears to
not choose its initial locale based on the LANG shell variable).

(c) assuming you have confirmed the above, creating a new profile and
running `Browse | Items' should display appropriately (if not, it
could also be due to font problems).
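
For the shell side of (b), here is a minimal Python sketch of how a POSIX process resolves its character set from the environment: a non-empty LC_ALL overrides LC_CTYPE, which overrides LANG. The function names are made up for illustration. This says nothing about the Lisp image itself, which per the note above may not follow these variables, so evaluating excl:*locale* remains the check on that side.

```python
import os


def effective_codeset(env=None):
    """Return the codeset part of the locale that governs character handling.

    Follows the POSIX precedence LC_ALL > LC_CTYPE > LANG, skipping variables
    that are unset or empty. Returns None if no codeset is named.
    """
    env = os.environ if env is None else env
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var, "")
        if value:
            break
    else:
        return None
    # Locale names look like language[_territory][.codeset][@modifier].
    if "." not in value:
        return None
    return value.split(".", 1)[1].split("@", 1)[0]


def is_utf8_locale(env=None):
    """True if the effective locale names a UTF-8 codeset (UTF-8, utf8, ...)."""
    codeset = effective_codeset(env)
    return codeset is not None and codeset.replace("-", "").lower() == "utf8"
```

Called without arguments, is_utf8_locale() inspects the current environment, so it can be run from the same shell that starts emacs and the Lisp image.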

i presume you are using an external processor and the [incr tsdb()] C
API? if yes, communication to and from the processing client defaults
to the active coding system in the [incr tsdb()] session, i.e. UTF-8 in
the above scenario. it is possible to force a different coding system
for client communication by virtue of the global *pvm-encoding*, e.g.

(setf *pvm-encoding* :utf-8)

in a per-user `~/.tsdbrc'. however, i would rather recommend running
all processes using the same coding system, preferably UTF-8 nowadays.
in case you are running [incr tsdb()] from within emacs(1), these two
processes too must agree on which coding system to use. the DELPH-IN
wiki and default LKB `dot.emacs' provide useful examples here.

good luck - oe

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2285 7989
+++ CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
+++ --- ***@csli.stanford.edu; ***@ifi.uio.no; ***@oepen.net ---
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Ben Waldron

2006-02-13 11:48:33 UTC

Post by Stephan Oepen
getting UniCode to work in [incr tsdb()] is not much of a problem. you
should make sure that
(a) your [incr tsdb()] data files (skeletons or ASCII import files)
are all coded in UTF-8.

You can use the 'file' command under Linux to check the encoding of files:

***@bmw-1:~/erg> file irregs.tab
irregs.tab: UTF-8 Unicode text
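
Note that 'file' guesses the encoding from the file's content. A stricter check is to decode the whole file; the following is a minimal Python sketch of that idea (the function names are hypothetical and not part of the LKB or [incr tsdb()]). Pure ASCII files pass as well, since ASCII is a subset of UTF-8.

```python
def utf8_error_in_bytes(data):
    """Return None if data is valid UTF-8, else a description of the first bad byte."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return f"invalid byte 0x{data[err.start]:02x} at offset {err.start}"
    return None


def utf8_error(path):
    """Read a file in binary mode and report the first UTF-8 decoding problem, if any."""
    with open(path, "rb") as handle:
        return utf8_error_in_bytes(handle.read())
```

For example, utf8_error("irregs.tab") returns None for a file like the one above, and a short message naming the offending byte and its offset for a file that was saved in, say, Latin-1.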

Post by Stephan Oepen
(b) the Lisp universe running [incr tsdb()] uses a UTF-8 locale; try
evaluating excl:*locale* to check, and then maybe use the -locale
command line option to the underlying Lisp image (ACL appears to
not choose its initial locale based on the LANG shell variable).

An alternative to explicitly setting -locale when starting the Lisp
image is to set the coding system as a property of the grammar files.
E.g. you can place the following in GRAMMAR/lkb/globals:

(when (lkb-version-after-p "2006/02/08 15:00:00")
  (set-coding-system utf-8))

OR if your LKB image is old (and you are running Allegro CL):

(setf excl:*locale* (excl::find-locale ".utf8"))

- Ben