StarDict Dictionary File Format
(2013-03-01 00:14:42) | Category: *C/CPlusPlus
Format for StarDict dictionary files
Extracted from the 3.0.0 source code
StarDict homepage: http://stardict.sourceforge.net
StarDict on-line dictionary: http://www.stardict.org
1. Number and Byte-order Conventions
When you record numbers that identify sizes, offsets, etc., you should use 32-bit numbers, such as you might represent with a guint32 (a glong is only 32 bits wide on some platforms).
In order to make StarDict work on different platforms, these numbers must be in network byte order.
You can ensure the correct byte order by using the g_htonl() function when creating dictionary files.
Conversely, you should use g_ntohl() when reading dictionary files.
Strings should be encoded in UTF-8.
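The byte-order rule can be illustrated without glib. The following is a minimal sketch in plain C; put_be32() and get_be32() are hypothetical helper names, not StarDict or glib functions. They do by hand what g_htonl() and g_ntohl() achieve when a 32-bit number is written to or read from a file.

```c
#include <stdint.h>

/* Store a 32-bit value in network (big-endian) byte order. */
static void put_be32(unsigned char *out, uint32_t value)
{
    out[0] = (unsigned char)(value >> 24);
    out[1] = (unsigned char)(value >> 16);
    out[2] = (unsigned char)(value >> 8);
    out[3] = (unsigned char)value;
}

/* Read a 32-bit value stored in network (big-endian) byte order. */
static uint32_t get_be32(const unsigned char *in)
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8) | (uint32_t)in[3];
}
```

Because the helpers work byte by byte, they give the same result on little-endian and big-endian hosts.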
2. Files
Every dictionary consists of these files:
(1) somedict.ifo   <-- dictionary information file
(2) somedict.idx   <-- word index file
(3) somedict.dict  <-- word definition file
(4) somedict.syn   <-- synonym file
You can use gzip -9 to compress the .idx file.
If the .idx file is not compressed, loading is fast and saves memory; if it is compressed, the whole .idx file is loaded into memory, which makes queries faster.
You can use dictzip to compress the .dict file.
"dictzip" uses the same compression algorithm and file format as gzip, but provides a table that can be used to randomly access compressed blocks in the file.
The use of 50-64kB blocks for compression typically degrades compression by less than 10%, while maintaining acceptable random access capabilities for all data in the file.
As an added benefit, files compressed with dictzip can be decompressed with gunzip.
For more information about dictzip, refer to the DICT project: http://www.dict.org
When you create a dictionary, you should normally use .idx and .dict.dz.
StarDict will search for the .ifo file, then open the .idx or .idx.gz file and the .dict.dz or .dict file that are in the same directory and have the same base name.
3. The ".ifo" file's format
The .ifo file has the following format:
StarDict's dict ifo file
version=2.4.2
[options]
Note that the current "version" string must be "2.4.2" or "3.0.0".
If it's not, then StarDict will refuse to read the file.
If version is "3.0.0", StarDict will parse the "idxoffsetbits" option.
[options]
---------
In the example above, [options] expands to any of the following lines specifying information about the dictionary.
Each option is a keyword followed by an equal sign, then the value of that option, then a newline.
The options may appear in any order.
Note that the dictionary must have at least a bookname, a wordcount and an idxfilesize, or the load will fail.
All other information is optional.
All strings should be encoded in UTF-8.
Available options:
bookname=         // required
wordcount=        // required
synwordcount=     // required if ".syn" file exists.
idxfilesize=      // required
idxoffsetbits=    // New in 3.0.0
author=
email=
website=
description=      // You can use <br> for new line.
date=
sametypesequence= // very important.
wordcount is the count of word entries in the .idx file; it must be right.
idxfilesize is the size (in bytes) of the .idx file. Even if the .idx is compressed to a .idx.gz file, this entry must record the original .idx file's size, and it must be right too.
The .gz file doesn't contain its original size information, but knowing the original size can speed up the extraction to memory, as you don't need to call realloc() many times.
idxoffsetbits can be 64 or 32.
If "idxoffsetbits=64", the offset field of the .idx file will be 64 bits.
The "sametypesequence" option is described in further detail below.
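As a rough illustration of the keyword=value layout described above, here is a minimal sketch in plain C that checks the first line of an .ifo buffer and looks up one option. The names ifo_has_magic() and ifo_get_value() are hypothetical, this is not StarDict's own parser, and the sketch assumes lines end with a bare '\n'.

```c
#include <stddef.h>
#include <string.h>

#define IFO_MAGIC "StarDict's dict ifo file\n"

/* Return 1 if the buffer starts with the .ifo magic line, else 0. */
static int ifo_has_magic(const char *text)
{
    return strncmp(text, IFO_MAGIC, strlen(IFO_MAGIC)) == 0;
}

/*
 * Copy the value of "key=" into out (NUL-terminated, truncated to outsz).
 * Returns 1 if the key was found at the start of a line, 0 otherwise.
 */
static int ifo_get_value(const char *text, const char *key,
                         char *out, size_t outsz)
{
    size_t keylen = strlen(key);
    const char *line = text;

    if (outsz == 0)
        return 0;
    while (line != NULL && *line != '\0') {
        const char *eol = strchr(line, '\n');
        size_t linelen = eol ? (size_t)(eol - line) : strlen(line);

        if (linelen > keylen && strncmp(line, key, keylen) == 0 &&
            line[keylen] == '=') {
            size_t vallen = linelen - keylen - 1;
            if (vallen >= outsz)
                vallen = outsz - 1;
            memcpy(out, line + keylen + 1, vallen);
            out[vallen] = '\0';
            return 1;
        }
        line = eol ? eol + 1 : NULL;
    }
    return 0;
}
```

A loader built on this would reject the file when the magic line or any of bookname, wordcount and idxfilesize is missing, and would convert the numeric values with a checked conversion such as strtoul().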
***
sametypesequence
You should first familiarize yourself with the .dict file format described in the ".dict" file's format section below, so that you can understand what effect this option has on the .dict file.
If the sametypesequence option is set, it tells StarDict that each word's data in the .dict file will have the same sequence of datatypes.
In this case, we expect a .dict file that's been optimized in two ways: the type identifiers should be omitted, and the size marker for the last data entry of each word should be omitted.
Let's consider some concrete examples of the sametypesequence option.
Suppose that a dictionary records many .wav files, and so sets:
sametypesequence=W
In this case, each word's entry in the .dict file consists solely of a wav file.
In the .dict file, you would leave out the 'W' character before each entry, and you would also omit the 32-bit integer at the front of each .wav entry that would normally give the entry's length.
You can do this since the length is known from the information in the .idx file.
As another example, suppose a dictionary contains phonetic information and a meaning for each word.
The sametypesequence option for this dictionary would be:
sametypesequence=tm
Once again, you can omit the 't' and 'm' characters before each data entry in the .dict file.
In addition, you should omit the terminating '\0' for the 'm' entry for each word in the .dict file, as the length of the meaning string can be inferred from the length of the phonetic string (still indicated by a terminating '\0') and the length of the entire word entry (listed in the .idx file).
So for cases where the last data entry for each word normally requires a terminating '\0' character, you should omit this character in the .dict file.
And for cases where the last data entry for each word normally requires an initial 32-bit number giving the length of the field (such as WAV and PNG entries), you must omit this number in the dictionary.
Every dictionary should try to use the sametypesequence feature to save disk space.
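The length inference for sametypesequence=tm can be sketched in a few lines of plain C. split_tm_entry() is a hypothetical helper name; entry and size are assumed to be the bytes and the word_data_size that the .idx record gives for one word.

```c
#include <stddef.h>
#include <string.h>

/*
 * Split one .dict entry written with sametypesequence=tm.
 * On success, *phonetic points at the NUL-terminated 't' field, *meaning at
 * the unterminated 'm' field, and *meaning_len holds its length in bytes.
 * Returns 0 on success, -1 if the 't' field has no terminator in the entry.
 */
static int split_tm_entry(const char *entry, size_t size,
                          const char **phonetic,
                          const char **meaning, size_t *meaning_len)
{
    const char *nul = memchr(entry, '\0', size);

    if (nul == NULL)
        return -1;
    *phonetic = entry;
    *meaning = nul + 1;
    /* Whatever remains after the phonetic string and its '\0' is the meaning. */
    *meaning_len = size - (size_t)(nul - entry) - 1;
    return 0;
}
```

Note that the meaning is not NUL-terminated inside the file, so a caller has to copy meaning_len bytes and add its own terminator before treating it as a C string.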
***
4. The ".idx" file's format
The .idx file is just a word list.
The word list is a sorted list of word entries.
Each entry in the word list contains three fields, one after the other:
word_str;          // a utf-8 string terminated by '\0'.
word_data_offset;  // word data's offset in .dict file
word_data_size;    // word data's total size in .dict file
word_str gives the string representing this word. It's the string that is "looked up" by StarDict.
Two or more entries may have the same "word_str" with different word_data_offset and word_data_size. This may be useful for some dictionaries, but this feature is only well supported by StarDict-2.4.8 and newer.
The length of "word_str" should be less than 256. In other words, (strlen(word_str) < 256).
If the version is "3.0.0" and "idxoffsetbits=64", word_data_offset will be a 64-bit unsigned number in network byte order. Otherwise it will be a 32-bit unsigned number in network byte order.
word_data_size should be a 32-bit unsigned number in network byte order.
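To make the entry layout concrete, here is a minimal sketch of reading one .idx entry from an uncompressed buffer in plain C. struct idx_entry, get_be32() and idx_parse_entry() are hypothetical names introduced for illustration; the sketch assumes the whole .idx file is already in memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct idx_entry {
    const char *word;   /* word_str, points into the buffer */
    uint64_t offset;    /* word_data_offset */
    uint32_t size;      /* word_data_size */
};

/* Read a 32-bit number stored in network byte order. */
static uint32_t get_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Parse the entry starting at buf[pos]; offset_bits is 32 or 64.
 * Returns the position of the next entry, or 0 if the entry is truncated.
 */
static size_t idx_parse_entry(const unsigned char *buf, size_t len, size_t pos,
                              int offset_bits, struct idx_entry *out)
{
    size_t offset_bytes = (offset_bits == 64) ? 8 : 4;
    const unsigned char *nul;
    const unsigned char *p;

    if (pos >= len)
        return 0;
    nul = memchr(buf + pos, '\0', len - pos);
    if (nul == NULL)
        return 0;
    p = nul + 1;
    if ((size_t)(buf + len - p) < offset_bytes + 4)
        return 0;
    out->word = (const char *)(buf + pos);
    if (offset_bytes == 8)
        out->offset = ((uint64_t)get_be32(p) << 32) | get_be32(p + 4);
    else
        out->offset = get_be32(p);
    out->size = get_be32(p + offset_bytes);
    return (size_t)(p - buf) + offset_bytes + 4;
}
```

Calling idx_parse_entry() repeatedly, feeding each return value back in as pos, walks the whole word list; the number of entries seen should equal wordcount from the .ifo file.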
It is possible for different word_str values to have the same word_data_offset and word_data_size, so that multiple index entries point to the same definition.
But this is not recommended. When multiple words have the same definition, you may create a ".syn" file for them; see the ".syn" file's format section below.
The word list must be sorted by calling stardict_strcmp() on the "word_str" fields.
If the word list order is wrong, StarDict will fail to function correctly!
============
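gint
stardict_strcmp(const gchar *s1, const gchar *s2)
{
	/* compare case-insensitively first; break ties with strcmp() */
	gint a = g_ascii_strcasecmp(s1, s2);
	return (a == 0) ? strcmp(s1, s2) : a;
}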
============
g_ascii_strcasecmp() is a glib function:
Unlike the BSD strcasecmp() function, this only recognizes standard ASCII letters and ignores the locale, treating all non-ASCII characters as if they are not letters.
stardict_strcmp() works fine with English characters, but it does not sort characters of other locales so well. In that case, you can enable the collation feature; see the collation file's format section below.
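The resulting sort order is easy to get wrong, so here is a glib-free sketch of the comparison. It assumes that stardict_strcmp() first compares with g_ascii_strcasecmp() and falls back to strcmp() only on a tie; ascii_strcasecmp() below is a hypothetical stand-in for the glib function, and stardict_strcmp_sketch() is not StarDict's own code.

```c
#include <string.h>

/* ASCII-only lower-casing; bytes outside 'A'..'Z' are left unchanged. */
static int ascii_lower(int c)
{
    return (c >= 'A' && c <= 'Z') ? c - 'A' + 'a' : c;
}

/* Case-insensitive comparison that ignores the locale. */
static int ascii_strcasecmp(const char *s1, const char *s2)
{
    while (*s1 != '\0' && *s2 != '\0') {
        int c1 = ascii_lower((unsigned char)*s1);
        int c2 = ascii_lower((unsigned char)*s2);
        if (c1 != c2)
            return c1 - c2;
        s1++;
        s2++;
    }
    return (int)(unsigned char)*s1 - (int)(unsigned char)*s2;
}

/* Case-insensitive order first, byte order as the tie-breaker. */
static int stardict_strcmp_sketch(const char *s1, const char *s2)
{
    int a = ascii_strcasecmp(s1, s2);
    return a != 0 ? a : strcmp(s1, s2);
}
```

Under this order "apple" sorts before "Banana" (plain strcmp() would put "Banana" first), and "Apple" sorts before "apple" because the tie is broken by byte value.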
5. The ".syn" file's format
This file is optional, and note that a tree dictionary does not need this file.
Only StarDict-2.4.8 and newer support this file.
The .syn file contains information about synonyms: when you input a synonym, StarDict will look up the word it is related to.
The format is simple. Each item contains one string and one number:
synonym_word;         // a utf-8 string terminated by '\0'.
original_word_index;  // original word's index in .idx file.
Other items then follow without any separator.
When you input synonym_word, StarDict will search for original_word.
The length of "synonym_word" should be less than 256. In other words, (strlen(word) < 256).
original_word_index is a 32-bit unsigned number in network byte order.
Two or more items may have the same "synonym_word" with different original_word_index.
The items must be sorted by stardict_strcmp() on synonym_word.
6. The offset cache file's format
StarDict-2.4.8 started to support cache files. This feature can speed up loading and save memory, because the cache file is mapped with mmap().
The cache file names are .idx.oft and .syn.oft. The format is:
First a utf-8 string terminated by '\0', then many 32-bit numbers as the wordoffset index. This index is sparse, with "ENTR_PER_PAGE=32", and the numbers are not stored in network byte order.
The string must begin with these two lines, each ending with '\n':
StarDict's oft file
version=2.4.8
Then a line like this:
url=/usr/share/stardict/dic/stardict-somedict-2.4.2/somedict.idx
This line should have an ending '\n' too.
StarDict will first try to create the .oft file in the same directory as the .ifo file; if that fails, it will try to create it in ~/.cache/stardict/, where ~/.cache is obtained from g_get_user_cache_dir().
If two or more dictionaries have the same file name, StarDict will create somedict.idx.oft, somedict(2).idx.oft, somedict(3).idx.oft, etc. for them respectively, each with a different "url=" in the beginning string.
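One plausible reading of the sparse wordoffset index is that it records the byte offset, within the .idx file, of every 32nd entry. The sketch below builds such a table in plain C under that assumption; build_page_offsets() is a hypothetical name, 32-bit idx offsets are assumed, and whether StarDict also stores a trailing end-of-file offset is not covered here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ENTR_PER_PAGE 32

/*
 * Scan an uncompressed .idx buffer (32-bit offsets assumed) and record the
 * byte offset of every ENTR_PER_PAGE-th entry in pages[]. Returns the number
 * of page offsets written, or 0 if the buffer is malformed or pages[] is
 * too small.
 */
static size_t build_page_offsets(const unsigned char *idx, size_t len,
                                 uint32_t *pages, size_t max_pages)
{
    size_t pos = 0, entry = 0, npages = 0;

    while (pos < len) {
        const unsigned char *nul = memchr(idx + pos, '\0', len - pos);
        size_t next;

        if (nul == NULL)
            return 0;
        /* skip the '\0', word_data_offset (4 bytes) and word_data_size (4) */
        next = (size_t)(nul - idx) + 1 + 4 + 4;
        if (next > len)
            return 0;
        if (entry % ENTR_PER_PAGE == 0) {
            if (npages == max_pages)
                return 0;
            pages[npages++] = (uint32_t)pos;
        }
        entry++;
        pos = next;
    }
    return npages;
}
```

With such a table, a lookup can binary-search the pages first and then scan at most 32 entries inside one page, which is why the index can stay small.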
7. The collation file's format
StarDict-2.4.8 started to support collation, which sorts the word list with a collate function.
It will create collation files named .idx.clt and .syn.clt. The format is a little like the offset cache file:
First a utf-8 string terminated by '\0', then many 32-bit numbers as the index sorted by the collate function. They are not stored in network byte order.
The string must begin with:
StarDict's clt file
version=2.4.8
Then two lines like this:
url=/usr/share/stardict/dic/stardict-somedict-2.4.2/somedict.idx
func=0
The second line should have an ending '\n' too.
StarDict currently supports these collate functions:
typedef enum {
} CollateFunctions;
These UTF8_*_CI functions in fact come from MySQL.
The file is located in the same way as the .oft file.
Note that for a "somedict.idx.gz" file, the corresponding collation file is somedict.idx.clt, not somedict.idx.gz.clt, and the "url=" is somedict.idx, not somedict.idx.gz.
So after you gzip the .idx file, StarDict does not need to create the .clt file again.
8. The ".dict" file's format
The .dict file is a pure data sequence, as the offset and size of each word is recorded in the corresponding .idx file.
If the "sametypesequence" option is not used in the .ifo file, then the .dict file has fields in the following order:
==============
// ---- first data field of the first word ----
word_1_data_1_type; // a single char identifying the data type
word_1_data_1_data; // the data
// ---- second data field of the first word ----
word_1_data_2_type;
word_1_data_2_data;
...... // the number of data entries for each word is determined by
       // word_data_size in the .idx file
// ---- first data field of the second word ----
word_2_data_1_type;
word_2_data_1_data;
......
==============
It's important to note that each field in each word indicates its own length, as described below.
The number of possible fields per word is also not fixed, and is determined by simply reading data until you've read word_data_size bytes for that word.
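The "read until word_data_size bytes are consumed" rule can be sketched as a small iterator in plain C. struct dict_field and next_field() are hypothetical names; the sketch assumes an entry written without sametypesequence and relies on the rule that lower-case types are '\0'-terminated while upper-case types start with a 32-bit length in network byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct dict_field {
    char type;                  /* single-character type identifier */
    const unsigned char *data;  /* points into the entry */
    size_t len;                 /* payload length, without '\0' or size prefix */
};

/*
 * Read the field at entry[*pos] and advance *pos past it.
 * Returns 1 and fills *f, 0 at the end of the entry, -1 if malformed.
 */
static int next_field(const unsigned char *entry, size_t size, size_t *pos,
                      struct dict_field *f)
{
    size_t p = *pos;

    if (p >= size)
        return 0;
    f->type = (char)entry[p++];
    if (f->type >= 'A' && f->type <= 'Z') {
        uint32_t len;
        /* Upper case: 32-bit big-endian length, then the data. */
        if (size - p < 4)
            return -1;
        len = ((uint32_t)entry[p] << 24) | ((uint32_t)entry[p + 1] << 16) |
              ((uint32_t)entry[p + 2] << 8) | (uint32_t)entry[p + 3];
        p += 4;
        if (size - p < len)
            return -1;
        f->data = entry + p;
        f->len = len;
        p += len;
    } else if (f->type >= 'a' && f->type <= 'z') {
        /* Lower case: string terminated by '\0'. */
        const unsigned char *nul = memchr(entry + p, '\0', size - p);
        if (nul == NULL)
            return -1;
        f->data = entry + p;
        f->len = (size_t)(nul - (entry + p));
        p += f->len + 1;
    } else {
        return -1;
    }
    *pos = p;
    return 1;
}
```

A caller starts with pos set to 0 and loops while next_field() returns 1; a return of 0 with pos equal to word_data_size means the entry was consumed exactly.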
Suppose the "sametypesequence" option is used in the .ifo file, and the option is set like this:
sametypesequence=tm
Then the .dict file will look like this:
==============
word_1_data_1_data
word_1_data_2_data
word_2_data_1_data
word_2_data_2_data
......
==============
The first data entry for each word will have a terminating '\0', but the second entry will not have a terminating '\0'.
The omissions of the type chars and of the last field's size information are the optimizations required by the "sametypesequence" option described above.
If "idxoffsetbits=64", the .dict file can be bigger than 4 GB.
Because we often need to mmap() this large file, and a process on a 32-bit computer has at most 4 GB of virtual memory space, the mapping would fail. So an "idxoffsetbits=64" dictionary in fact cannot be loaded on a 32-bit machine; StarDict will simply print a warning in this case when loading.
64-bit computers should not have this limit.
9. Type identifiers
----------------
Here are the single-character type identifiers that may be used with the "sametypesequence" option in the .ifo file, or may appear in the .dict file itself if the "sametypesequence" option is not used.
Lower-case characters signify that a field's size is determined by a terminating '\0', while upper-case characters indicate that the data begins with a network byte-ordered guint32 that gives the length of the following data (NOT the whole size, which is 4 bytes bigger).
'm'
Word's pure text meaning.
The data should be a utf-8 string ending with '\0'.
'l'
Word's pure text meaning.
The data is NOT a utf-8 string, but is instead a string in locale encoding, ending with '\0'.
Sometimes using this type will save disk space, but its use is discouraged.
'g'
A utf-8 string which is marked up with the Pango text markup language.
For more information about this markup language, see the "Pango Reference Manual."
You might have it installed locally at:
file:///usr/share/gtk-doc/html/pango/PangoMarkupFormat.html
't'
English phonetic string.
The data should be a utf-8 string ending with '\0'.
Here are some utf-8 phonetic characters:
θʃŋʧðʒæıʌʊɒɛəɑɜɔˌˈːˑṃṇḷ
æɑɒʌәєŋvθðʃʒɚːɡˏˊˋ
'x'
A utf-8 string which is marked up with the xdxf language.
See http://xdxf.sourceforge.net
StarDict has these extensions:
<rref> can have a "type" attribute, which can be "image", "sound", "video" or "attach".
<kref> can have a "k" attribute.
'y'
Chinese YinBiao or Japanese KANA.
The data should be a utf-8 string ending with '\0'.
'k'
KingSoft PowerWord's data. The data is a utf-8 string ending with '\0'.
It is in XML format.
'w'
MediaWiki markup language.
See http://meta.wikimedia.org/wiki/Help:Editing#The_wiki_markup
'h'
Html codes.
'r'
Resource file list.
The content can be:
img:pic/example.jpg  // Image file
snd:apple.wav        // Sound file
vdo:film.avi         // Video file
att:file.bin         // Attachment file
More than one line is supported as a list of available files.
StarDict will find the files in the Resource Storage.
The image will be shown, the sound file will have a play button.
You can "save as" the attachment file and so on.
'W'
wav file.
The data begins with a network byte-ordered guint32 to identify the wav file's size, immediately followed by the file's content.
'P'
Picture file.
The data begins with a network byte-ordered guint32 to identify the picture file's size, immediately followed by the file's content.
'X'
This type identifier is reserved for experimental extensions.
10. Resource Storage
Resource Storage stores the external files referenced by the 'r' resource file list, the images in html code, and the images, media and other files in wiki tags.
It has two forms:
1. Direct directory and files in the "res" sub-directory.
2. The res.rifo, res.ridx and res.rdic database.
Direct files may have file name encoding problems, as Linux uses UTF-8 and Windows uses the local encoding, so you had better use ASCII file names only, or use the database to store UTF-8 file names.
The database may need to extract a file (such as a .wav file) to a temporary file, so it is not as efficient as direct files.
But the database has the advantage of compression.
You can convert between the res directory and the res database with the dir2resdatabse and resdatabase2dir tools.
StarDict will try to load the storage database first, then try the direct files form.
The format of the res.rifo file:
StarDict's storage ifo file
version=3.0.0
filecount=       // required.
idxoffsetbits=   // optional.
The format of the res.ridx file:
filename;  // A string ending with '\0'.
offset;    // 32 or 64 bits unsigned number in network byte order.
size;      // 32 bits unsigned number in network byte order.
filename can include a path too, such as "pic/example.png".
filename is case sensitive, and no two entries should have the same filename.
If "idxoffsetbits=64", then offset is 64 bits.
These three items are repeated for each entry.
The entries are sorted by the strcmp() function on the filename field.
It is possible for different filenames to have the same offset and size.
The format of the res.rdic file:
It is just the concatenation of all the resource files.
You can dictzip this file as res.rdic.dz.
Tree Dictionary
The tree dictionary support is used for information viewing, etc.
A tree dictionary contains three files: sometreedict.ifo, sometreedict.tdx.gz and sometreedict.dict.dz.
It is better to compress the .tdx file, as it is always loaded into memory.
The .ifo file has the following format:
StarDict's treedict ifo file
version=2.4.2
[options]
Available options:
bookname=      // required
tdxfilesize=   // required
wordcount=
author=
email=
website=
description=
date=
sametypesequence=
wordcount is only used for info view in the dict manage dialog, so it is not important in a tree dictionary.
The .tdx file is just the word list.
-----------
The word list is a tree list of word entries.
Each entry in the word list contains four fields, one after the other:
word_str;            // a utf-8 string terminated by '\0'.
word_data_offset;    // word data's offset in .dict file
word_data_size;      // word data's total size in .dict file
word_subentry_count; // the count of subentries of this entry
A subentry immediately follows its parent entry.
This makes the order the same as that of a tree list with all its nodes expanded, read from top to bottom.
word_data_offset, word_data_size and word_subentry_count should be 32-bit unsigned numbers in network byte order.
The .dict file's format is the same as for a normal dictionary.
More information
You can read "src/lib.cpp", "src/dictmanagedlg.cpp" and "src/tools/*.cpp" for more information.
After you have built a dictionary, you can use "stardict_verify" to verify the dictionary files.
You can find it at "src/tools/".
If you have any questions, email me. :)
Thanks to Will Robinson for cleaning up this file's English.
Hu Zheng
http://forlinux.yeah.net
2007.4.24
