Building universe

From Assothink Wiki

Goal

The goal of the "building universe" process is the creation of 3 universes:

- concepts

- percepts

- variants

and the creation of various files containing the information built:

  • ko.i
  • xx.qp.i
  • xx.uv.i

Compression

All or most files are compressed using the pack class.

They therefore carry the .pk suffix, which is omitted from the file names on this page.
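The naming convention can be sketched as follows. This is only an illustration: it assumes that the xx prefix stands for a language code (as in en.qp.wordnetfull further down) and that every file is packed, whereas the page only says "all or most". The helper name on_disk_name is hypothetical and is not part of the pack class.

```python
from typing import Optional

PACK_SUFFIX = ".pk"  # suffix stated on this page for packed files


def on_disk_name(logical_name: str, lang: Optional[str] = None) -> str:
    """Return the assumed on-disk name for a logical file name.

    Hypothetical helper: replaces a leading "xx" placeholder with a
    language code and appends the .pk suffix.
    """
    parts = logical_name.split(".")
    if parts[0] == "xx":
        if lang is None:
            raise ValueError("a language code is required for xx.* files")
        parts[0] = lang
    return ".".join(parts) + PACK_SUFFIX
```

With these assumptions, on_disk_name("xx.qp.i", "fr") gives "fr.qp.i.pk" and on_disk_name("ko.i") gives "ko.i.pk".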

Sources

The sources used for the universe building are (subject to change):

  • wiktionary
  • wordnet
  • DELA (language only source)
  • FreeBase
  • WikiData
  • BabelNet

All of these sources are far from perfect, because their underlying models are generally poor.

Here is a comparison of the capabilities inherited from the various sources:


Capability               | WordNet           | WikiData        | FreeBase        | BabelNet             | Assothink
Provides concept linking | yes               | yes (poor)      | yes             | poorly or indirectly | yes
MultiKeySet              | 1                 | 1               | 2               | (2)                  | 5...
MultiLang                | 1                 | N               | N               | N                    | 2+
Categorization           | 5 categs (NVAD    | 4 categs (NGIO) | 4 categs (NGIO) | 4 categs (NGIO)      | 10 categs (NVADQBL)
Encyclopedia connection  | no                | yes             | yes             | no                   | yes
Universe Size            | Small             | Huge            | Huge            | Huge                 | Small
Implementation           | 3.0 good but old  | ~               | ~               | ok                   | ongoing

Categs legend: N(oun) V(erb) A(djective) a(D)verb G(eo) I(ndividuals) O(pera) L(ink) B(uiltin)

Steps

Step 0a-wiktionary

Creation of files en.fr.wiktionary and fr.en.wiktionary from massive wiktionary dumps.

Step 0b-translation

Creation of consolidated translation tables, mixing input from wiktionary and the Google API (and possibly others).

The results are en.fr and fr.en, usable for various translation tasks.
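The consolidation can be pictured with a minimal sketch. It assumes that each source yields a mapping from a word to a collection of translations and that consolidating means taking the union per word. The actual layout of en.fr and fr.en is not described on this page, so the function name consolidate and the in-memory data shapes are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def consolidate(*tables: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Merge several translation tables into one (assumed union semantics)."""
    merged = defaultdict(set)
    for table in tables:
        for word, translations in table.items():
            # Strip stray whitespace so the same entry from two sources merges.
            merged[word.strip()].update(t.strip() for t in translations)
    # Sort keys and values so the result is deterministic.
    return {word: sorted(merged[word]) for word in sorted(merged)}
```

For example, merging {"cat": ["chat"]} from one source with {"cat": ["chat", "matou"], "dog": ["chien"]} from another keeps a single "chat" entry under "cat" and adds "dog".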

Step 1a-dela

Creation of the full dela files xx.qp.delafull, xx.uv.delafull and xx.qp.uv.delafull.

The result files are sorted.
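One practical consequence of sorted output is that an entry can be located by binary search. The sketch below assumes one entry per line and plain Python string ordering; the sort order actually used by the build is not stated here, and the function name line_index is hypothetical.

```python
import bisect
from typing import List, Optional


def line_index(sorted_lines: List[str], entry: str) -> Optional[int]:
    """Return the 0-based line index of entry, or None if it is absent.

    Assumes sorted_lines is sorted with Python's default string ordering.
    """
    i = bisect.bisect_left(sorted_lines, entry)
    if i < len(sorted_lines) and sorted_lines[i] == entry:
        return i
    return None
```

The same lookup would not be valid on an unsorted file, which would need a linear scan or a separate index.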

Step 1b-kob

Creation of the kob file. Purely algorithmic, with no external source. Concepts of categories QCSR are created here.

The result file is sorted.

Step 1c-wordnet

Creation of file en.qp.wordnetfull from WordNet data. This file is NOT sorted. It contains all percepts known to WordNet.

Step 1d-universe

Creation of the effective files ko-wn, xx.qp and categCnt.

The logic is based on the contract defining the resulting files:

  • All concepts defined have 1 or more q-percepts in any language.
  • All q-percepts defined relate to at least 1 concept.
  • All u-variants relate to at least 1 q-percept.

Concepts will later be identifiable by their line index in the ko-wn file.

Q-percepts will later be identifiable by their line index in the xx.qp file.

U-variants will later be identifiable by their line index in the xx.uv file.
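The contract and the line-index identification can be illustrated with a small checker. It is a sketch under stated assumptions: a single language, in-memory lists instead of the real files, and every item identified by its 0-based position in its list. The function check_contract and its argument shapes are hypothetical.

```python
from typing import List, Set


def check_contract(
    concept_to_qp: List[Set[int]],  # position = concept line index (ko-wn)
    qp_count: int,                  # number of lines in the xx.qp file
    uv_to_qp: List[Set[int]],       # position = u-variant line index (xx.uv)
) -> List[str]:
    """Return a list of contract violations (empty if the contract holds)."""
    errors = []
    linked_qp = set()
    # Rule 1: every concept has 1 or more q-percepts.
    for ko, qps in enumerate(concept_to_qp):
        if not qps:
            errors.append(f"concept {ko} has no q-percept")
        linked_qp.update(qps)
    # Rule 2: every q-percept relates to at least 1 concept.
    for qp in range(qp_count):
        if qp not in linked_qp:
            errors.append(f"q-percept {qp} relates to no concept")
    # Rule 3: every u-variant relates to at least 1 q-percept.
    for uv, qps in enumerate(uv_to_qp):
        if not qps:
            errors.append(f"u-variant {uv} relates to no q-percept")
    return errors
```

Because the identifiers are line indexes, reordering or deleting lines in one file would invalidate the references held by the others, so a check of this kind is best run after all three files are final.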

Step 1e-uv-qp

Creation of files xx.uv.qp.ii and xx.uv.

Step 1f-ko-qp

Creation of files xx.ko.qp.ii, wn, en.def.

Step 1g-fr-def

Creation of file fr.def from en.def using the Google translation API.

Critical process

Among all the steps described above, the weak and critical point is the limitation of WordNet, which delivers English-only synsets. It is thus vital (but extremely difficult) to:

- define the French percept universe

- translate the English WordNet synsets into French

This should be later revisited as an (intelligent) task performed using the active jelly, if possible.