Building universe
Goal
The goal of the "building universe" process is the creation of 3 universes:
- concepts
- percepts
- variants
and the creation of various files containing the information built:
- ko.i
- xx.qp.i
- xx.uv.i
Compression
All or most files are compressed using the pack class.
They therefore carry the .pk suffix, which is omitted from the file names given here.
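The naming convention can be illustrated with a small helper that maps a file name as written in this document to the name expected on disk. This is only a sketch: the function `on_disk_name` and its `compressed` flag are hypothetical, it assumes the suffix is simply appended to the full name, and the pack class itself is not shown.

```python
PK_SUFFIX = ".pk"  # suffix carried by files compressed with the pack class


def on_disk_name(logical_name: str, compressed: bool = True) -> str:
    """Return the on-disk name for a logical file name such as 'ko.i'.

    Assumption: the suffix is appended to the whole name ('ko.i' -> 'ko.i.pk').
    """
    if compressed and not logical_name.endswith(PK_SUFFIX):
        return logical_name + PK_SUFFIX
    return logical_name


if __name__ == "__main__":
    for name in ("ko.i", "xx.qp.i", "xx.uv.i"):
        print(name, "->", on_disk_name(name))
```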
Sources
The sources used for the universe building are (subject to change):
- Wiktionary
- WordNet
- DELA (language only source)
- FreeBase
- WikiData
- BabelNet
All of these sources are far from perfect, because their underlying models are generally poor.
Here is a comparison of the capabilities inherited from the various sources:
| | WordNet | WikiData | FreeBase | BabelNet | Assothink |
|---|---|---|---|---|---|
| Provides concept linking | yes | yes (poor) | yes | poorly or indirectly | yes |
| MultiKeySet | 1 | 1 | 2 | (2) | 5... |
| MultiLang | 1 | N | N | N | 2+ |
| Categorization | 5 categs (NVAD | 4 categs (NGIO) | 4 categs (NGIO) | 4 categs (NGIO) | 10 categs (NVADQBL) |
| Encyclopedia connection | no | yes | yes | no | yes |
| Universe Size | Small | Huge | Huge | Huge | Small |
| Implementation | 3.0 good but old | ~ | ~ | ok | ongoing |
Categs legend: N(oun) V(erb) A(djective) a(D)verb G(eo) I(ndividuals) O(pera) L(ink) B(uiltin)
Steps
Step 0a-wiktionary
Creation of files en.fr.wiktionary and fr.en.wiktionary from massive wiktionary dumps.
Step 0b-translation
Creation of consolidated translation tables, mixing input from Wiktionary and the Google API (and possibly others).
The results are en.fr and fr.en, usable for various translation tasks.
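The consolidation step can be sketched as a merge of per-source tables. Everything below is an assumption made for illustration: the in-memory layout (a dict from a source word to a list of translations), the function name `merge_translations`, the rule that earlier sources take priority in the ordering, and the sample entries. The actual file formats and merge rules are not described here.

```python
def merge_translations(*sources):
    """Merge several translation tables into one.

    Each source maps a word to a list of candidate translations.
    Sources are processed in order, so translations from earlier
    sources come first; duplicates are dropped.
    """
    merged = {}
    for source in sources:
        for word, translations in source.items():
            target = merged.setdefault(word, [])
            for translation in translations:
                if translation not in target:
                    target.append(translation)
    return merged


# Hypothetical sample data standing in for the two inputs.
wiktionary = {"cat": ["chat"], "dog": ["chien"]}
google = {"cat": ["chat", "minou"], "house": ["maison"]}

en_fr = merge_translations(wiktionary, google)
```

Keeping the per-word lists ordered by source makes it possible to prefer one source over another later, without discarding the alternatives.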
Step 1a-dela
Creation of the full DELA files xx.qp.delafull, xx.uv.delafull and xx.qp.uv.delafull.
The result files are sorted.
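One plausible benefit of sorted result files is that an entry can be located by binary search instead of a full scan. The sketch below assumes one entry per line, an exact match on the whole line, and a sort order that agrees with Python's default string comparison; the function `find_line` and the sample lines are hypothetical, and the real delafull line format is not shown here.

```python
from bisect import bisect_left


def find_line(sorted_lines, key):
    """Return the index of `key` in `sorted_lines`, or -1 if absent.

    `sorted_lines` must already be sorted in ascending order.
    """
    i = bisect_left(sorted_lines, key)
    if i < len(sorted_lines) and sorted_lines[i] == key:
        return i
    return -1


# Hypothetical content of a sorted file, already split into lines.
lines = ["abaisser", "chat", "chien"]
```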
Step 1b-kob
Creation of kob file. Purely algorithmic. No external source. Concepts of categories QCSR are created from here.
The result file is sorted.
Step 1c-wordnet
Creation of file en.qp.wordnetfull from wordnet data. This file is NOT sorted. It contains all percepts known by wordnet.
Step 1d-universe
Creation of the effective files ko-wn, xx.qp and categCnt.
The logic is based on the contract defining the resulting files:
- All concepts defined have 1 or more q-percept in any language.
- All q-percepts defined relate to at least 1 concept.
- All u-variants relate to at least 1 q-percept.
Concepts will later be identifiable by their line index in the ko-wn file.
Q-percepts will later be identifiable by their line index in the xx.qp file.
U-variants will later be identifiable by their line index in the xx.uv file.
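The contract above can be checked mechanically once the links are loaded in memory. The following is a minimal sketch under several assumptions: a single language, 0-based line indices used as identifiers, and links held as Python sets. The function `check_contract` and its parameters are hypothetical and do not reflect the actual file formats.

```python
def check_contract(concept_to_qp, n_qpercepts, uv_to_qp):
    """Return a list of contract violations (empty if the contract holds).

    concept_to_qp[c] : set of q-percept indices linked to concept c
                       (c is the line index in ko-wn)
    n_qpercepts      : number of lines in the xx.qp file
    uv_to_qp[u]      : set of q-percept indices linked to u-variant u
                       (u is the line index in xx.uv)
    """
    errors = []
    covered = set()

    # Rule 1: every concept has 1 or more q-percept.
    for c, qps in enumerate(concept_to_qp):
        if not qps:
            errors.append(f"concept {c} has no q-percept")
        covered |= qps

    # Rule 2: every q-percept relates to at least 1 concept.
    for q in range(n_qpercepts):
        if q not in covered:
            errors.append(f"q-percept {q} relates to no concept")

    # Rule 3: every u-variant relates to at least 1 q-percept.
    for u, qps in enumerate(uv_to_qp):
        if not qps:
            errors.append(f"u-variant {u} relates to no q-percept")

    return errors
```

Because identifiers are line indices, any step that reorders or filters a file after this point would invalidate the links, which is one reason to validate the contract right after the files are written.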
Step 1e-uv-qp
Creation of files xx.uv.qp.ii and xx.uv
Step 1f-ko-qp
Creation of files xx.ko.qp.ii wn en.def
Step 1g-fr-def
Creation of file fr.def from en.def using the Google translation API.
Critical process
Among all the steps described above, the weak and critical point is the limitation of WordNet, which delivers English-only synsets. It is thus vital (but extremely difficult) to:
- define the French percept universe
- translate the English WordNet synsets into French
This should be revisited later as an (intelligent) task performed using the active jelly, if possible.