Billy Ian's Short Leisure-time Wander

into language, learning, intelligence and beyond

GSoC 2016: Inferring Infobox Template Class Mappings From Wikipedia and WikiData


This page is the public project page and will be updated every week.

Project Description

There are many infoboxes on Wikipedia. Here is an example of a football infobox:

[Figure: example of a football infobox]

As seen, every infobox has some properties. In fact, every infobox follows a certain template. The goal of my project is to find mappings between classes in the DBpedia ontology (e.g. dbo:Person, dbo:City) and infobox templates on the pages of Wikipedia resources, using machine learning techniques.

Many infobox mappings are available for a few languages, but far fewer for others. In order to infer mappings for all languages, cross-lingual knowledge validation should also be considered.

The main output of the project will be a list of new high-quality infobox-class mappings.

Weekly Progress

Week 1 (5.8-5.14)

  • First meeting with my mentor
  • Create the public page for the project
  • Create the Google Doc for the project

Week 2 (5.15-5.21)

Week 3 (5.22-5.28)

  • Figure out the information we have:
    • 1) Existing mappings: we have manually curated information that a template in language $X$ is mapped to class $Y$.
    • 2) Inter-language links between templates: e.g. template $X_1$ in language $X$ is mapped to class $Y$, and there is a link from this template to templates in other languages. This makes it highly probable that the equivalent templates in those languages should be mapped to the same class $Y$.
    • 3) Links between articles in different languages, together with the templates each article uses. This way, when (2) does not hold, we can still find which templates are used for the same concepts.
    • 4) Most Wikidata articles have a type assigned; using this information, we get a variation of (3) but with manually assigned types.
  • Process the information (1)-(3) described above and obtain some preliminary results.
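Source (2) above can be processed with a simple vote over inter-language template links. A minimal sketch, assuming hypothetical toy data (the function and data names are illustrative, not the project's actual code):

```python
from collections import Counter

def propose_mappings(existing, interlang_links):
    """Propose template -> class mappings for unmapped templates by
    propagating existing mappings over inter-language template links.

    existing: dict mapping (lang, template) -> ontology class (source 1)
    interlang_links: dict mapping (lang, template) -> list of
        (other_lang, other_template) pairs (source 2)
    """
    proposals = {}
    for key, links in interlang_links.items():
        if key in existing:
            continue  # already mapped manually
        votes = Counter(existing[l] for l in links if l in existing)
        if votes:
            cls, count = votes.most_common(1)[0]
            # confidence = fraction of linked templates agreeing on cls
            proposals[key] = (cls, count / sum(votes.values()))
    return proposals

# Hypothetical toy data: the English football-biography infobox is
# mapped, and a Chinese template links to it.
existing = {("en", "Infobox_football_biography"): "dbo:SoccerPlayer"}
links = {("zh", "Infobox_football_biography_zh"): [("en", "Infobox_football_biography")]}
print(propose_mappings(existing, links))
# {('zh', 'Infobox_football_biography_zh'): ('dbo:SoccerPlayer', 1.0)}
```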

Week 4 (5.29-6.4)

  • Propose a baseline approach: given a template classified as an infobox, the approach is instance-based; it exploits the already-mapped articles and their cross-language links to the articles that contain the template to be mapped. This approach is summarized in the figure below.

[Figure: overview of the instance-based baseline approach]

  • Experiments:
    • Use existing English mappings to create mappings for Chinese.
    • Evaluation: try this approach on other languages that already have some existing mappings. (In progress)
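The instance-based baseline can be sketched as a majority vote over the typed cross-language counterparts of the articles that use the template. All data and names below are hypothetical toy examples, not the project's actual pipeline:

```python
from collections import Counter

def map_template(template, articles_using, crosslang, article_class):
    """Instance-based baseline sketch: follow the cross-language links
    of the articles that use an unmapped template, and take a majority
    vote over the classes of the linked, already-typed articles."""
    votes = Counter()
    for article in articles_using.get(template, []):
        for linked in crosslang.get(article, []):
            if linked in article_class:
                votes[article_class[linked]] += 1
    if not votes:
        return None, 0.0
    cls, n = votes.most_common(1)[0]
    return cls, n / sum(votes.values())

# Toy example: two target-language articles use the template; their
# English counterparts are typed dbo:City via existing mappings.
articles_using = {"Infobox_city_zh": ["Shanghai_zh", "Beijing_zh"]}
crosslang = {"Shanghai_zh": ["Shanghai"], "Beijing_zh": ["Beijing"]}
article_class = {"Shanghai": "dbo:City", "Beijing": "dbo:City"}
print(map_template("Infobox_city_zh", articles_using, crosslang, article_class))
# ('dbo:City', 1.0)
```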

Week 5 (6.5-6.11)

  • Learn the difference between infoboxes and macro templates and try to filter out the macro templates.
  • More experiments.
  • Start to write a summary about the progress so far.

Week 6 (6.12-6.18)

  • Complete the code and the documentation; see this repo on GitHub.
  • Complete the report on the current progress and remaining work, as required for the mid-term evaluation.
  • Start working on the ontology hierarchy and template filtering.

Week 7 (6.19-6.25)

  • Use multiple languages to evaluate the predicted results.
  • Use the ontology hierarchy to assign types to articles and evaluate the predicted results.
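One way the ontology hierarchy can be used in evaluation is to count a prediction as correct when it matches the gold class or one of its ancestors. A minimal sketch, with a hypothetical fragment of the DBpedia class hierarchy hard-coded for illustration:

```python
# Hypothetical fragment of the DBpedia class hierarchy (child -> parent).
SUPER = {
    "dbo:SoccerPlayer": "dbo:Athlete",
    "dbo:Athlete": "dbo:Person",
    "dbo:Person": "owl:Thing",
}

def ancestors(cls):
    """Return cls and all its superclasses, walking up the hierarchy."""
    chain = [cls]
    while cls in SUPER:
        cls = SUPER[cls]
        chain.append(cls)
    return chain

def hierarchy_match(predicted, gold):
    """A prediction counts as correct if it is the gold class itself
    or any of its ancestors."""
    return predicted in ancestors(gold)

print(hierarchy_match("dbo:Person", "dbo:SoccerPlayer"))  # True
print(hierarchy_match("dbo:City", "dbo:SoccerPlayer"))    # False
```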

Week 8 (6.26-7.2)

  • Complete a script combining all the modules so far: it downloads the data, parses it, predicts the mappings, and evaluates the predictions end to end.
  • Update the README file and add some figures about the evaluation results on Bulgarian.

Week 9 (7.3-7.9)

  • Use information in Wikidata: quite a few entities in Wikidata already have a DBpedia ontology type assigned. In addition, we have links between Wikidata and the other languages, so we can treat Wikidata directly as a pivot language. This information can help improve the performance of our approach.
  • Case study of misclassified cases in Bulgarian.

Week 10 (7.10-7.16)

Week 11 (7.17-7.23)

  • Implement the ideas in this paper on DBpedia.
  • Use cross-validation to evaluate performance on the link prediction task.
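Cross-validation for link prediction amounts to splitting the known triples into folds, training the model on all but one fold, and scoring the held-out triples. A minimal splitting sketch (the triple data here is synthetic, for illustration only):

```python
import random

def kfold_triples(triples, k=5, seed=0):
    """Yield (train, test) splits of a triple list for cross-validated
    link prediction: each fold serves once as the held-out test set
    while the remaining folds are used to fit the model."""
    rng = random.Random(seed)
    shuffled = triples[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [t for j, fold in enumerate(folds) if j != i for t in fold]
        yield train, test

# Synthetic toy triples: 10 (subject, predicate, object) facts.
triples = [("s%d" % i, "p", "o%d" % i) for i in range(10)]
for train, test in kfold_triples(triples, k=5):
    assert len(train) == 8 and len(test) == 2
```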

Week 12 (7.24-7.30)

  • More experiments with tensor factorization on DBpedia, but the results are not very good.

Week 13 (7.31-8.6)

  • Read papers about graph embeddings and their applications to knowledge bases: TransE and HolE.
  • Apply the ideas from the above papers to DBpedia. However, the results are still not as good as those reported in the papers.
  • Use grid search to find the optimal parameter settings and improve performance.

Week 14 (8.7-8.13)

  • With a large rank, RESCAL achieves fairly good performance on tasks like type prediction for smaller languages such as Bulgarian: the AUC-PR is around 0.8. However, due to memory limits, RESCAL performs poorly on larger languages like German and English.
  • Develop two scripts, based on RESCAL and HolE, that compute a score for given triples indicating how likely they are to exist in DBpedia, which can help decide whether to add new triples to DBpedia.
  • At this point, almost all the work for this project is complete.
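The two scoring functions behind those scripts are standard: RESCAL scores a triple as a bilinear product through a per-relation matrix, and HolE uses circular correlation. A minimal numpy sketch, where the factors are random placeholders rather than embeddings learned from DBpedia:

```python
import numpy as np

def rescal_score(A, R, i, k, j):
    """RESCAL triple score a_i^T R_k a_j: A holds the entity embeddings,
    R[k] is the mixing matrix of relation k."""
    return A[i] @ R[k] @ A[j]

def hole_score(A, P, i, k, j):
    """HolE triple score p_k^T (a_i * a_j), where * is circular
    correlation, computed here via the FFT identity
    corr(a, b) = ifft(conj(fft(a)) * fft(b))."""
    corr = np.fft.ifft(np.conj(np.fft.fft(A[i])) * np.fft.fft(A[j])).real
    return P[k] @ corr

# Random toy factors (illustrative only): 3 entities, 2 relations, rank 4.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
R = rng.normal(size=(2, 4, 4))
P = rng.normal(size=(2, 4))
print(rescal_score(A, R, 0, 1, 2))  # higher score => triple more plausible
print(hole_score(A, P, 0, 1, 2))
```

Thresholding these scores (or ranking candidate triples by them) is what lets the scripts suggest which new triples are safe to add.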

Week 15 (8.14-8.19)

  • Submit the code and complete the final evaluations.
