# GSoC 2016: Inferring Infobox Template Class Mappings From Wikipedia and WikiData

This page is the public project page and will be updated every week.

### Project Description

There are many infoboxes on Wikipedia. Here is an example of a football infobox:

As the example shows, every infobox has a set of properties, and every infobox follows a certain template. The goal of my project is to find mappings between classes in the DBpedia ontology (e.g. dbo:Person, dbo:City) and the infobox templates used on Wikipedia pages, using machine learning techniques.

Many infobox mappings are available for a few languages, but far fewer for the others. To infer mappings for all languages, cross-lingual knowledge validation should also be considered.

The main output of the project will be a list of new high-quality infobox-class mappings.

### Weekly Progress

Week 1 (5.8-5.14)

• First meeting with my mentor
• Create the public page for the project
• Create the Google Doc for the project

Week 2 (5.15-5.21)

Week 3 (5.22-5.28)

• Figure out the information we have:
• 1) Existing mappings: we have manually created information that a template in language $X$ is mapped to class $Y$.
• 2) Inter-language links between templates: e.g. template $X_1$ in language $X$ is mapped to class $Y$, and there is a link from this template to templates in other languages. This gives a high probability that the equivalent templates in other languages should be mapped to the same class $Y$.
• 3) Links between articles in different languages, together with the templates each article uses. When (2) does not hold, this lets us find which templates are used for the same concepts.
• 4) Most Wikidata articles have a type assigned. Using this information gives a variation of (3), but with manually assigned types.
• Process the information (1)-(3) described above and obtain some preliminary results.
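Information source (2) can be illustrated with a minimal sketch, assuming the existing mappings and the inter-language template links have already been loaded into plain dictionaries. The function name `propagate_mappings` and all template identifiers below are made up for illustration and are not taken from the project code.

```python
def propagate_mappings(known_mappings, template_links):
    """Suggest classes for unmapped templates via inter-language template links.

    known_mappings: dict, template -> class (manually created mappings)
    template_links: dict, template -> list of equivalent templates in other languages
    Returns a dict, unmapped template -> set of candidate classes.
    """
    candidates = {}
    for template, cls in known_mappings.items():
        for linked in template_links.get(template, []):
            # Only propose classes for templates that have no mapping yet.
            if linked not in known_mappings:
                candidates.setdefault(linked, set()).add(cls)
    return candidates


# Toy data with hypothetical template identifiers.
known_mappings = {
    "en:Infobox_A": "dbo:SoccerPlayer",
    "en:Infobox_B": "dbo:City",
}
template_links = {
    "en:Infobox_A": ["bg:Infobox_A", "zh:Infobox_A"],
    "en:Infobox_B": ["bg:Infobox_B", "en:Infobox_A"],
}
candidates = propagate_mappings(known_mappings, template_links)
```

A template that receives more than one candidate class is a sign that (2) does not hold cleanly, which is where the article-level links from (3) become useful.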

Week 4 (5.29-6.4)

• Propose a baseline approach: given a template classified as an infobox, the approach is instance-based and exploits the already mapped articles and their cross-language links to the articles that contain the template to be mapped. This approach is summarized in the figure below.

• Experiments:
• Based on existing mappings in English, create mappings for Chinese.
• Evaluation: try this approach on other languages that already have some mappings. (In progress)
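The instance-based idea can be sketched roughly as a majority vote. This is only an illustration under assumptions: the data layout, the function name `predict_template_class`, the `min_support` parameter, and the article identifiers are hypothetical, and the actual implementation in the project may differ.

```python
from collections import Counter


def predict_template_class(template, articles_using_template,
                           cross_language_links, known_article_types,
                           min_support=1):
    """Predict a class for `template` by voting over already typed articles.

    articles_using_template: dict, template -> list of articles containing it
    cross_language_links:    dict, article -> linked article in the mapped language
    known_article_types:     dict, article in the mapped language -> class
    Returns (class, share of votes), or (None, 0.0) if there is too little evidence.
    """
    votes = Counter()
    for article in articles_using_template.get(template, []):
        linked = cross_language_links.get(article)
        cls = known_article_types.get(linked)
        if cls is not None:
            votes[cls] += 1
    if not votes:
        return None, 0.0
    cls, count = votes.most_common(1)[0]
    if count < min_support:
        return None, 0.0
    return cls, count / sum(votes.values())


# Toy data: three Chinese articles use the same (hypothetical) template.
articles_using_template = {"zh:Infobox_A": ["zh:a1", "zh:a2", "zh:a3", "zh:a4"]}
cross_language_links = {"zh:a1": "en:a1", "zh:a2": "en:a2", "zh:a3": "en:a3"}
known_article_types = {
    "en:a1": "dbo:SoccerPlayer",
    "en:a2": "dbo:SoccerPlayer",
    "en:a3": "dbo:Person",
}
predicted_class, vote_share = predict_template_class(
    "zh:Infobox_A", articles_using_template,
    cross_language_links, known_article_types)
```

The share of votes gives a simple confidence value that can be thresholded before a predicted mapping is accepted.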

Week 5 (6.5-6.11)

• Learn the difference between infoboxes and macro templates and try to filter the macro templates.
• More experiments.
• Start to write a summary about the progress so far.

Week 6 (6.12-6.18)

• Complete the code and the documentation, see this repo on GitHub.
• Complete the report about the current progress and further work required by the mid-term evaluation.
• Start working on the ontology hierarchy and template filtering.

Week 7 (6.19-6.25)

• Use multiple languages to evaluate the predicted results.
• Use ontology hierarchy to assign types to articles and evaluate the predicted results.
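One way to take the ontology hierarchy into account during evaluation is to distinguish exact matches from predictions that are merely more general or more specific than the expected class. The sketch below assumes the hierarchy is available as a child-to-parent dictionary; the small `PARENT_OF` fragment and the function names are illustrative and should be checked against the actual DBpedia ontology.

```python
def ancestors(cls, parent_of):
    """Return the chain of superclasses of `cls`, nearest first."""
    chain = []
    seen = set()
    while cls in parent_of and cls not in seen:
        seen.add(cls)  # guard against accidental cycles in the input
        cls = parent_of[cls]
        chain.append(cls)
    return chain


def compare_with_hierarchy(predicted, gold, parent_of):
    """Classify a prediction relative to the gold class."""
    if predicted == gold:
        return "exact"
    if predicted in ancestors(gold, parent_of):
        return "more_general"
    if gold in ancestors(predicted, parent_of):
        return "more_specific"
    return "mismatch"


# Illustrative fragment of a class hierarchy (child -> parent).
PARENT_OF = {
    "dbo:SoccerPlayer": "dbo:Athlete",
    "dbo:Athlete": "dbo:Person",
    "dbo:City": "dbo:Settlement",
}

result_general = compare_with_hierarchy("dbo:Person", "dbo:SoccerPlayer", PARENT_OF)
result_mismatch = compare_with_hierarchy("dbo:City", "dbo:SoccerPlayer", PARENT_OF)
```

Counting "more_general" predictions separately from "mismatch" makes it visible when a prediction is correct but not as specific as it could be.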

Week 8 (6.26-7.2)

• Complete a script that combines all the modules so far; it downloads the data, parses the data, predicts the mappings, and evaluates the predicted results as a whole.
• Update the README file and add some figures about the evaluation results on Bulgarian.

Week 9 (7.3-7.9)

• Use information in Wikidata: quite a few entities in Wikidata already have DBpedia ontology types assigned. In addition, we have links between Wikidata and the other languages. As a result, we can treat Wikidata directly as a pivot language. The information from Wikidata can help improve the performance of our approach.
• Case study on misclassified cases in Bulgarian.
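Treating Wikidata as a pivot can be sketched as a two-step lookup: from an article to its Wikidata item, and from the item to its ontology type. The dictionaries, the function name `types_via_wikidata`, and the item identifiers below are placeholders chosen for illustration, not real Wikidata IDs or project code.

```python
def types_via_wikidata(articles, sitelinks, wikidata_types):
    """Resolve article types through Wikidata items.

    articles:       iterable of (language, title) pairs
    sitelinks:      dict, (language, title) -> Wikidata item identifier
    wikidata_types: dict, item identifier -> DBpedia ontology class
    Returns a dict containing only the articles whose type could be resolved.
    """
    resolved = {}
    for article in articles:
        item = sitelinks.get(article)
        cls = wikidata_types.get(item)
        if cls is not None:
            resolved[article] = cls
    return resolved


# Toy data with placeholder item identifiers.
sitelinks = {
    ("bg", "Article 1"): "item-1",
    ("bg", "Article 2"): "item-2",
    ("bg", "Article 3"): "item-3",
}
wikidata_types = {"item-1": "dbo:City", "item-3": "dbo:Person"}
articles = [("bg", "Article 1"), ("bg", "Article 2"),
            ("bg", "Article 3"), ("bg", "Article 4")]
resolved = types_via_wikidata(articles, sitelinks, wikidata_types)
```

Articles without a sitelink or without a typed item are simply left out, so this source can complement the cross-language links rather than replace them.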

Week 10 (7.10-7.16)

• Read papers for further improvements:
• Start manually checking the predicted mappings in Chinese as the final output of the project.

Week 11 (7.17-7.23)

• Implement the ideas in this paper on DBpedia.
• Use cross-validation to evaluate the performance on the link prediction task.
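A cross-validation setup for link prediction can be sketched as a k-fold split over the known triples, where each fold is held out once and the model is trained on the rest. The function name `k_fold_triples`, the seed, and the toy triples are assumptions for illustration; the actual evaluation protocol of the project is not described here.

```python
import random


def k_fold_triples(triples, k=5, seed=0):
    """Yield (train, test) splits so that every triple is held out exactly once."""
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled triples into k folds of nearly equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [t for j, fold in enumerate(folds) if j != i for t in fold]
        yield train, test


# Toy (subject, predicate, object) triples.
TRIPLES = [
    ("s1", "rdf:type", "dbo:City"),
    ("s2", "rdf:type", "dbo:City"),
    ("s3", "rdf:type", "dbo:Person"),
    ("s4", "rdf:type", "dbo:Person"),
    ("s3", "dbo:birthPlace", "s1"),
    ("s4", "dbo:birthPlace", "s2"),
]
splits = list(k_fold_triples(TRIPLES, k=3))
```

For each split, the held-out triples serve as positive examples that the model trained on the remaining triples should rank highly.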

Week 12 (7.24-7.30)

• Run more experiments with tensor factorization on DBpedia; the results are still not very good.

Week 13 (7.31-8.6)

• Read papers about graph embeddings and their applications to knowledge bases: TransE and HOLE
• Apply the ideas from these papers to DBpedia. However, the results are still not as good as those reported in the papers.
• Use grid search to find the optimal parameter setting to improve the performance.
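A grid search simply evaluates every combination of candidate parameter values and keeps the best one. The sketch below is generic: the parameter names `rank` and `reg`, the candidate values, and the stand-in `toy_evaluate` function are hypothetical, and in practice `evaluate` would train a model and return a validation score.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Try every parameter combination and return (best_params, best_score).

    evaluate: callable taking a dict of parameters and returning a score
              where higher is better
    grid:     dict, parameter name -> list of candidate values
    """
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params):
    # Stand-in for training and validating a model; peaks at rank=50, reg=0.01.
    return -abs(params["rank"] - 50) / 50 - abs(params["reg"] - 0.01)


grid = {"rank": [10, 50, 100], "reg": [0.001, 0.01, 0.1]}
best_params, best_score = grid_search(toy_evaluate, grid)
```

The number of evaluations grows multiplicatively with each added parameter, so the grid is best kept small when each evaluation means training an embedding model.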

Week 14 (8.7-8.13)

• With a large rank, RESCAL achieves fairly good performance on tasks like type prediction for smaller languages such as Bulgarian, with an AUC-PR of around 0.8. However, due to the memory limit, RESCAL performs poorly on larger languages like German and English.
• Develop two scripts based on RESCAL and HOLE that compute a score for given triples, indicating the likelihood that the triples exist in DBpedia. This can help determine whether to add new triples to DBpedia.
• So far, I have completed almost all of my work for this project.
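For reference, the triple scores of the two models can be written down in a few lines. This is a sketch of the standard scoring functions from the RESCAL and HOLE papers using plain Python lists, assuming the embeddings have already been learned; it is not the project's scripts, and the function names are chosen for illustration.

```python
import math


def rescal_score(a_s, R_p, a_o):
    """RESCAL score a_s^T R_p a_o for subject/object vectors and a relation matrix."""
    return sum(a_s[i] * R_p[i][j] * a_o[j]
               for i in range(len(a_s))
               for j in range(len(a_o)))


def circular_correlation(a, b):
    """Circular correlation: result[k] = sum_i a[i] * b[(k + i) mod d]."""
    d = len(a)
    return [sum(a[i] * b[(k + i) % d] for i in range(d)) for k in range(d)]


def hole_score(e_s, r_p, e_o):
    """HOLE score: sigmoid of the dot product of r_p with (e_s correlated with e_o)."""
    corr = circular_correlation(e_s, e_o)
    x = sum(r * c for r, c in zip(r_p, corr))
    return 1.0 / (1.0 + math.exp(-x))


# Tiny hand-made embeddings, only to show the shapes involved.
example_rescal = rescal_score([1.0, 0.0], [[0.5, 2.0], [0.0, 1.0]], [0.0, 1.0])
example_corr = circular_correlation([1.0, 2.0], [3.0, 4.0])
example_hole = hole_score([1.0, 2.0], [0.0, 0.0], [3.0, 4.0])
```

The relation matrix in RESCAL has rank × rank entries per relation, while HOLE keeps a single vector per relation, which is one reason the memory requirements of the two models differ.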

Week 15 (8.14-8.19)

• Submit the code and complete the final evaluations.