Max-Planck-Institut für Informatik


Combining Linguistic and Statistical Analysis to Extract Relations from Web Documents

Suchanek, Fabian and Ifrim, Georgiana and Weikum, Gerhard

MPI-I-2006-5-004. March 2006, 37 pages. | Status: available - back from printing

Abstract:
Search engines, question answering systems and classification systems
alike can greatly profit from formalized world knowledge.
Unfortunately, manually compiled collections of world knowledge (such
as WordNet or the Suggested Upper Merged Ontology SUMO) often suffer
from low coverage, high assembling costs and fast aging. In contrast,
the World Wide Web provides an endless source of knowledge, assembled
by millions of people, updated constantly and available for free. In
this paper, we propose a novel method for learning arbitrary binary
relations from natural language Web documents, without human
interaction. Our system, LEILA, combines linguistic analysis and
machine learning techniques to find robust patterns in the text and to
generalize them. For initialization, we only require a set of examples
of the target relation and a set of counterexamples (e.g. from
WordNet). The architecture consists of three stages: finding patterns in
the corpus based on the given examples, assessing the patterns based on
probabilistic confidence, and applying the generalized patterns to
propose pairs for the target relation. We prove the benefits and
practical viability of our approach by extensive experiments, showing
that LEILA achieves consistent improvements over existing comparable
techniques (e.g. Snowball, TextToOnto).
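
The three stages named in the abstract can be illustrated with a toy sketch. It is not the LEILA implementation: the linguistic analysis is replaced here by plain surface strings between two entity mentions, and the probabilistic confidence by a simple share of positive hits. All function names, the threshold, and the sample sentences are assumptions made for illustration only.

```python
import re
from collections import defaultdict


def find_patterns(sentences, examples, counterexamples):
    """Stage 1: record the text between the two members of a known pair."""
    counts = defaultdict(lambda: [0, 0])  # pattern -> [positive hits, negative hits]
    for sentence in sentences:
        for label, pairs in ((0, examples), (1, counterexamples)):
            for x, y in pairs:
                m = re.search(re.escape(x) + r"(.+?)" + re.escape(y), sentence)
                if m:
                    counts[m.group(1)][label] += 1
    return counts


def assess_patterns(counts, threshold=0.5):
    """Stage 2: keep patterns whose share of positive hits exceeds the threshold.

    This ratio is a stand-in for a confidence measure, chosen for simplicity.
    """
    return {p for p, (pos, neg) in counts.items() if pos / (pos + neg) > threshold}


def propose_pairs(sentences, patterns):
    """Stage 3: apply the accepted patterns to propose new candidate pairs."""
    proposed = set()
    for sentence in sentences:
        for p in patterns:
            m = re.search(r"(\w+)" + re.escape(p) + r"(\w+)", sentence)
            if m:
                proposed.add((m.group(1), m.group(2)))
    return proposed


# Hypothetical toy corpus for a "capital of" relation.
sentences = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "Paris is larger than Berlin.",
    "Rome is the capital of Italy.",
]
examples = {("Paris", "France")}
counterexamples = {("Paris", "Berlin")}

counts = find_patterns(sentences, examples, counterexamples)
patterns = assess_patterns(counts)
proposed = propose_pairs(sentences, patterns)
print(sorted(proposed))
```

In this sketch the single example pair yields one accepted pattern, which then matches the other sentences of the same shape; the counterexample keeps the "is larger than" pattern from being accepted.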
Acknowledgement: We would like to thank Eugene Agichtein for his caring support with Snowball.
Furthermore, Johanna Völker and Philipp Cimiano deserve our sincere
thanks for their unreserved assistance with their system.
Categories / Keywords: Ontology Learning, Relation Extraction, Information Extraction, Linguistic

Attachment: MPI-I-2006-5-004.pdf (191 KBytes)

BibTeX:
  AUTHOR = {Suchanek, Fabian and Ifrim, Georgiana and Weikum, Gerhard},
  TITLE = {Combining Linguistic and Statistical Analysis to Extract Relations from Web Documents},
  TYPE = {Research Report},
  INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik},
  ADDRESS = {Stuhlsatzenhausweg 85, 66123 Saarbr{\"u}cken, Germany},
  NUMBER = {MPI-I-2006-5-004},
  MONTH = {March},
  YEAR = {2006},
  ISSN = {0946-011X},