Knowledge Graph Completion
Semantic information can be extracted by public collaborative efforts, such as Wikipedia, or curated by closed groups, and this information can be managed in a knowledge graph (KG). A knowledge graph can be represented as a multi-relational graph with heterogeneous sets of vertex and edge types. The edge type indicates a semantic relationship between vertices, and there may be multiple relationships between any pair of vertices. Although KGs can be huge, they often have a large amount of missing information. KG completion is the task of inferring this missing information by analyzing the semantic information already in the graph. Figures 1 and 2 show examples of a single-relational graph and a multi-relational graph, respectively. The single-relational graph has a homogeneous set of vertices (i.e., they all represent Articles) and a single edge type (representing the relation 'cites'). The multi-relational graph has heterogeneous sets of vertices (i.e., Articles and Persons) and edge types (i.e., cites, reviewed and authored).
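As a minimal sketch of this distinction, a multi-relational graph can be held as a list of typed edges plus a vertex-type lookup. All vertex names below are hypothetical, and the edge type extends does not come from the figures; it is added only to show two relationships on the same pair of vertices.

```python
# Hypothetical toy data: vertex and edge names are illustrative only.
vertex_type = {
    "alice": "Person",
    "bob": "Person",
    "paper_a": "Article",
    "paper_b": "Article",
}

# Each edge is (source, edge_type, target).
edges = [
    ("alice", "authored", "paper_a"),
    ("bob", "authored", "paper_b"),
    ("bob", "reviewed", "paper_a"),
    ("paper_a", "cites", "paper_b"),
    # Hypothetical edge type, included to show a second relation on one pair.
    ("paper_a", "extends", "paper_b"),
]


def relations_between(edges, source, target):
    """Return the set of edge types that link source to target."""
    return {etype for s, etype, t in edges if s == source and t == target}


def is_single_relational(edges, vertex_type):
    """True if the edges use at most one edge type and one vertex type."""
    edge_types = {etype for _, etype, _ in edges}
    vertex_types = {vertex_type[v] for s, _, t in edges for v in (s, t)}
    return len(edge_types) <= 1 and len(vertex_types) <= 1


# Keeping only the 'cites' edges yields a single-relational citation graph.
citation_edges = [e for e in edges if e[1] == "cites"]
```

Under these assumptions, is_single_relational(citation_edges, vertex_type) holds for the citation-only subgraph and fails for the full edge list, which mixes Persons and Articles and several edge types.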
KGs store knowledge as a set of triples, each consisting of two entities (the subject and the object) and the type of relationship between them (the predicate). An example is (Elvis_Presley, actedIn, Flaming_Star), as shown in Figure 3. Typically, a knowledge graph grows over time. For instance, Figure 4 shows how the number of person entities in DBpedia is increasing. However, KGs often have a great deal of missing information. For example, 71% of the people in the Freebase KG do not have their place of birth listed.
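To make the triple representation and the idea of missing information concrete, the sketch below stores a few triples as Python tuples and measures what fraction of entities of a given type lack a given predicate. The predicate names type and birthPlace, the placeholder entities Person_A to Person_C and City_X, and the function missing_ratio are assumptions made for illustration; they are not taken from DBpedia or Freebase.

```python
# Hypothetical toy triples in (subject, predicate, object) form.
triples = {
    ("Elvis_Presley", "actedIn", "Flaming_Star"),
    ("Elvis_Presley", "type", "Person"),
    ("Elvis_Presley", "birthPlace", "Tupelo"),
    ("Person_A", "type", "Person"),
    ("Person_B", "type", "Person"),
    ("Person_B", "birthPlace", "City_X"),
    ("Person_C", "type", "Person"),
}


def missing_ratio(triples, entity_type, predicate):
    """Fraction of entities of entity_type with no triple using predicate."""
    entities = {s for s, p, o in triples if p == "type" and o == entity_type}
    covered = {s for s, p, _ in triples if p == predicate}
    if not entities:
        return 0.0
    return len(entities - covered) / len(entities)


ratio = missing_ratio(triples, "Person", "birthPlace")
print(f"{ratio:.0%} of Person entities have no birthPlace triple")
```

In this toy set, two of the four Person entities have no birthPlace triple. A statistic such as the Freebase figure quoted above would be obtained in the same way over the full graph.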
A further missing-data issue arises when a knowledge graph is generated from input documents that themselves have missing facts. When we extract knowledge from text data, we find that some facts are absent from the documents, or at least are not explicitly stated in them. This is because humans will often, either deliberately or unintentionally, omit certain facts related to the document topic (for instance, if they are considered to be self-evident or unnecessary). Therefore, if we generate a graph from documents that have missing facts, we would naturally expect the corresponding knowledge graph to have missing vertices and edges as well. The right-hand graphic in Figure 5 shows a graph generated from documents relating to nuclear weapons. Vertices represent countries, weapons, etc., and edge labels indicate relationships between vertices, such as tested, stockpiled, etc. Figure 6 shows an enriched version of this graph. Missing facts (or at least educated guesses at them) can be discovered by inferring missing relationships in the knowledge graph.
Semantic information in a KG can be preserved in embedding vector spaces, and missing information in the graph can be found efficiently by using the semantic information in these vector spaces. Recently, knowledge graph embedding methods have shown good results on this inference problem. Missing data can reduce the power of a model or lead to a biased model, and it can also lead to erroneous predictions or classifications. The enriched knowledge graph can subsequently be used for more accurate analysis of the domain. The task of knowledge graph completion is to infer missing entities, entity types and relation types.
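The text does not name a specific embedding method. As one illustration, the sketch below uses the scoring idea of TransE, a well-known translation-based model in which a triple (h, r, t) is considered plausible when the vector h + r lies close to t. The two-dimensional vectors are hand-picked for the example rather than learned, and the names entity_vec, relation_vec, score and rank_tails are hypothetical.

```python
import math

# Hand-picked 2-D embeddings, purely illustrative; a real model learns these.
entity_vec = {
    "Elvis_Presley": (0.0, 0.0),
    "Flaming_Star": (1.0, 0.0),
    "Tupelo": (0.0, 1.0),
}
relation_vec = {
    "actedIn": (1.0, 0.0),
    "birthPlace": (0.0, 1.0),
}


def score(head, relation, tail):
    """TransE-style distance ||h + r - t||; lower means more plausible."""
    h = entity_vec[head]
    r = relation_vec[relation]
    t = entity_vec[tail]
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def rank_tails(head, relation):
    """Rank every other entity as a candidate object for (head, relation, ?)."""
    candidates = [e for e in entity_vec if e != head]
    return sorted(candidates, key=lambda tail: score(head, relation, tail))


# Query for a missing object: (Elvis_Presley, birthPlace, ?)
print(rank_tails("Elvis_Presley", "birthPlace"))
```

With these toy vectors, Tupelo is ranked ahead of Flaming_Star for the birthPlace query, because Elvis_Presley + birthPlace coincides with the Tupelo vector. In practice the embeddings are trained so that observed triples score well, and the ranking is then used to propose missing ones.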