A Knowledge Graph is a way to represent entities and the relationships between those entities. A Semantic Knowledge Graph (SKG) is described in this paper by Grainger et al. (2016) and explained in this presentation. As the authors put it:
“In its most basic use case (a corpus of free-text documents), a Semantic Knowledge Graph can be leveraged to automatically discover domain-specific relationships between entities within a domain.”
Basically this is a tool for exploring a dataset. The user defines the nodes they want to examine, and the SKG returns the edge, if any, as a list of documents, along with entities in those documents.
Okay, ‘entities’ may be a bit of a misnomer in this case. As implemented in the paper, the SKG is a Solr index, and an edge between two nodes is found by using the two nodes as search criteria; the significant terms in the search results define the edge. Interestingly enough, Elasticsearch can do this out of the box with a significant terms aggregation (the query below uses the significant_text variant, which is designed for free-text fields). So the example from slide 48, rewritten for Elasticsearch, would look something like:
{
  "_source": false,
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "body_text": "jean grey"
          }
        },
        {
          "match_phrase": {
            "body_text": "in love"
          }
        }
      ]
    }
  },
  "aggregations": {
    "keywords": {
      "significant_text": { "field": "body_text", "min_doc_count": 2, "gnd": {} }
    }
  }
}
And the results:
"aggregations": {
  "keywords": {
    "doc_count": 2,
    "bg_count": 554724,
    "buckets": [
      {
        "key": "jean",
        "doc_count": 2,
        "score": 0.7133133259246418,
        "bg_count": 138
      },
      {
        "key": "wolverine",
        "doc_count": 2,
        "score": 0.6697507358526242,
        "bg_count": 304
      },
      {
        "key": "grey",
        "doc_count": 2,
        "score": 0.6543382225202607,
        "bg_count": 407
      }
    ]
  }
}
This does work. However, one of the weaknesses of building an SKG in this manner becomes apparent: the aggregation returns terms, not necessarily entities (above, ‘jean’ and ‘grey’ come back as two separate buckets). This probably didn’t matter too much in Grainger’s primary use case, which was investigating job descriptions with CareerBuilder’s data. Most of the job skills in his examples can be expressed as single terms: ‘java’, ‘hadoop’, etc.
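To make the request above easier to reuse, here is a minimal Python sketch that builds the same body for any field and list of phrases, and pulls the terms back out of a response. The function names `build_edge_query` and `edge_terms` are my own for illustration and are not taken from Grainger’s code or from any library; the response shape is simply the one shown in the results above.

```python
def build_edge_query(field, phrases, min_doc_count=2):
    """Build a request body that requires every phrase and aggregates significant text.

    Mirrors the hand-written JSON request: one match_phrase clause per node,
    plus a significant_text aggregation named "keywords" on the same field.
    """
    return {
        "_source": False,
        "query": {
            "bool": {
                "must": [{"match_phrase": {field: phrase}} for phrase in phrases]
            }
        },
        "aggregations": {
            "keywords": {
                "significant_text": {
                    "field": field,
                    "min_doc_count": min_doc_count,
                    "gnd": {},
                }
            }
        },
    }


def edge_terms(response):
    """Return (term, score) pairs from the "keywords" aggregation of a response dict."""
    buckets = response["aggregations"]["keywords"]["buckets"]
    return [(bucket["key"], bucket["score"]) for bucket in buckets]


if __name__ == "__main__":
    import json

    # Print the body for the "jean grey" / "in love" example.
    print(json.dumps(build_edge_query("body_text", ["jean grey", "in love"]), indent=2))
```

The dict can be sent as the body of a `_search` request; nothing here talks to a cluster, so it is safe to run on its own.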
In his presentation, Grainger also explored the Stack Exchange archives to see who was in love with Jean Grey. This exploration of unstructured text documents struck me as a more interesting use case.
In my version, the SKG is essentially a Python wrapper around an Elasticsearch search with a significant terms aggregation. There is also a parameter to recurse through the returned terms and find new edges, which is something Grainger describes in his presentation. I have also used spaCy to identify all the entities in each post and comment and indexed those as well, using a custom analyzer in Elasticsearch to preserve the whitespace in the entity names. So now an aggregation on the “entities” field will return named entities. So this query:
{
  "_source": false,
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "entities": "bruce banner"
          }
        },
        {
          "match_phrase": {
            "entities": "iron man"
          }
        }
      ]
    }
  },
  "aggregations": {
    "keywords": {
      "significant_text": {
        "field": "entities",
        "min_doc_count": 2,
        "gnd": {}
      }
    }
  }
}
Will return:
"aggregations": {
  "keywords": {
    "doc_count": 23,
    "bg_count": 554724,
    "buckets": [
      {
        "key": "bruce banner's",
        "doc_count": 3,
        "score": 0.822958823926925,
        "bg_count": 16
      },
      {
        "key": "hulkbuster",
        "doc_count": 2,
        "score": 0.8229399281981236,
        "bg_count": 2
      },
      {
        "key": "hawkeye",
        "doc_count": 3,
        "score": 0.8137725253591913,
        "bg_count": 24
      },
      {
        "key": "black widow",
        "doc_count": 3,
        "score": 0.8137725253591913,
        "bg_count": 24
      },
      {
        "key": "onslaught",
        "doc_count": 2,
        "score": 0.8136048265539749,
        "bg_count": 4
      },
      {
        "key": "bruce banners",
        "doc_count": 2,
        "score": 0.8136048265539749,
        "bg_count": 4
      },
      {
        "key": "quinjet",
        "doc_count": 2,
        "score": 0.810387391168628,
        "bg_count": 5
      },
      {
        "key": "bruce banner",
        "doc_count": 15,
        "score": 0.8098473132902205,
        "bg_count": 126
      },
      {
        "key": "the battle of new york",
        "doc_count": 2,
        "score": 0.8013578289217543,
        "bg_count": 9
      },
      {
        "key": "venom",
        "doc_count": 2,
        "score": 0.7530327846085136,
        "bg_count": 35
      }
    ]
  }
}
Sweet! “black widow” and “the battle of new york” are two of the entities that define the edge between Bruce Banner and Iron Man.
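The recursive part of the wrapper can be sketched roughly like this. This is not the code from my repo, just a minimal illustration under a few assumptions: `search` is any callable that takes a request body and returns the response as a dict (in practice a thin function around the Elasticsearch client’s search call), the names `explore` and `max_terms` are made up for this example, and the recursion strategy (each returned term becomes the starting node of the next search) is one plausible reading of “recurse through the returned terms”, not necessarily what Grainger does.

```python
def explore(search, field, nodes, depth=1, max_terms=5, _seen=None):
    """Find the edge between `nodes`, then optionally recurse on each returned term.

    Returns a list of {"term", "score", "children"} dicts. Terms that have
    already been visited (including the starting nodes) are skipped so the
    walk cannot loop back on itself.
    """
    seen = set(nodes) if _seen is None else _seen
    body = {
        "_source": False,
        "query": {
            "bool": {"must": [{"match_phrase": {field: node}} for node in nodes]}
        },
        "aggregations": {
            "keywords": {
                "significant_text": {"field": field, "min_doc_count": 2, "gnd": {}}
            }
        },
    }
    buckets = search(body)["aggregations"]["keywords"]["buckets"]

    edge = []
    for bucket in buckets[:max_terms]:
        term = bucket["key"]
        if term in seen:
            continue
        seen.add(term)
        entry = {"term": term, "score": bucket["score"], "children": []}
        if depth > 1:
            # Assumption: the next hop starts from the returned term alone.
            entry["children"] = explore(
                search, field, [term], depth - 1, max_terms, seen
            )
        edge.append(entry)
    return edge
```

Passing the search function in as an argument keeps the traversal logic testable without a running cluster: a stub that returns canned responses is enough to check the recursion.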
Obviously there is still room for improvement. Three separate versions of “bruce banner” are being returned, so we could add some type of entity disambiguation. I also think there might be a use case for training spaCy to recognize some additional domain-specific entities, perhaps a new entity category for science fiction concepts like ‘time travel’, ‘Crisis on Infinite Earths’, or ‘cyberpunk’.
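As a stopgap short of real entity disambiguation, the duplicate surface forms could be folded together after the aggregation comes back. The sketch below is only that, a surface-form merge: it treats a key ending in ‘'s’ or ‘s’ as a variant only when the shorter form is also present in the same result, so a name like ‘venom’ is left alone. The function name `merge_variants` and the output fields are hypothetical, and the summed `doc_count` is an upper bound, because a single document may contain more than one variant.

```python
def merge_variants(buckets):
    """Fold possessive and plural-looking keys into a base key that is also present."""
    keys = {bucket["key"] for bucket in buckets}

    def base_form(key):
        # Only strip a suffix when the stripped form is itself one of the keys.
        for suffix in ("'s", "’s", "s"):
            stem = key[: -len(suffix)]
            if key.endswith(suffix) and stem in keys:
                return stem
        return key

    merged = {}
    for bucket in buckets:
        base = base_form(bucket["key"])
        entry = merged.setdefault(
            base, {"key": base, "doc_count": 0, "score": 0.0, "variants": []}
        )
        entry["doc_count"] += bucket["doc_count"]  # upper bound, may double count
        entry["score"] = max(entry["score"], bucket["score"])
        entry["variants"].append(bucket["key"])

    # Highest-scoring entries first, like the original bucket order.
    return sorted(merged.values(), key=lambda e: e["score"], reverse=True)
```

Run on the buckets above, this collapses “bruce banner's”, “bruce banners”, and “bruce banner” into a single “bruce banner” entry and leaves the other entities untouched.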
My repo is here. Grainger’s original implementation in Solr is available here.