Prosopographic Modelling - Individuals

Written by: Ingo Börner
Published on: June 9, 2021
Tagged with: Prosopographic data and Semantic web

In this course on prosopographic modelling with semantic technologies, individuals are the entities in the projects NAMPI and VieCPro that are described using semantic technologies. In NAMPI, statements about individuals are anchored in specific events that are connected to individual persons. If several people take part in the same event, the event has a different meaning for each of them. For example, when somebody receives the title of abbot, the change of title is what matters for this individual clergyman; other people may take part as witnesses, a bishop may award the title, etc. Semantic technologies are used to express these relationships. VieCPro records information about members of the Viennese Court as entities. This information can come from various sources, but it is connected through semantic technologies so that statements about these people can be made. For example, from information found in three different records you can state that somebody you have found is a person, that this person worked at the Court, and that they left their post on a specific date. All of this information can then be accessed for an individual in VieCPro.

Main concepts

Class vs. instance/individual
rdf:type
External reference resources (Wikidata, GeoNames, GND, DBpedia)
owl:sameAs

Class vs. instance/individual

A simple fact about an entity that is identified by the URI <https://viecpro.acdh.oeaw.ac.at/entity/140450/> could be: “The thing with the number 140450 is a person.” To express that the individual is of a certain type, the VieCPro project re-uses the class “Person” (E21_Person) from the CIDOC-CRM ontology.

In RDF Turtle notation the information is recorded as follows:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<https://viecpro.acdh.oeaw.ac.at/entity/140450/> rdf:type <http://www.cidoc-crm.org/cidoc-crm/E21_Person> .

Turtle also offers a more compact form: the keyword a is a shorthand for rdf:type, and a declared prefix such as crm: abbreviates the URIs of the classes and properties of the CIDOC-CRM ontology. The same information can then be expressed as follows:

@prefix crm: <http://www.cidoc-crm.org/cidoc-crm/> .

<https://viecpro.acdh.oeaw.ac.at/entity/140450/> a crm:E21_Person .

This states that the individual identified by <https://viecpro.acdh.oeaw.ac.at/entity/140450/> is of the class E21_Person.
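To see that the long and the abbreviated notation describe the same triple, the prefix expansion can be sketched in a few lines of Python. This is only an illustration: the helper expand is hypothetical and covers just the three term shapes used above, not the full Turtle grammar.

```python
# Prefix table mirroring the declarations used in this section.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "crm": "http://www.cidoc-crm.org/cidoc-crm/",
}


def expand(term, prefixes=PREFIXES):
    """Expand a Turtle-style term to a full URI (hypothetical helper)."""
    if term == "a":  # Turtle keyword standing for rdf:type
        return prefixes["rdf"] + "type"
    if term.startswith("<") and term.endswith(">"):
        return term[1:-1]  # already a full URI, just strip the brackets
    prefix, local = term.split(":", 1)
    return prefixes[prefix] + local


entity = "<https://viecpro.acdh.oeaw.ac.at/entity/140450/>"

# The triple in its long form and in its abbreviated form.
long_form = tuple(
    expand(t)
    for t in (entity, "rdf:type", "<http://www.cidoc-crm.org/cidoc-crm/E21_Person>")
)
short_form = tuple(expand(t) for t in (entity, "a", "crm:E21_Person"))

print(long_form == short_form)
```

Both tuples contain the same three full URIs, which is why a triple store treats the two notations as one and the same statement.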

External reference resources

The datasets contain a lot of triples stating information about the individuals of certain classes, e.g. that our individual <https://viecpro.acdh.oeaw.ac.at/entity/140450/> is called “August Sinzendorf” and that he died on December 31st, 1677. In general, however, the information on this entity is rather scarce and covers only the most basic facts. In an ideal digital world, not every project would have to research such basic facts, e.g. the date and place of birth of a person, itself, but could rely on information that is already available on the web. There are some good sources of this kind of data, which we refer to as external reference resources. Examples are the GND (Gemeinsame Normdatei) and Wikidata, which provides its data as Linked Data and also has a SPARQL endpoint with a relatively comfortable query interface.

To give an example: Wikipedia has an article on the Bavarian-Austrian noble family Sinzendorf: https://de.wikipedia.org/wiki/Sinzendorf_(Adelsgeschlecht). The corresponding item on Wikidata can be reached via the menu entry Wikidata Item (Wikidata-Datenobjekt): https://www.wikidata.org/wiki/Q325728 . Please bear in mind that this is only the web representation of the item and should not be mistaken for the actual URI: http://www.wikidata.org/entity/Q325728 (http:// instead of https:// and /entity/ instead of /wiki/).
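The difference between the web page and the entity URI can be captured in a small conversion function. The following Python sketch assumes that the input is an item page of the form shown above; the function name wikidata_entity_uri is made up for this example.

```python
import re


def wikidata_entity_uri(page_url):
    """Turn the URL of a Wikidata item page into the item's entity URI."""
    match = re.fullmatch(
        r"https?://www\.wikidata\.org/wiki/(Q[0-9]+)", page_url.strip()
    )
    if match is None:
        raise ValueError(f"not a Wikidata item page: {page_url!r}")
    # Entity URIs use http:// and /entity/ instead of https:// and /wiki/.
    return "http://www.wikidata.org/entity/" + match.group(1)


print(wikidata_entity_uri("https://www.wikidata.org/wiki/Q325728"))
# prints http://www.wikidata.org/entity/Q325728
```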

The data can be queried directly via SPARQL, e.g. for the coat of arms of the Sinzendorf family.
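Outside the query interface, a SPARQL query is sent to an endpoint as an ordinary HTTP request. The sketch below only builds such a request with Python's standard library and does not send it. The example query (ten arbitrary statements about the Sinzendorf item) and the User-Agent string are assumptions made for this illustration.

```python
import urllib.parse
import urllib.request

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query: ten arbitrary statements about the item Q325728.
QUERY = """
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?p ?o WHERE { wd:Q325728 ?p ?o } LIMIT 10
"""


def build_sparql_request(query, endpoint=WIKIDATA_ENDPOINT):
    """Build (but do not send) a GET request carrying a SPARQL query."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {
        # Ask for the standard JSON serialisation of SPARQL results.
        "Accept": "application/sparql-results+json",
        # Placeholder; a real client should identify itself properly.
        "User-Agent": "prosopography-course-example/0.1",
    }
    return urllib.request.Request(url, headers=headers)


request = build_sparql_request(QUERY)
print(request.full_url[:60])

# Sending the request needs network access, e.g.:
# with urllib.request.urlopen(request) as response:
#     results = json.load(response)   # requires "import json"
```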

`owl:sameAs`

OWL (the Web Ontology Language) is an ontology language grounded in description logic. We use owl:sameAs to state that two URIs identify the same individual. This implies that every statement that is true of the thing identified by one URI is also true of the thing identified by the other.
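What this implication means in practice can be sketched in Python. The sketch is a simplification under stated assumptions: the URIs are invented, triples are plain tuples, and statements are only copied for the subject position, whereas a full OWL reasoner would also substitute identical individuals in the object position.

```python
from collections import defaultdict


def same_as_groups(pairs):
    """Group URIs linked by owl:sameAs (treated as symmetric and transitive)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for uri in parent:
        groups[find(uri)].add(uri)
    return list(groups.values())


def merged_statements(statements, pairs):
    """Copy each statement to every URI that is sameAs its subject."""
    group_of = {uri: g for g in same_as_groups(pairs) for uri in g}
    return {
        (alias, p, o)
        for s, p, o in statements
        for alias in group_of.get(s, {s})
    }


# Invented URIs for one person recorded three times.
A, B, C = (f"https://example.org/person/{n}" for n in (1, 2, 3))
statements = {(A, "rdfs:label", "August Sinzendorf")}
links = [(A, B), (B, C)]

for triple in sorted(merged_statements(statements, links)):
    print(triple)
```

Although the label was only stated for the first URI, it now holds for all three, because the links chain them into one group.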

Sample datasets

The datasets used for the following exercises are taken from the projects NAMPI and VieCPro. They can be queried in the Research Space instance. The URIs of the relevant named graphs are:

Entities of the NAMPI-Project: <https://nampi.org/entities>
Entities of the VieCPro-Project: <https://viecpro.acdh-dev.oeaw.ac.at/entities#>
Deduplicated/Unique Persons of the VieCPro-Project: <https://deduplication.in.viecpro.acdh-dev.oeaw.ac.at/>
Candidates for Linking to Wikidata: <https://matchingcandidates.viecpro.acdh-dev.oeaw.ac.at/>

Exercise

Task 1: Explore the individuals of the datasets

How many individuals are there of which class/type?

The projects NAMPI and VieCPro use different ontologies to model their data. The individual entities (persons, places, …) are instances of different classes. You can list the classes and count the individuals with the following SPARQL query:

SELECT ?class (COUNT(?entity) AS ?cnt) (SAMPLE(?entity) AS ?example) (SAMPLE(?label) AS ?exampleLabel) FROM <https://nampi.org/entities> WHERE {
  ?entity a ?class .
  OPTIONAL { ?entity rdfs:label ?label }
}
GROUP BY ?class
ORDER BY DESC(?cnt)

This query lists all classes in the named graph of the NAMPI project (… FROM <https://nampi.org/entities> …). The URI of each class is bound to the variable ?class. The function COUNT counts the individuals per class; the grouping itself is done by GROUP BY in the lower part of the query. For each ?class we also include a sample individual in the output: its URI is bound to the variable ?example and a sample label to ?exampleLabel.

In the WHERE clause we specify that we want every entity ?entity that is of rdf:type (in the query we use the short notation a) some class ?class. If the entity has an rdfs:label attached, which is optional (OPTIONAL), we bind it to the variable ?label. We group everything by ?class and sort the results in descending order by the number of individuals ?cnt per class.
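For readers who are new to GROUP BY, the logic of the query can be imitated on a handful of toy records in Python. All URIs and labels below are invented, and the sketch ignores details of SPARQL evaluation (for instance, an entity with several labels would produce several rows).

```python
from collections import defaultdict

# Toy data standing in for "?entity a ?class" and "?entity rdfs:label ?label".
TYPES = [
    ("ex:p1", "ex:Person"),
    ("ex:p2", "ex:Person"),
    ("ex:pl1", "ex:Place"),
]
LABELS = {"ex:p1": "Person One", "ex:pl1": "Place One"}


def count_by_class(types, labels):
    """Imitate GROUP BY ?class with COUNT and SAMPLE, sorted by count."""
    groups = defaultdict(list)
    for entity, cls in types:
        groups[cls].append(entity)

    rows = []
    for cls, entities in groups.items():
        example = entities[0]  # SAMPLE may return any member of the group
        # The sample label is picked independently of the sample entity.
        example_label = next((labels[e] for e in entities if e in labels), None)
        rows.append((cls, len(entities), example, example_label))
    return sorted(rows, key=lambda row: row[1], reverse=True)


for row in count_by_class(TYPES, LABELS):
    print(row)
```

One detail worth keeping in mind when reading the real results: the two SAMPLE expressions are evaluated independently, so the example label in a row does not necessarily belong to the example entity in the same row.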

What are the most frequent classes in NAMPI?

To get an overview of the VieCPro dataset adapt the query to list classes and count the individuals. Hint: You only have to replace the URI of the named graph in the FROM statement.

Solution:

SELECT ?class (COUNT(?entity) AS ?cnt) (SAMPLE(?entity) AS ?example) (SAMPLE(?label) AS ?exampleLabel) FROM <https://viecpro.acdh-dev.oeaw.ac.at/entities#> WHERE {
  ?entity a ?class .
  OPTIONAL { ?entity rdfs:label ?label }
}
GROUP BY ?class
ORDER BY DESC(?cnt)

Do we have owl:sameAs properties?

We normally use the property owl:sameAs to create links between entities and external data sets or reference resources. The property implies that two entities are exactly the same. To check if our data sets include links to external resources, we can look for this property by the following SPARQL queries:

We can simply list all entities that are in the subject position of a triple containing owl:sameAs as the predicate:

SELECT ?entity (?o AS ?sameAs) WHERE {
  ?entity owl:sameAs ?o .
}

It is normally a good idea to include a LIMIT clause at the end of the query, so that the system does not try to return all results at once and possibly crash something.

We can also group the results by the named graph they are contained in and count the individuals having these links:

SELECT (SAMPLE(?g) AS ?dataset) (COUNT(?entity) AS ?cntEntity)  WHERE {
  GRAPH ?g { ?entity owl:sameAs ?o . }
}
GROUP BY ?g

Are the references to external reference resources maybe available under another property?

The property owl:sameAs is not the only one that is used to express the notion of something being equivalent to an entity in an external resource. NAMPI uses a different property. Can you find out which one it is? It is helpful to look into one entity and explore it in the Ontodia view in Research Space. For example, look for the entity called “Hieronymus (Übelbacher) von Dürnstein” (<https://purl.org/nampi/data/person/2a06e212-8567-4fe2-9547-79d9b3cd462a>).

Solution:

NAMPI also uses a property called “sameAs”, taken not from the OWL ontology but from schema.org. In the dataset, schema:sameAs connects entities to several external reference resources. To find out which ones the project uses, we can list them:

SELECT (SAMPLE(?baseuri) AS ?externalReferenceResource) (COUNT(?entity) AS ?cnt) FROM <https://nampi.org/entities> WHERE {
  ?entity schema:sameAs ?o .
  BIND(STR(?o) AS ?uri)
  BIND(REPLACE(?uri,'/Q?[0-9]+X?$','') AS ?baseuri)
}
GROUP BY ?baseuri
ORDER BY DESC(?cnt)

In the query we convert each external URI ?o to a string with STR and BIND it to the variable ?uri. We then use the function REPLACE with a so-called regular expression to remove the individual ID at the end of the URI and bind the remainder to ?baseuri. The results are grouped by ?baseuri, so you receive a list of external resources with the number of individuals linked to each.

Which external reference resources are used in NAMPI?

Solution:

GeoNames geographical database, Wikidata and GND
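The effect of the regular expression in the REPLACE step can be tried out in Python, whose re module treats this simple pattern the same way. The GND-style and GeoNames-style URIs below are invented for illustration (only the two Wikidata IDs occur in this lesson), and the exact form of the URIs stored in NAMPI is an assumption.

```python
import re
from collections import Counter

# Same pattern as in the query: a slash, an optional "Q", digits,
# an optional trailing "X", anchored at the end of the string.
ID_SUFFIX = re.compile(r"/Q?[0-9]+X?$")


def base_uri(uri):
    """Strip the trailing identifier from an external URI."""
    return ID_SUFFIX.sub("", uri)


uris = [
    "http://www.wikidata.org/entity/Q325728",
    "http://www.wikidata.org/entity/Q665807",
    "https://d-nb.info/gnd/12345678X",   # invented GND-style ID
    "https://www.geonames.org/1234567",  # invented GeoNames-style ID
]

counts = Counter(base_uri(u) for u in uris)
for base, cnt in counts.most_common():
    print(cnt, base)

# A URI ending in a slash does not match the pattern and stays unchanged:
print(base_uri("https://www.geonames.org/1234567/"))
```

The last line shows a limitation of the pattern: because it is anchored at the end of the string, URIs with a trailing slash would each end up in a group of their own.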

Task 2: Investigate possible links to external reference resources

As we found out, NAMPI uses schema:sameAs to connect its dataset to the linked data cloud. In VieCPro there are no explicit connections. Furthermore, the same individuals are included in the dataset multiple times. We created a named graph <https://deduplication.in.viecpro.acdh-dev.oeaw.ac.at/> in which we used owl:sameAs to link the individual items together.

Query this graph and look at one entity to understand how the linking is done. Hint: use the triple ?entity owl:sameAs ?item to get relevant results from the named graph!

Linking the entities together already simplifies working with the VieCPro data set, but so far we only have internal connections.

It would be nice to connect some of the included entities to external reference resources. Finding the right candidate often cannot be done automatically.

At the beginning, we looked at the external reference resource Wikidata and investigated the entity representing the noble family of Sinzendorf, see: https://www.wikidata.org/wiki/Q325728. If you look at the item in Wikidata again, you will see the statement that this entity is described by a reference work that could be of use for the project: “Biographisches Lexikon des Kaiserthums Oesterreich” (biographical lexicon of the Austrian Empire) (https://www.wikidata.org/wiki/Q665807). Persons included in this 60-volume biographical work by Constant von Wurzbach have been entered into Wikidata and can therefore be queried with SPARQL in Wikidata’s query interface, using the following query:

SELECT DISTINCT ?item ?itemLabel ?itemDescription ?dateBirth ?dateDeath
WHERE
{
  ?item wdt:P1343 wd:Q665807 ;
        wdt:P31 wd:Q5 .

  OPTIONAL { ?item wdt:P569 ?dateBirth . }
  OPTIONAL { ?item wdt:P570 ?dateDeath . }


  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en,de,ru,fr,es,it,ja,zh" }
}
LIMIT 500

We downloaded the results and included them in a separate named graph in the Research Space instance <https://wurzbach-wikidata.acdh-dev.oeaw.ac.at/>. We then tried to match the individuals of the class Person in VieCPro with these entities by comparing the labels only. The results have been included in a separate named graph <https://matchingcandidates.viecpro.acdh-dev.oeaw.ac.at/>. It contains triples stating that a person that is in “Wurzbach” and a deduplicated person have the same name. We therefore introduced a custom property “identicalName” <https://summer2020.acdh-dev.oeaw.ac.at/custom_properties/identicalName>; owl:sameAs would not have been semantically correct, because we do not know whether persons that have the same name are really the same entity. The graph contains some 30 candidates, and you can investigate them to see whether there are any true overlaps.
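The idea of matching by label only can be sketched as a simple join on the name string. This is not the procedure that was actually used to build the candidates graph, just a minimal illustration; all URIs and the second name are invented.

```python
def identical_name_candidates(viecpro_labels, wurzbach_labels):
    """Pair up persons whose labels are exactly the same string."""
    by_name = {}
    for uri, name in wurzbach_labels.items():
        by_name.setdefault(name, []).append(uri)

    # The result only claims "same name", not "same person".
    return [
        (person, "custom:identicalName", candidate)
        for person, name in viecpro_labels.items()
        for candidate in by_name.get(name, [])
    ]


# Invented example records.
viecpro = {
    "https://example.org/viecpro/1": "August Sinzendorf",
    "https://example.org/viecpro/2": "Maria Beispiel",
}
wurzbach = {
    "https://example.org/wurzbach/a": "August Sinzendorf",
    "https://example.org/wurzbach/b": "August Sinzendorf",  # a namesake
}

for triple in identical_name_candidates(viecpro, wurzbach):
    print(triple)
```

The namesake in the toy data shows why a weaker property than owl:sameAs is the safer choice here: one VieCPro person ends up with two candidates, and at most one of them can be the same individual.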

It’s helpful to fetch some information from all the relevant graphs and include some information from Wikidata as well. We therefore prepared the following complex query:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX crm: <http://www.cidoc-crm.org/cidoc-crm/>
PREFIX nampi: <https://purl.org/nampi/owl/core#>
PREFIX custom: <https://summer2020.acdh-dev.oeaw.ac.at/custom_properties/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT ?entity (SAMPLE(?name) AS ?entityName) (SAMPLE(?wd) AS ?wikidataUri) (SAMPLE(?WurzbachArticle) AS ?BioArcticleURI) (SAMPLE(?firstLine) AS ?WurzbachText) (SAMPLE(?WikidataDeathYear) AS ?yearOfDeathWikidata) (SAMPLE(?deathYear) AS ?yearOfDeathViecPro) WHERE {
  GRAPH <https://matchingcandidates.viecpro.acdh-dev.oeaw.ac.at/> {
    ?entity custom:identicalName ?wd .
  }

  GRAPH <https://deduplication.in.viecpro.acdh-dev.oeaw.ac.at/> {
    ?entity rdfs:label ?name ;
            owl:sameAs ?item .
  }

  GRAPH <https://viecpro.acdh-dev.oeaw.ac.at/entities#> {
    ?death crm:P100_was_death_of ?item ;
           crm:P4_has_time-span ?deathTs .

    ?deathTs crm:P82a_begin_of_the_begin ?deathDate .

    BIND(YEAR(?deathDate) as ?deathYear)
    BIND(MONTH(?deathDate) as ?deathMonth)
    BIND(DAY(?deathDate) as ?deathDay)
  }

  SERVICE <https://query.wikidata.org/sparql> {
    ?wd wdt:P569 ?WikidataDateBirth ;
        wdt:P570 ?WikidataDateDeath .

    ?WurzbachArticle wdt:P921 ?wd ;
                     wdt:P1433 wd:Q665807 ;
                     wdt:P1922 ?firstLine .

    BIND(YEAR(?WikidataDateDeath) as ?WikidataDeathYear)
    BIND(MONTH(?WikidataDateDeath) as ?WikidataDeathMonth)
    BIND(DAY(?WikidataDateDeath) as ?WikidataDeathDay)
  }

  FILTER(?WikidataDeathYear = ?deathYear)
  #FILTER(?WikidataDeathMonth = ?deathMonth)
  #FILTER(?WikidataDeathDay = ?deathDay)
}
GROUP BY ?entity

Let’s look at the WHERE clause first: We query several named graphs:

GRAPH <https://matchingcandidates.viecpro.acdh-dev.oeaw.ac.at/>
{
  ?entity custom:identicalName ?wd .

}

This part retrieves the Wikidata URIs from the matching candidates named graph and assigns them to the variable ?wd (Wikidata).

The next part retrieves the normalized name in rdfs:label and the connections to items that are linked together in the deduplication graph.

When we have all the linked items, we retrieve the date of death and split it into year, month and day of death, e.g. BIND(YEAR(?deathDate) as ?deathYear). We bind this information to corresponding variables. We will later use these variables to compare the information gathered in VieCPro with the possible matches in Wikidata.

The part of the query following the clause SERVICE <https://query.wikidata.org/sparql> is a so-called federated query. It retrieves additional information directly from Wikidata’s SPARQL endpoint https://query.wikidata.org/sparql. From Wikidata we retrieve the dates of birth and death of the entities, which we can compare to VieCPro’s data.

We also retrieve the relevant article from Wurzbach and can even include the first line of the entry.

In the last part we keep only those entities for which the death year in Wikidata and in VieCPro match: FILTER(?WikidataDeathYear = ?deathYear). Finally, we group them by the URI of the ?entity from our deduplication graph, which we queried at the beginning.
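The comparison made by the FILTER, and the stricter variants that are commented out in the query, can be sketched in Python. The VieCPro date is the death date of August Sinzendorf mentioned earlier; the second date is invented to show a match on the year only, and the function name dates_agree is hypothetical.

```python
from datetime import date

# Which date parts must be equal for each level of precision.
FIELDS = {
    "year": ("year",),
    "month": ("year", "month"),
    "day": ("year", "month", "day"),
}


def dates_agree(a, b, precision="year"):
    """Compare two dates down to the given precision."""
    return all(getattr(a, f) == getattr(b, f) for f in FIELDS[precision])


viecpro_death = date(1677, 12, 31)   # from the VieCPro example above
candidate_death = date(1677, 11, 14)  # invented value for illustration

print(dates_agree(viecpro_death, candidate_death))                   # True
print(dates_agree(viecpro_death, candidate_death, precision="day"))  # False
```

Comparing only the year is a deliberately loose test: it tolerates sources that disagree on the exact day, at the price of letting through more false candidates that then have to be checked by hand.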

Because we have a lot of duplicates in the data and we are grouping the results, we need to include samples in the SELECT clause. We display the following variables: ?entity, ?entityName, ?wikidataUri, URI of the article from Wurzbach in Wikidata ?BioArcticleURI and the article’s first line in ?WurzbachText; and the years of death of the entity ?yearOfDeathWikidata and ?yearOfDeathViecPro.

Investigate the results of the query! Are there any true matches? How can this information be put back into the system?

Additional Task 3: Can we do dirty matching on strings (names)?

Exact Match on rdfs:label of a person in both datasets

SELECT *
WHERE
{
  GRAPH <https://nampi.org/entities> { ?person a nampi:person ;
    rdfs:label ?personLabel .
    #FILTER(REGEX(?personLabel,'^Z'))
  }

  GRAPH <https://viecpro.acdh-dev.oeaw.ac.at/entities#> { ?person2 a crm:E21_Person ;
    rdfs:label ?person2Label .
    BIND(STR(?person2Label) AS ?person2LabelString)
    #FILTER(REGEX(?person2Label,'^Z'))
  }
  FILTER(?personLabel = ?person2LabelString)
}

Match on surname

SELECT ?person ?personLabel ?person2 ?person2Label
WHERE
{
  GRAPH <https://nampi.org/entities> { ?person a nampi:person ;
    rdfs:label ?personLabel .
    BIND(REPLACE(?personLabel,"^([A-Z][a-z]*?) .*?$","$1") AS ?forename)
    BIND(REPLACE(?personLabel,"^.*?\\s([A-Za-z]+)$","$1") AS ?surname)
  }

  GRAPH <https://viecpro.acdh-dev.oeaw.ac.at/entities#> { ?person2 a crm:E21_Person ;
    rdfs:label ?person2Label .
    BIND(REPLACE(?person2Label,"^([A-Z][a-z]*?) .*?$","$1") AS ?forename2)
    BIND(REPLACE(?person2Label,"^.*?\\s([A-Za-z]+)$","$1") AS ?surname2)
  }
  FILTER(?surname = STR(?surname2))
  #FILTER(?forename = STR(?forename2))
}
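The two regular expressions of the surname query can be tested on individual labels in Python before running them against the full datasets. The patterns are the same as in the query (with $1 written as \1); the second test label is invented. That Python and the SPARQL engine behave identically for these patterns is an assumption that holds for simple cases like these.

```python
import re


def forename(label):
    """First capitalised word of the label, as in the ?forename BIND."""
    return re.sub(r"^([A-Z][a-z]*?) .*?$", r"\1", label)


def surname(label):
    """Last word of the label if it consists of ASCII letters only."""
    return re.sub(r"^.*?\s([A-Za-z]+)$", r"\1", label)


for label in (
    "August Sinzendorf",
    "Johann Georg Beispiel",                   # invented label
    "Hieronymus (Übelbacher) von Dürnstein",
):
    print(repr(forename(label)), repr(surname(label)))
```

For the third label the surname comes back as the complete, unchanged label: the class [A-Za-z] does not cover the ü in “Dürnstein”, so the pattern does not match and the replacement leaves the string as it is. Names with umlauts or other non-ASCII letters are therefore silently excluded from this kind of matching.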
