Our study relies on real data from the Billion Triples Challenge 2012 dataset (BTC12), DBpedia, Kasabi and the Linked Archives Hub project. To capture the differences in the heterogeneity and semantic relationships of descriptions, we distinguish between data originating from sources in the center and in the periphery of the LOD cloud. In general, central sources, such as DBpedia and Freebase, are derived from a common source, Wikipedia, from which they extract information regarding an entity. Such descriptions often refer to the original wiki page and feature synonym attributes whose values share a significant number of common tokens. Since they have been exhaustively studied in the literature, descriptions across central LOD sources are heavily interlinked, mostly through owl:sameAs links.
All the datasets are stored in RDF in the N-Triples format, with triples containing a blank node and triples present in the ground truth removed. Moreover, from all datasets other than BTC12DBpedia, we kept only the entity descriptions whose linked description in BTC12DBpedia is known, and removed the rest. This way, we know that any suggested comparison between a pair of descriptions outside the ground truth is false. Next, we provide statistics about these datasets: the number of contained triples, descriptions and attributes, and the average number of attribute-value pairs per description. We have also included the number of entity types, taken as the distinct values of the property rdf:type, when provided. Observe that BTC12DBpedia contains more types than attributes. This is because DBpedia entities may have multiple types from taxonomic ontologies like Yago. IMDB is the dataset with the highest number of attribute-value pairs per description. To save space, each subject of an entity description has been replaced with a numeric id that is positive for BTC12DBpedia and negative for the other datasets. Each dataset link contains the dataset with numeric ids, as well as the file mapping numeric ids to URIs. Finally, we have included for each dataset the number of duplicate descriptions based on our ground truth, i.e., descriptions that have been reported to be equivalent (via owl:sameAs links) across all datasets of our testbed. Taking into account the transitivity of equality, those descriptions should be regarded as matches, too.
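The transitivity argument above can be illustrated with a small union-find sketch. This is not the code used to build the testbed; the function names and the toy ids are hypothetical, and the only convention taken from the text is that BTC12DBpedia ids are positive and ids of the other datasets are negative. Given owl:sameAs links as pairs of numeric ids, it groups descriptions into equivalence clusters and lists every pair that should count as a match.

```python
from collections import defaultdict
from itertools import combinations


def same_as_clusters(links):
    """Group ids connected, directly or transitively, by sameAs links.

    `links` is an iterable of (id_a, id_b) pairs (hypothetical input format).
    Returns a list of sets, one per cluster with at least two members.
    """
    parent = {}

    def find(x):
        # Register unseen ids as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = defaultdict(set)
    for x in list(parent):
        clusters[find(x)].add(x)
    return [c for c in clusters.values() if len(c) > 1]


def matching_pairs(clusters):
    """Enumerate all unordered pairs of ids inside each cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(combinations(sorted(cluster), 2))
    return pairs


# Toy example: two descriptions (-10, -11) linked to the same DBpedia entity (1).
clusters = same_as_clusters([(1, -10), (1, -11), (2, -20)])
pairs = matching_pairs(clusters)
```

In this toy input, the pair (-11, -10) is never stated explicitly, yet it follows from the two links to entity 1; such pairs correspond to what the text calls duplicates within a dataset.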
Copyright notice: The copyright of the datasets belongs to the data providers, which can be found in the links provided in our TBD paper. Those datasets have been processed as described in the same publication, with the source code that can be found in the following GitHub repository. Please review the corresponding copyright notices before using the data. If you use those datasets, please cite our TBD paper as:
Vasilis Efthymiou, Kostas Stefanidis, Vassilis Christophides:
Benchmarking Blocking Algorithms for Web Entities. IEEE Trans. Big Data (to appear) (2017)
Contains descriptions of cross-domain entities, extracted from Wikipedia. This dataset was downloaded from BTC12.
RDF triples | 102,306,242 |
entity descriptions | 8,945,920 |
avg. attribute-value pairs per description | 11.44 |
attributes | 36,354 |
entity types | 258,202 |
attributes/entity types | 0.14 |
duplicates | 0 |
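As a rough sketch of how statistics like those in the table above could be derived from an N-Triples file, the snippet below counts triples, descriptions (distinct subjects), attributes (distinct predicates) and entity types (distinct rdf:type values). It rests on several assumptions that the source does not confirm: the line parser is a simplification rather than a full N-Triples parser, rdf:type is counted among the attributes, and the average number of attribute-value pairs is taken as triples divided by descriptions (which is consistent with the reported figures, e.g. 102,306,242 / 8,945,920 ≈ 11.44). The function names are hypothetical, and the removal of ground-truth triples is not handled here.

```python
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def parse_line(line):
    """Split one N-Triples line into (subject, predicate, object).

    Simplified: assumes whitespace-separated subject and predicate and a
    trailing ' .'; returns None for blank lines and comments.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    subject, predicate, rest = line.split(None, 2)
    obj = rest.rstrip()
    if obj.endswith("."):
        obj = obj[:-1].rstrip()  # drop the statement terminator
    return subject, predicate, obj


def dataset_stats(lines):
    """Compute table-style statistics, skipping triples with blank nodes."""
    triples = 0
    subjects, attributes, types = set(), set(), set()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        s, p, o = parsed
        if s.startswith("_:") or o.startswith("_:"):
            continue  # blank-node triples are removed, as described in the text
        triples += 1
        subjects.add(s)
        attributes.add(p)  # assumption: rdf:type also counts as an attribute
        if p == RDF_TYPE:
            types.add(o)
    return {
        "triples": triples,
        "descriptions": len(subjects),
        "avg_pairs_per_description": triples / len(subjects) if subjects else 0.0,
        "attributes": len(attributes),
        "entity_types": len(types),
        "attributes_per_type": len(attributes) / len(types) if types else None,
    }


sample = [
    '<http://ex.org/a> <http://ex.org/name> "Alice Smith" .',
    "<http://ex.org/a> " + RDF_TYPE + " <http://ex.org/Person> .",
    '<http://ex.org/b> <http://ex.org/name> "Bob" .',
    '_:b0 <http://ex.org/name> "Anon" .',
    "<http://ex.org/b> <http://ex.org/knows> _:b0 .",
]
stats = dataset_stats(sample)
```

For a real file, the same function could be called on an open file handle, since it only iterates over lines.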
Contains descriptions from the raw infoboxes of DBpedia 3.5.
RDF triples | 27,011,880 |
entity descriptions | 1,638,149 |
avg. attribute-value pairs per description | 16.49 |
attributes | 31,857 |
entity types | 5,535 |
attributes/entity types | 5.76 |
duplicates | 0 |
Originates from the BTC12 dataset, and consists of multiple data sources, like DBLP, geonames and drugbank.
RDF triples | 849,656 |
entity descriptions | 31,668 |
avg. attribute-value pairs per description | 26.83 |
attributes | 518 |
entity types | 33 |
attributes/entity types | 15.7 |
duplicates | 863 |
Contains descriptions of cross-domain entities. This dataset was downloaded from BTC12.
RDF triples | 25,050,970 |
entity descriptions | 1,849,180 |
avg. attribute-value pairs per description | 13.55 |
attributes | 8,323 |
entity types | 8,232 |
attributes/entity types | 1.01 |
duplicates | 12,058 |
Originates from Kasabi and contains descriptions regarding music bands and artists, extracted from MusicBrainz and Wikipedia.
RDF triples | 268,759 |
entity descriptions | 25,359 |
avg. attribute-value pairs per description | 10.60 |
attributes | 29 |
entity types | 4 |
attributes/entity types | 7.25 |
duplicates | 372 |
For LOCAH, we used the latest published version at Archives Hub (March 2014). This rather small dataset links descriptions of people from UK archival institutions with their descriptions in DBpedia.
RDF triples | 12,932 |
entity descriptions | 1,233 |
avg. attribute-value pairs per description | 10.49 |
attributes | 14 |
entity types | 4 |
attributes/entity types | 3.5 |
duplicates | 250 |
Contains descriptions of movies from DBpedia. This dataset was used in this work.
RDF triples | 180,680 |
entity descriptions | 27,615 |
avg. attribute-value pairs per description | 6.54 |
attributes | 5 |
entity types | 1 |
attributes/entity types | 5 |
duplicates | 0 |
Contains descriptions of movies from IMDB. This dataset was used in this work.
RDF triples | 816,012 |
entity descriptions | 23,182 |
avg. attribute-value pairs per description | 35.20 |
attributes | 7 |
entity types | 1 |
attributes/entity types | 7 |
duplicates | 0 |
To investigate the ability of blocking algorithms to recognize relatedness links beyond owl:sameAs among descriptions, we considered the following Kasabi datasets, linked to DBpedia, the dataset with the highest number of references.
Contains airport data, linked to BTC12DBpedia with the umbel:isLike property. This property is used to associate entities that may or may not be equivalent, but are believed to be so.
RDF triples | 238,973 |
entity descriptions | 12,294 |
avg. attribute-value pairs per description | 19.44 |
link | umbel:isLike |
links | 12,269 |
Contains airlines data, linked to BTC12DBpedia with the umbel:isLike property. This property is used to associate entities that may or may not be equivalent, but are believed to be so.
RDF triples | 15,465 |
entity descriptions | 1,141 |
avg. attribute-value pairs per description | 13.55 |
link | umbel:isLike |
links | 1,217 |
Contains data for the presentations of an ESWC conference. It is linked to DBpedia with the dct:subject property, which captures relatedness of entities to topics.
RDF triples | 6,743 |
entity descriptions | 2,932 |
avg. attribute-value pairs per description | 2.30 |
link | dct:subject |
links | 20,671 |
Describes books listed in the English language section of Dutch printed book auction catalogues of collections of scholars and religious ministers from the 17th century.
RDF triples | 2,993 |
entity descriptions | 748 |
avg. attribute-value pairs per description | 4.00 |
link | dct:subject |
links | 1,605 |
Contains data from the International Aid Transparency Initiative. IATI is also connected to DBpedia with the dct:coverage property, which associates an entity to its spatial or temporal topic, its spatial applicability, or the jurisdiction under which it is relevant.
RDF triples | 378,130 |
entity descriptions | 31,868 |
avg. attribute-value pairs per description | 11.87 |
link | dct:subject and dct:coverage |
links | 23,763 (dct:subject), 7,833 (dct:coverage) |
Contains data from the WWW2012 conference, linked to DBpedia with the foaf:based_near property, which associates an entity to an abstract notion of location.
RDF triples | 11,772 |
entity descriptions | 1,547 |
avg. attribute-value pairs per description | 7.61 |
link | foaf:based_near |
links | 1,562 |
We would like to acknowledge our many collaborators who have influenced our thoughts and our understanding of this research area over the years, and the following projects for their support in our research efforts: EU FP7-ICT-2011-9 DIACHRON (Managing the Evolution and Preservation of the Data Web), EU FP7-PEOPLE-2013-IRSES SemData (Semantic Data Management), EU FP7-ICT-318552 IdeaGarden (An Interactive Learning Environment Fostering Creativity), and LoDGoV (Generate, Manage, Preserve, Share, and Protect Resources in the Web of Data) of the Research Programme ARISTEIA (EXCELLENCE), GSRT, Ministry of Education, Greece, and the European Regional Development Fund. Finally, we would like to thank the ~okeanos GRNET cloud service.
October 2016: Our paper "Benchmarking Blocking Algorithms for Web Entities" was accepted at the IEEE Transactions on Big Data journal.
March 2016: Our paper "Minoan ER: Progressive Entity Resolution in the Web of Data" was presented at EDBT 2016 @ Bordeaux, France (March 15-18, 2016).
September 2015: Our papers "Big Data Entity Resolution: From Highly to Somehow Similar Entity Descriptions in the Web" and "Parallel Meta-blocking: Realizing Scalable Entity Resolution over Large, Heterogeneous Data" were accepted at IEEE Big Data Conference @ Santa Clara, CA (Oct 29 - Nov 1, 2015).
August 2015: Our book "Entity Resolution in the Web of Data" was published by Morgan & Claypool. You can find it here.
July 2015: Our poster "WebER: Resolving Entities in the Web" was accepted at the European Data Forum @ Luxembourg (Nov 16-17, 2015).