Datasets

Our study relies on real data from the Billion Triples Challenge 2012 dataset (BTC12), DBpedia, Kasabi and the Linked Archives Hub project. To capture the differences in the heterogeneity and semantic relationships of descriptions, we distinguish between data originating from sources in the center and in the periphery of the LOD cloud. In general, central sources, such as DBpedia and Freebase, are derived from a common source, Wikipedia, from which they extract information regarding an entity. Such descriptions often refer to the original wiki page and feature synonymous attributes whose values share a significant number of common tokens. Since they have been exhaustively studied in the literature, descriptions across central LOD sources are heavily interlinked, mostly through owl:sameAs links.

All the datasets are stored in RDF, in the N-Triples format, with triples containing a blank node and triples present in the ground truth removed. Moreover, from every dataset except BTC12DBpedia, we kept only the entity descriptions whose linked description in BTC12DBpedia is known, and removed the rest. This way, we know that any suggested comparison between a pair of descriptions outside the ground truth is false.

Next, we provide statistics about these datasets: the number of contained triples, descriptions and attributes, and the average number of attribute-value pairs per description. We have also included the number of entity types, taken as the distinct values of the property rdf:type, when provided. Observe that BTC12DBpedia contains more types than attributes. This is because DBpedia entities may have multiple types from taxonomic ontologies like Yago. IMDB is the dataset with the highest number of attribute-value pairs per description.

To save space, each subject of an entity description has been replaced with a numeric id that is positive for BTC12DBpedia and negative for the other datasets. Each dataset link contains the dataset with numeric ids, as well as the file mapping numeric ids to URIs.

Finally, we have included for each dataset the number of duplicate descriptions based on our ground truth, i.e., descriptions that have been reported to be equivalent (via owl:sameAs links) across all datasets of our testbed. Taking into account the transitivity of equality, those descriptions should be regarded as matches, too.
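The transitivity of equality mentioned above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not the code used to build our ground truth: the function names same_as_clusters and implied_matches and the sample ids are hypothetical, and we assume the owl:sameAs links are available as pairs of numeric ids (positive for BTC12DBpedia, negative for the other datasets). The sketch groups linked ids into equivalence clusters with a union-find structure and then lists every pair implied within a cluster.

```python
from collections import defaultdict


def same_as_clusters(links):
    """Group ids into equivalence clusters from pairwise owl:sameAs links."""
    parent = {}

    def find(x):
        # Return the representative of x, shortening the path on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two clusters

    clusters = defaultdict(set)
    for x in list(parent):
        clusters[find(x)].add(x)
    # Every id seen here took part in at least one link.
    return list(clusters.values())


def implied_matches(cluster):
    """All unordered pairs inside one cluster, i.e. matches by transitivity."""
    members = sorted(cluster)
    return {(a, b) for i, a in enumerate(members) for b in members[i + 1:]}


# Hypothetical ids: -1 and -2 are both linked to BTC12DBpedia description 10.
links = [(-1, 10), (-2, 10), (-3, 11)]
for cluster in same_as_clusters(links):
    print(sorted(cluster), sorted(implied_matches(cluster)))
```

In this hypothetical input, descriptions -1 and -2 are never linked directly, yet they end up in the same cluster because both are linked to description 10, so the pair (-2, -1) is reported as a match as well.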

Copyright notice: The copyright of the datasets belongs to the data providers, which can be found in the links provided in our TBD paper. Those datasets have been processed as described in the same publication, using the source code that can be found in the following GitHub repository. Please review the corresponding copyright notices before using the data. If you use those datasets, please cite our TBD paper as:
Vasilis Efthymiou, Kostas Stefanidis, Vassilis Christophides: Benchmarking Blocking Algorithms for Web Entities. IEEE Trans. Big Data (to appear) (2017)

BTC12DBpedia

Contains descriptions of cross-domain entities, extracted from Wikipedia. This dataset was downloaded from BTC12.

RDF triples 102,306,242
entity descriptions 8,945,920
avg. attribute-value pairs per description 11.44
attributes 36,354
entity types 258,202
attributes/entity types 0.14
duplicates 0

Download BTC12DBpedia
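The statistics listed for each dataset can, in principle, be recomputed from its N-Triples file. The following Python sketch is one hedged way to do so and is not the code that produced the numbers on this page. It assumes one triple per line, counts every distinct predicate (including rdf:type) as an attribute, and takes the average number of attribute-value pairs as triples divided by descriptions, which is consistent with the reported figures (for BTC12DBpedia, 102,306,242 / 8,945,920 ≈ 11.44). The function name dataset_statistics and the sample lines are hypothetical.

```python
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def dataset_statistics(lines):
    """Compute per-dataset statistics from an iterable of N-Triples lines."""
    triples = 0
    subjects, attributes, types = set(), set(), set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Subject and predicate contain no whitespace; the object may.
        subject, predicate, rest = line.split(None, 2)
        obj = rest.rstrip()
        if obj.endswith("."):
            obj = obj[:-1].rstrip()  # drop the terminating " ."
        triples += 1
        subjects.add(subject)
        attributes.add(predicate)
        if predicate == RDF_TYPE:
            types.add(obj)
    descriptions = len(subjects)
    return {
        "triples": triples,
        "descriptions": descriptions,
        "attributes": len(attributes),
        "entity_types": len(types),
        "avg_pairs": triples / descriptions if descriptions else 0.0,
    }


# Hypothetical sample; subjects are numeric ids, as in the published files.
sample = [
    '1 <http://example.org/name> "Alice" .',
    '1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Person> .',
    '2 <http://example.org/name> "Bob" .',
]
print(dataset_statistics(sample))
```

For a real file, the same function could be called on an open file handle, which keeps memory use bounded by the number of distinct subjects, predicates and types rather than by the number of triples.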

Infoboxes

Contains descriptions from the raw infoboxes of DBpedia 3.5.

RDF triples 27,011,880
entity descriptions 1,638,149
avg. attribute-value pairs per description 16.49
attributes 31,857
entity types 5,535
attributes/entity types 5.76
duplicates 0

Download Infoboxes

BTC12Rest

Originates from the BTC12 dataset and consists of multiple data sources, such as DBLP, GeoNames and DrugBank.

RDF triples 849,656
entity descriptions 31,668
avg. attribute-value pairs per description 26.83
attributes 518
entity types 33
attributes/entity types 15.7
duplicates 863

Download BTC12Rest

BTC12Freebase

Contains descriptions of cross-domain entities. This dataset was downloaded from BTC12.

RDF triples 25,050,970
entity descriptions 1,849,180
avg. attribute-value pairs per description 13.55
attributes 8,323
entity types 8,232
attributes/entity types 1.01
duplicates 12,058

Download BTC12Freebase

BBCmusic

Originates from Kasabi and contains descriptions regarding music bands and artists, extracted from MusicBrainz and Wikipedia.

RDF triples 268,759
entity descriptions 25,359
avg. attribute-value pairs per description 10.60
attributes 29
entity types 4
attributes/entity types 7.25
duplicates 372

Download BBCmusic

LOCAH

For LOCAH, we used the latest published version at Archives Hub (March 2014). This rather small dataset links descriptions of people from UK archival institutions with their descriptions in DBpedia.

RDF triples 12,932
entity descriptions 1,233
avg. attribute-value pairs per description 10.49
attributes 14
entity types 4
attributes/entity types 3.5
duplicates 250

Download LOCAH

DBpedia_mov

Contains descriptions of movies from DBpedia. This dataset was used in this work.

RDF triples 180,680
entity descriptions 27,615
avg. attribute-value pairs per description 6.54
attributes 5
entity types 1
attributes/entity types 5
duplicates 0

Link to DBpedia_mov source

IMDB

Contains descriptions of movies from IMDB. This dataset was used in this work.

RDF triples 816,012
entity descriptions 23,182
avg. attribute-value pairs per description 35.20
attributes 7
entity types 1
attributes/entity types 7
duplicates 0

Link to IMDB source


To investigate the ability of blocking algorithms to recognize relatedness links beyond owl:sameAs among descriptions, we considered the following Kasabi datasets, which are linked to DBpedia, the dataset with the highest number of references.

Airports

Contains airport data, linked to BTC12DBpedia with the umbel:isLike property. This property is used to associate entities that may or may not be equivalent, but are believed to be so.

RDF triples 238,973
entity descriptions 12,294
avg. attribute-value pairs per description 19.44
link umbel:isLike
links 12,269

Download Airports

Airlines

Contains airlines data, linked to BTC12DBpedia with the umbel:isLike property. This property is used to associate entities that may or may not be equivalent, but are believed to be so.

RDF triples 15,465
entity descriptions 1,141
avg. attribute-value pairs per description 13.55
link umbel:isLike
links 1,217

Download Airlines

Twitter

Contains data for the presentations of an ESWC conference. It is linked to DBpedia with the dct:subject property, which captures relatedness of entities to topics.

RDF triples 6,743
entity descriptions 2,932
avg. attribute-value pairs per description 2.30
link dct:subject
links 20,671

Download Twitter

Books

Describes books listed in the English language section of Dutch printed book auction catalogues of collections of scholars and religious ministers from the 17th century.

RDF triples 2,993
entity descriptions 748
avg. attribute-value pairs per description 4.00
link dct:subject
links 1,605

Download Books

IATI

Contains data from the International Aid Transparency Initiative. IATI is also connected to DBpedia with the dct:coverage property, which associates an entity to its spatial or temporal topic, its spatial applicability, or the jurisdiction under which it is relevant.

RDF triples 378,130
entity descriptions 31,868
avg. attribute-value pairs per description 11.87
link dct:subject and dct:coverage
links 23,763 (dct:subject), 7,833 (dct:coverage)

Download IATI

WWW2012

Contains data from the WWW2012 conference, linked to DBpedia with the foaf:based_near property, which associates an entity to an abstract notion of location.

RDF triples 11,772
entity descriptions 1,547
avg. attribute-value pairs per description 7.61
link foaf:based_near
links 1,562

Download WWW2012
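The relatedness links of the datasets above are ordinary triples whose predicate is the link property. As a rough Python sketch, link counts per property could be obtained as follows. The details are assumptions: the prefixes are expanded to what we believe are their usual namespaces, the function name count_links is hypothetical, and we assume link targets are DBpedia resource URIs, which may not hold for the published files if targets were also replaced by numeric ids.

```python
from collections import Counter

# Assumed namespace expansions for the prefixed names used on this page.
LINK_PROPERTIES = {
    "umbel:isLike": "<http://umbel.org/umbel#isLike>",
    "dct:subject": "<http://purl.org/dc/terms/subject>",
    "dct:coverage": "<http://purl.org/dc/terms/coverage>",
    "foaf:based_near": "<http://xmlns.com/foaf/0.1/based_near>",
}


def count_links(lines, target_prefix="<http://dbpedia.org/resource/"):
    """Count triples per link property whose object starts with target_prefix."""
    by_uri = {uri: name for name, uri in LINK_PROPERTIES.items()}
    counts = Counter()
    for line in lines:
        parts = line.strip().split(None, 2)
        if len(parts) < 3:
            continue  # blank or malformed line
        _, predicate, rest = parts
        name = by_uri.get(predicate)
        if name is not None and rest.startswith(target_prefix):
            counts[name] += 1
    return counts


# Hypothetical sample lines.
sample = [
    '-1 <http://umbel.org/umbel#isLike> <http://dbpedia.org/resource/Heathrow_Airport> .',
    '-1 <http://example.org/name> "Heathrow" .',
    '-2 <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Semantic_Web> .',
]
print(count_links(sample))
```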


Ground Truths

We combine BTC12DBpedia with each of the datasets above to produce the entity collections on which we ran our experiments. For D2-D5, we consider the owl:sameAs links to/from DBpedia 3.7 (the version used in BTC12). For D1, we consider the subject URIs of Infoboxes that also appear as subjects in BTC12DBpedia. The ground truth of D6 is made of DBpedia movies connected with IMDB movies through the imdbId property. For the datasets that are not linked to BTC12DBpedia through an owl:sameAs relation, we proceeded as for D2-D5, using the types of links that they provide.
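Because each collection keeps only descriptions with a known link to BTC12DBpedia, any suggested comparison outside the ground truth can be counted as false. The Python sketch below shows one way to score a set of suggested comparisons against a ground truth under that assumption. It is an illustration only: the function names and the sample ids are hypothetical, pairs are assumed to be given as numeric ids, and the measures are plain precision and recall, which are not necessarily the exact measures reported in our paper.

```python
def normalize(pair):
    """Order a pair so that (a, b) and (b, a) compare equal."""
    a, b = pair
    return (a, b) if a <= b else (b, a)


def score_comparisons(candidates, ground_truth):
    """Compare suggested pairs with the ground truth of a collection."""
    truth = {normalize(p) for p in ground_truth}
    suggested = {normalize(p) for p in candidates}
    matches = suggested & truth
    return {
        "comparisons": len(suggested),
        "matches_found": len(matches),
        # Pairs outside the ground truth are known to be false.
        "false_comparisons": len(suggested - truth),
        "recall": len(matches) / len(truth) if truth else 0.0,
        "precision": len(matches) / len(suggested) if suggested else 0.0,
    }


# Hypothetical ids: positive for BTC12DBpedia, negative for the other dataset.
ground_truth = [(10, -1), (11, -2)]
candidates = [(-1, 10), (10, -2), (11, -2)]
print(score_comparisons(candidates, ground_truth))
```

Normalizing the order of each pair avoids counting the same comparison twice when a method reports it in both directions.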



We would like to acknowledge our many collaborators who have influenced our thoughts and our understanding of this research area over the years, and the following projects for their support in our research efforts: EU FP7-ICT-2011-9 DIACHRON (Managing the Evolution and Preservation of the Data Web), EU FP7-PEOPLE-2013-IRSES SemData (Semantic Data Management), EU FP7-ICT-318552 IdeaGarden (An Interactive Learning Environment Fostering Creativity), and LoDGoV (Generate, Manage, Preserve, Share, and Protect Resources in the Web of Data) of the Research Programme ARISTEIA (EXCELLENCE), GSRT, Ministry of Education, Greece, and the European Regional Development Fund. Finally, we would like to thank the ~okeanos GRNET cloud service.

News

October 2016: Our paper "Benchmarking Blocking Algorithms for Web Entities" was accepted at the IEEE Transactions on Big Data journal.

March 2016: Our paper "Minoan ER: Progressive Entity Resolution in the Web of Data" was presented at EDBT 2016 @ Bordeaux, France (March 15-18, 2016).

September 2015: Our papers "Big Data Entity Resolution: From Highly to Somehow Similar Entity Descriptions in the Web" and "Parallel Meta-blocking: Realizing Scalable Entity Resolution over Large, Heterogeneous Data" were accepted at the IEEE Big Data Conference @ Santa Clara, CA (Oct 29 - Nov 1, 2015).

August 2015: Our book "Entity Resolution in the Web of Data" was published by Morgan & Claypool. You can find it here.

July 2015: Our poster "WebER: Resolving Entities in the Web" was accepted at the European Data Forum @ Luxembourg (Nov 16-17, 2015).