Datasets

Our study relies on real data from the Billion Triples Challenge 2012 dataset (BTC12), DBpedia, Kasabi and the Linked Archives Hub project. To capture the differences in heterogeneity and in the semantic relationships among descriptions, we distinguish between data originating from sources in the center and in the periphery of the LOD cloud. In general, central sources, such as DBpedia and Freebase, derive from a common source, Wikipedia, from which they extract information about an entity. Such descriptions often refer to the original wiki page and feature synonymous attributes whose values share a significant number of common tokens. Since they have been exhaustively studied in the literature, descriptions across central LOD sources are heavily interlinked, mostly through owl:sameAs links.

All datasets are stored as RDF in the N-Triples format. Triples containing a blank node and triples present in the ground truth have been removed. Moreover, from every dataset other than BTC12DBpedia, we kept only the entity descriptions whose linked description in BTC12DBpedia is known, and removed the rest. This way, we know that any suggested comparison between a pair of descriptions outside the ground truth is false. Next, we provide statistics about these datasets: the number of contained triples, descriptions and attributes, and the average number of attribute-value pairs per description. We have also included the number of entity types, taken as the distinct values of the property rdf:type, when provided. Observe that BTC12DBpedia contains more types than attributes. This is because DBpedia entities may have multiple types from taxonomic ontologies like Yago. IMDB is the dataset with the highest number of attribute-value pairs per description. To save space, each subject of an entity description has been replaced with a numeric id that is positive for BTC12DBpedia and negative for the other datasets. Each dataset link contains the dataset with numeric ids, as well as the file mapping numeric ids to URIs. Finally, we report for each dataset the number of duplicate descriptions based on our ground truth, i.e., descriptions that have been reported to be equivalent (via owl:sameAs links) across all datasets of our testbed. Taking into account the transitivity of equality, those descriptions should be regarded as matches, too.
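The transitivity argument can be illustrated with a small union-find sketch. This is not the code used to build the ground truth; the function names and the example ids are hypothetical, and links are assumed to be available as pairs of numeric ids.

```python
from collections import defaultdict


def equivalence_clusters(same_as_links):
    """Group ids connected by sameAs links, directly or transitively."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in same_as_links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    # Singletons carry no match information, so drop them.
    return [members for members in clusters.values() if len(members) > 1]


def implied_matches(same_as_links):
    """Return every unordered pair of ids that falls in the same cluster."""
    pairs = set()
    for members in equivalence_clusters(same_as_links):
        ordered = sorted(members)
        for i, left in enumerate(ordered):
            for right in ordered[i + 1:]:
                pairs.add((left, right))
    return pairs


# Hypothetical ids: 7 is a BTC12DBpedia description (positive id),
# -1 and -2 are descriptions from another dataset (negative ids).
links = [(7, -1), (7, -2)]
matches = implied_matches(links)
```

In this example, -1 and -2 are both linked to 7, so the pair (-2, -1) is also regarded as a match even though no explicit link connects them. That is the kind of pair counted as a duplicate within a dataset.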

Copyright notice: The copyright of the datasets belongs to the data providers, which can be found in the links provided in our TBD paper. Those datasets have been processed as described in the same publication, with the source code that can be found in the following GitHub repository. Please review the corresponding copyright notices before using the data. If you use those datasets, please cite our TBD paper as:
Vasilis Efthymiou, Kostas Stefanidis, Vassilis Christophides: Benchmarking Blocking Algorithms for Web Entities. IEEE Trans. Big Data (to appear) (2017)

BTC12DBpedia

Contains descriptions of cross-domain entities, extracted from Wikipedia. This dataset was downloaded from BTC12.

RDF triples	102,306,242
entity descriptions	8,945,920
avg. attribute-value pairs per description	11.44
attributes	36,354
entity types	258,202
attributes/entity types	0.14
duplicates	0

Download BTC12DBpedia
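The statistics listed for each dataset can be recomputed from a downloaded file with a sketch like the following. It makes several assumptions that are not confirmed here: one triple per line, a subject token without spaces (the numeric id), every distinct predicate counted as an attribute (including rdf:type), and a deliberately naive line splitter rather than a full N-Triples parser. The average is taken as RDF triples divided by entity descriptions, which is consistent with the table above (102,306,242 / 8,945,920 ≈ 11.44).

```python
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def parse_triple(line):
    """Naively split one N-Triples line into (subject, predicate, object)."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    subject, predicate, rest = line.split(None, 2)
    obj = rest.rstrip()
    if obj.endswith("."):
        # Drop the statement terminator.
        obj = obj[:-1].rstrip()
    return subject, predicate, obj


def dataset_statistics(lines):
    """Count triples, descriptions, attributes and entity types."""
    triples = 0
    subjects, attributes, types = set(), set(), set()
    for line in lines:
        parsed = parse_triple(line)
        if parsed is None:
            continue
        subject, predicate, obj = parsed
        triples += 1
        subjects.add(subject)
        attributes.add(predicate)
        if predicate == RDF_TYPE:
            types.add(obj)
    return {
        "triples": triples,
        "descriptions": len(subjects),
        "attributes": len(attributes),
        "entity_types": len(types),
        "avg_pairs": triples / len(subjects) if subjects else 0.0,
    }


# Tiny made-up sample with numeric ids as subjects.
sample = [
    '1 <http://example.org/name> "Alice" .',
    "1 " + RDF_TYPE + " <http://example.org/Person> .",
    '2 <http://example.org/name> "Bob" .',
    '2 <http://example.org/age> "42" .',
]
stats = dataset_statistics(sample)
```

Because `dataset_statistics` accepts any iterable of lines, an open file handle can be passed directly, so even the larger datasets can be scanned without loading them into memory.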

Infoboxes

Contains descriptions from the raw infoboxes of DBpedia 3.5.

RDF triples	27,011,880
entity descriptions	1,638,149
avg. attribute-value pairs per description	16.49
attributes	31,857
entity types	5,535
attributes/entity types	5.76
duplicates	0

Download Infoboxes

BTC12Rest

Originates from the BTC12 dataset, and consists of multiple data sources, like DBLP, geonames and drugbank.

RDF triples	849,656
entity descriptions	31,668
avg. attribute-value pairs per description	26.83
attributes	518
entity types	33
attributes/entity types	15.7
duplicates	863

Download BTC12Rest

BTC12Freebase

Contains descriptions of cross-domain entities. This dataset was downloaded from BTC12.

RDF triples	25,050,970
entity descriptions	1,849,180
avg. attribute-value pairs per description	13.55
attributes	8,323
entity types	8,232
attributes/entity types	1.01
duplicates	12,058

Download BTC12Freebase

BBCmusic

Originates from Kasabi and contains descriptions regarding music bands and artists, extracted from MusicBrainz and Wikipedia.

RDF triples	268,759
entity descriptions	25,359
avg. attribute-value pairs per description	10.60
attributes	29
entity types	4
attributes/entity types	7.25
duplicates	372

Download BBCmusic

LOCAH

For LOCAH, we used the latest published version at Archives Hub (March 2014). This rather small dataset links descriptions of people from UK archival institutions with their descriptions in DBpedia.

RDF triples	12,932
entity descriptions	1,233
avg. attribute-value pairs per description	10.49
attributes	14
entity types	4
attributes/entity types	3.5
duplicates	250

Download LOCAH

DBpedia_mov

Contains descriptions of movies from DBpedia. This dataset was used in this work.

RDF triples	180,680
entity descriptions	27,615
avg. attribute-value pairs per description	6.54
attributes	5
entity types	1
attributes/entity types	5
duplicates	0

Link to DBpedia_mov source

IMDB

Contains descriptions of movies from IMDB. This dataset was used in this work.

RDF triples	816,012
entity descriptions	23,182
avg. attribute-value pairs per description	35.20
attributes	7
entity types	1
attributes/entity types	7
duplicates	0

Link to IMDB source

To investigate the ability of blocking algorithms to recognize relatedness links beyond owl:sameAs among descriptions, we considered the following Kasabi datasets, linked to DBpedia, the dataset with the highest number of references.

Airports

Contains airport data, linked to BTC12DBpedia with the umbel:isLike property. This property is used to associate entities that may or may not be equivalent, but are believed to be so.

RDF triples	238,973
entity descriptions	12,294
avg. attribute-value pairs per description	19.44
link	umbel:isLike
links	12,269

Download Airports

Airlines

Contains airlines data, linked to BTC12DBpedia with the umbel:isLike property. This property is used to associate entities that may or may not be equivalent, but are believed to be so.

RDF triples	15,465
entity descriptions	1,141
avg. attribute-value pairs per description	13.55
link	umbel:isLike
links	1,217

Download Airlines

Twitter

Contains data for the presentations of an ESWC conference. It is linked to DBpedia with the dct:subject property, which captures relatedness of entities to topics.

RDF triples	6,743
entity descriptions	2,932
avg. attribute-value pairs per description	2.30
link	dct:subject
links	20,671

Download Twitter

Books

Describes books listed in the English language section of Dutch printed book auction catalogues of collections of scholars and religious ministers from the 17th century.

RDF triples	2,993
entity descriptions	748
avg. attribute-value pairs per description	4.00
link	dct:subject
links	1,605

Download Books

IATI

Contains data from the International Aid Transparency Initiative. In addition to dct:subject, IATI is connected to DBpedia with the dct:coverage property, which associates an entity to its spatial or temporal topic, its spatial applicability, or the jurisdiction under which it is relevant.

RDF triples	378,130
entity descriptions	31,868
avg. attribute-value pairs per description	11.87
link	dct:subject AND dct:coverage
links	23,763 (dct:subject), 7,833 (dct:coverage)

Download IATI

WWW2012

Contains data from the WWW2012 conference, linked to DBpedia with the foaf:based_near property, which associates an entity to an abstract notion of location.

RDF triples	11,772
entity descriptions	1,547
avg. attribute-value pairs per description	7.61
link	foaf:based_near
links	1,562

Download WWW2012

Ground Truths

We combine BTC12DBpedia with each of the datasets above to produce the entity collections on which we ran our experiments. For D2-D5, we consider the owl:sameAs links to/from DBpedia 3.7 (the version used in BTC12). For D1, we consider the subject URIs of Infoboxes that also appear as subjects in BTC12DBpedia. The ground truth of D6 is made of DBpedia movies connected with IMDB movies through the imdbId property. For the datasets that are not linked to BTC12DBpedia through an owl:sameAs relation, we proceeded as for D2-D5, using the types of links that each dataset provides.

D1 combines BTC12DBpedia with Infoboxes. Since it contains two versions of the same dataset, it is considered a homogeneous collection. This is the biggest collection in terms of triples, as well as attributes. [D1 ground truth]
D2 combines BTC12DBpedia with BTC12Rest. Since it is constructed from many different datasets, it is the most heterogeneous collection. Note that BTC12Rest has the highest number of attributes per entity type. [D2 ground truth]
D3 combines BTC12DBpedia with BTC12Freebase. It is the biggest collection in terms of entity descriptions, matches, entity types and comparisons. [D3 ground truth]
D4 combines BTC12DBpedia with BBCmusic. It has the lowest number of attribute-value pairs per description. [D4 ground truth]
D5 combines BTC12DBpedia with LOCAH, the smallest dataset, both in terms of triples and entity descriptions. [D5 ground truth]
D6 combines DBpedia movies and IMDB, as originally used in this work. It is the most homogeneous collection: it only contains descriptions of movies (i.e., a single entity type) and uses the smallest number of attributes among all collections. However, its ratio of matches to non-matches is significantly greater than in the other collections (by up to six orders of magnitude), which is not typical of the collections found in the Web of data. We only used this collection to verify the validity of our source code. [Link to D6 ground truth]
[airports ground truth]
[airlines ground truth]
[twitter ground truth]
[books ground truth]
[IATI subject ground truth], [IATI coverage ground truth]
[WWW2012 ground truth]
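Since every description kept from the non-DBpedia datasets has a known link, any suggested comparison that is absent from the ground truth can be counted as a non-match. The sketch below illustrates that bookkeeping. The layout of the ground-truth files is not described here, so representing both the ground truth and the suggested comparisons as pairs of numeric ids is an assumption, and the function names are hypothetical.

```python
def normalize(pair):
    """Order a pair so that (a, b) and (b, a) compare as equal."""
    a, b = pair
    return (a, b) if a <= b else (b, a)


def label_comparisons(suggested, ground_truth):
    """Count suggested comparisons that are matches, non-matches, or missed."""
    truth = {normalize(p) for p in ground_truth}
    candidates = {normalize(p) for p in suggested}
    return {
        "matches_found": len(candidates & truth),
        "non_matches": len(candidates - truth),
        "matches_missed": len(truth - candidates),
    }


# Made-up ids: positive for BTC12DBpedia, negative for the other dataset.
ground_truth = [(1, -1), (2, -2)]
suggested = [(-1, 1), (1, -2), (2, -2)]
counts = label_comparisons(suggested, ground_truth)
```

Here two of the three suggested comparisons appear in the ground truth, one does not, and no known match is missed.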

We would like to acknowledge our many collaborators who have influenced our thoughts and our understanding of this research area over the years, and the following projects for their support in our research efforts: EU FP7-ICT-2011-9 DIACHRON (Managing the Evolution and Preservation of the Data Web), EU FP7-PEOPLE-2013-IRSES SemData (Semantic Data Management), EU FP7-ICT-318552 IdeaGarden (An Interactive Learning Environment Fostering Creativity), and LoDGoV (Generate, Manage, Preserve, Share, and Protect Resources in the Web of Data) of the Research Programme ARISTEIA (EXCELLENCE), GSRT, Ministry of Education, Greece, and the European Regional Development Fund. Finally, we would like to thank the ~okeanos GRNET cloud service.
