This article gives an overview of recent efforts to integrate heterogeneous data using Knowledge Graphs. I introduce a five-step pipeline for integrating semi-structured or unstructured content, discuss some of its key applications through three use cases, and present the lessons learnt while designing and building data integration systems.