Integration of Deep Web Sources: A Distributed Information Retrieval Approach

In Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics (WIMS-17), pages 33:1--33:4, ISBN 978-1-4503-5225-3, DOI 10.1145/3102254.3102291, ACM, New York, USA, 2017.


Abstract:
The Deep Web consists of structured data that are available as dynamically generated pages, typically requested through HTML forms. Deep Web pages cannot be indexed by search engines, and are notoriously difficult to query and integrate because of the limited access they offer. We propose a novel framework for integrating Deep Web sources by means of a mediated schema that represents the underlying, distributed sources. Our goal is to compute answers to queries posed on the mediated schema. To this end, we propose the use of techniques from the area of Distributed Information Retrieval. We discuss a novel approach to automated sampling, size estimation and selection of Deep Web sources, as well as a technique for merging result lists.
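
To make the size-estimation step concrete, the following is a minimal sketch of the classic capture-recapture (Lincoln-Petersen) estimator, a standard way to estimate the size of a source that can only be reached through queries. It is an illustration only and is not the approach proposed in the paper. The function name `estimate_source_size` and the integer record identifiers are hypothetical, and the sketch assumes the two samples were drawn independently from the same source.

```python
def estimate_source_size(sample_a, sample_b):
    """Capture-recapture (Lincoln-Petersen) estimate of a source's size.

    sample_a, sample_b: iterables of record identifiers obtained from two
    independent sampling runs against the same source (an assumption of
    this sketch, not something stated in the abstract).
    """
    a, b = set(sample_a), set(sample_b)
    overlap = len(a & b)
    if overlap == 0:
        # Without shared records the estimator is undefined.
        raise ValueError("samples do not overlap; size cannot be estimated")
    # N is approximately |A| * |B| / |A intersect B|
    return len(a) * len(b) / overlap


# Hypothetical usage: two samples of 100 record ids sharing 50 records.
first_sample = range(0, 100)
second_sample = range(50, 150)
estimated_size = estimate_source_size(first_sample, second_sample)
```

The smaller the overlap between the two samples, the larger the estimated size, which matches the intuition that rarely re-encountering a record suggests a large underlying collection.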