Klemens BÖHM
Karlsruhe Institute of Technology
Many scientific databases are now publicly available for querying and advanced data analytics. One prominent example is the Sloan Digital Sky Survey (SDSS) SkyServer, which offers data to astronomers, scientists, and the general public. With such a large user base, it is worthwhile to identify the areas of the data space that are of interest to many users. Doing so helps to understand the public focus and the trending research directions in the subject covered by the database, i.e., astronomy in the case of SkyServer. In a current research project, we study the problem of extracting and analyzing the access areas of user queries from the query log of the database. To our knowledge, neither the concept of access areas nor their extraction has been studied before. We address this gap in three steps. First, we propose a novel notion of access area that is independent of any specific database state. It should make it possible to detect interesting areas of the data space, regardless of whether they already exist in the database content. Second, we present a detailed mapping of our notion to different query types. Applying this mapping to the SkyServer query log yields a transformed data set. Third, we propose a new distance function to analyze this data set. Applying DBSCAN with our distance function, we arrive at access areas that are interesting from the perspective of an astronomer. These areas occupy only a small fraction (in some cases less than 1%) of the data space and are accessed by many users. Some frequently accessed areas do not even lie within the space spanned by the available objects.
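The clustering step can be illustrated with a minimal sketch. The abstract does not specify the distance function or the representation of an access area, so everything concrete here is an assumption made for illustration: each access area is modeled as an axis-aligned box over two attributes, `box_distance` is a stand-in (one minus the Jaccard overlap of two boxes) and not the distance function proposed in the project, and the sample boxes, `eps`, and `min_pts` are invented values. Only the DBSCAN procedure itself follows the standard algorithm.

```python
from typing import Callable, List, Tuple

# Hypothetical representation of an access area: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def box_distance(a: Box, b: Box) -> float:
    """Stand-in distance: 1 - Jaccard overlap of two boxes (an assumption)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(w, 0.0) * max(h, 0.0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    if union <= 0.0:  # both boxes are degenerate (zero area)
        return 0.0 if a == b else 1.0
    return 1.0 - inter / union


def dbscan(items: List[Box], eps: float, min_pts: int,
           dist: Callable[[Box, Box], float]) -> List[int]:
    """Standard DBSCAN with a pluggable distance; returns one label per item.

    Labels 0, 1, ... identify clusters; -1 marks noise.
    """
    NOISE = -1
    labels: List[object] = [None] * len(items)

    def neighbors(i: int) -> List[int]:
        # Brute-force range query; includes i itself because dist(x, x) == 0.
        return [j for j in range(len(items)) if dist(items[i], items[j]) <= eps]

    cluster = -1
    for i in range(len(items)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point: expand from it
                queue.extend(reachable)
    return [int(label) for label in labels]


# Invented sample: two groups of overlapping query areas and one isolated query.
boxes: List[Box] = [
    (0, 0, 10, 10), (1, 1, 11, 11), (0, 1, 10, 11),
    (50, 50, 60, 60), (51, 50, 61, 60),
    (200, 200, 201, 201),
]
labels = dbscan(boxes, eps=0.4, min_pts=2, dist=box_distance)
print(labels)  # [0, 0, 0, 1, 1, -1]
```

In this sketch the clustering depends only on pairwise distances between query areas, never on the stored objects, which mirrors the idea that an access area can be found even where the database holds no data.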