Data mining, or knowledge discovery in databases, has emerged in recent years as one of the most exciting fields in computer science. Data mining aims at finding useful regularities in large data sets. Interest in the field is motivated by the growth of the computerized data collections routinely kept by many organizations and commercial enterprises, and by the high potential value of the patterns discovered in those collections. For instance, bar code readers at supermarkets produce extensive amounts of data about purchases. An analysis of this data can reveal previously unknown yet useful information about the shopping behavior of customers.

Data mining refers to a set of techniques designed to efficiently find interesting pieces of information or knowledge in large amounts of data. Association rules, for instance, are a class of patterns that describe which products tend to be purchased together. There is currently considerable commercial interest in the area, both in the development of data mining software and in consulting services on data mining, with the market for the former estimated at over 5 billion U.S. dollars.
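To make the idea of an association rule concrete, here is a minimal sketch in Python. It assumes the two standard measures, support (the fraction of baskets containing an item set) and confidence (of the baskets containing the antecedent, the fraction that also contain the consequent). The toy baskets, the function names, and the thresholds are illustrative assumptions, not course material, and the brute-force pair enumeration would not scale to the large data sets discussed above.

```python
from itertools import combinations

# Hypothetical toy data: each set is one customer's basket.
BASKETS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) from the transactions."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / base


def pair_rules(transactions, min_support, min_confidence):
    """Brute-force search for single-item rules a -> c above both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for x, y in combinations(items, 2):
        s = support({x, y}, transactions)
        if s < min_support:
            continue  # the pair is too rare to be interesting
        for a, c in ((x, y), (y, x)):
            conf = confidence({a}, {c}, transactions)
            if conf >= min_confidence:
                rules.append((a, c, s, conf))
    return rules


if __name__ == "__main__":
    for a, c, s, conf in pair_rules(BASKETS, min_support=0.6, min_confidence=0.9):
        print(f"{a} -> {c}  support={s:.2f}  confidence={conf:.2f}")
```

Practical miners avoid enumerating every pair by pruning candidates whose subsets are already known to be infrequent; the sketch omits that for brevity.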

In this course, we explore how this interdisciplinary field brings together techniques from databases, statistics, machine learning, and information retrieval. We will discuss the main data mining methods currently in use, including data warehousing and data cleaning, clustering, classification, association rule mining, query flocks, text indexing and searching algorithms, the ranking of pages by search engines, and recent techniques for web mining. Designing algorithms for these tasks is difficult because the input data sets are very large and the tasks themselves can be very complex. Two of the main focuses in the field are the integration of these algorithms with relational databases and the mining of information from semi-structured data, and we will examine the additional complications that arise in these settings.



Skill Level: Beginner