A Unified Model for Preprocessing and Clustering Technique for Web Usage Mining
P. Senthil Pandian and S. Srinivasan
The World Wide Web offers a huge amount of information to internet users. It is a system for accessing hypertext documents over the internet, backed by a large repository of web pages and links. Web pages present images, text, video, and other multimedia, and users navigate among them via hyperlinks. A web log file records information about users' accesses to websites. The log file may contain noisy and ambiguous data that can affect the data mining process, so it should be preprocessed to improve data quality. User session clustering is used in web personalization to understand user activities. Clustering is widely used in data mining; here it is the process of grouping users who have similar browsing patterns. This paper presents a detailed discussion of the web log file, its format, preprocessing, and clustering. Preprocessing consists of data cleaning and data filtering, user identification, and session identification. A Hidden Data Damage (HDD) algorithm is proposed for data cleaning and data filtering. User and session clustering is carried out to obtain aggregate clustering. Two sets of log files are collected and processed to obtain experimental results.
Keywords: Web log file, Web log file format, Data Preprocessing, Clustering
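As an illustration of the preprocessing stages named in the abstract (data cleaning and filtering, user identification, session identification), a minimal Python sketch follows. It is not the proposed HDD algorithm, whose details are not given here. It assumes entries in the Common Log Format, treats each client IP address as one user, and closes a session after 30 minutes of inactivity. The names `parse_and_clean`, `build_sessions`, `IGNORED_SUFFIXES`, and the sample log entries are hypothetical and chosen only for this example.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed input: Common Log Format
# host ident authuser [timestamp] "method url protocol" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+'
)
# Assumption: embedded resources are noise for usage mining.
IGNORED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")
# Assumption: a gap longer than 30 minutes starts a new session.
SESSION_TIMEOUT = timedelta(minutes=30)


def parse_and_clean(lines):
    """Data cleaning and filtering: keep successful GET page requests only."""
    records = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # malformed entry
        if match["method"] != "GET" or not match["status"].startswith("2"):
            continue  # non-GET or unsuccessful request
        url = match["url"].split("?", 1)[0]
        if url.lower().endswith(IGNORED_SUFFIXES):
            continue  # image, stylesheet, or script request
        when = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
        records.append((match["host"], when, url))
    return records


def build_sessions(records):
    """User identification by IP address, then timeout-based sessions."""
    by_user = defaultdict(list)
    for host, when, url in sorted(records, key=lambda r: (r[0], r[1])):
        by_user[host].append((when, url))
    sessions = []
    for host, visits in by_user.items():
        current = [visits[0][1]]
        for (prev_time, _), (when, url) in zip(visits, visits[1:]):
            if when - prev_time > SESSION_TIMEOUT:
                sessions.append((host, current))
                current = []
            current.append(url)
        sessions.append((host, current))
    return sessions


# Hypothetical log entries, for illustration only.
SAMPLE_LOG = [
    '10.0.0.1 - - [10/Oct/2012:13:55:36 +0530] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.1 - - [10/Oct/2012:13:55:37 +0530] "GET /logo.gif HTTP/1.1" 200 512',
    '10.0.0.1 - - [10/Oct/2012:13:58:02 +0530] "GET /courses.html HTTP/1.1" 200 4100',
    '10.0.0.2 - - [10/Oct/2012:14:01:10 +0530] "GET /missing.html HTTP/1.1" 404 310',
    '10.0.0.1 - - [10/Oct/2012:15:10:00 +0530] "GET /contact.html HTTP/1.1" 200 1800',
]

sessions = build_sessions(parse_and_clean(SAMPLE_LOG))
for host, pages in sessions:
    print(host, pages)
```

In this sketch the image request and the failed request are discarded during cleaning, and the remaining requests from the same address are split into two sessions because more than 30 minutes separate the second and third page views. The resulting page sequences are the kind of input a session clustering step would then group by similarity.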