COT 6930 – Web Mining

(Unique # 15502)

 

Course Description:

 

This course covers the techniques used to model, analyze, and understand the internet and the web, especially the web graph and hypertext data.

 

Textbook:

Modeling the Internet and the Web – Probabilistic Methods and Algorithms, P. Baldi, P. Frasconi, and P. Smyth, John Wiley & Sons, 2003

 

References:

Mining the Web: Analysis of Hypertext and Semi Structured Data, by Soumen Chakrabarti, Morgan Kaufmann, 2002

Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison-Wesley, 1999

Machine Learning, by Tom Mitchell, McGraw Hill, 1997

 

Instructor:

Dr. Shi Zhong, Assistant Professor of Computer Science and Engineering

 

Contact:

zhong@cse.fau.edu, 7-3168, S&E 366

 

Goals:

To provide a comprehensive introduction to the modeling and analysis of the internet and the web;

To introduce the probabilistic methods and algorithms underlying web search engines and web crawling;

To design data mining and machine learning algorithms for the analysis of web hypertext documents, web logs, and web link structure;

To understand and model human web browsing behavior;

To understand collaborative filtering and its e-commerce applications.

 

Time and Place:

Mondays and Wednesdays, 2 – 3:20 PM, Instructional Services Bldg.

 

Office Hours:

Mondays and Wednesdays, 3:30 – 6:30 PM

 

 

Prerequisites: STA 4821 – Stochastic Models in CS or equivalent

An understanding of basic probability, calculus, and matrix/linear algebra, as well as elementary concepts in data structures and algorithms, is required. Additional knowledge of graph theory and optimization is helpful but not required.

 

Topics:

  1. Modeling problems associated with the internet and the web
  2. Web graph structure
  3. Web content analysis, link analysis
  4. Web search, crawling techniques
  5. Web usage mining, modeling user browsing behavior
  6. E-commerce and web mining

 


Grading:

40% homework (4), 30% exam (1), 30% group project (1).

·        There will be four homework assignments, worth 10% each. Homework problems might include: programming for web page extraction and web crawling; solving mathematical questions related to probability and optimization; running and analyzing data mining experiments using publicly available software packages; and conceptual analysis and design of algorithms.

·        You may replace one of the four assignments with a 15-minute in-class presentation. The presentation can be a critical analysis of a research paper of your choice or a status report on your project. You are required to prepare slides for the presentation, which consists of a 12-minute talk and a 3-minute Q&A session. You will be graded on the clarity of your talk and your handling of questions.

·        One midterm exam will be held in class on Wednesday, October 12, 2005. The midterm will cover lectures up to and including web link analysis, that is: web technologies related to web mining, the web graph, textual analysis, and link analysis.

·        A project group must consist of two or three students enrolled in the class. Projects will be graded on a 100-point scale (counting for 30% of the final grade) with the following point distribution: 5 points for format, 5 points for the proposal, 30 points for report writing (organization, smoothness, and structure), and 60 points for report content (originality, comprehensiveness, and completeness).

 

Project requirements:

Your course project must address a well-defined problem, which can be (a) a theoretical improvement and analysis of an existing algorithm; (b) a modification and improvement of an existing algorithm, with empirical validation; (c) an application of an existing data mining algorithm to a practical problem; (d) a comparative study (empirical or theoretical) of several existing methods for a specific problem; or (e) an implementation of a web mining tool. If you are unsure whether your idea is suitable, discuss it with me before turning in your proposal. Some example projects and resources are available at:

http://polaris.cse.fau.edu/~zhong/web/projects.htm

 

A project proposal of one page (two pages maximum, single-spaced), containing a problem description, motivation, proposed work, and references, is due on Wednesday, September 28.

 

Project reports are due in the final week (Wednesday, December 7). All reports should be double-spaced, with an 11pt or 12pt font and one-inch margins on all sides. Page limit: 10 pages maximum (excluding source code). The report structure should resemble that of a technical conference or journal paper.

 

Policy on cheating:

No form of cheating will be accepted. You may discuss your homework assignments with others but may not copy programs or writing from them. In your project report, take special care to describe in detail which parts are existing work and which are your own. (It is OK to quote a few sentences or a paragraph from published resources.)

 

Reminders:

August 29, 2005 – last day to drop without consequences

September 28, 2005 – project proposals due

October 14, 2005 – last day to drop without getting an F grade

December 7, 2005 – final project report due