COT 6930 – Web Mining
(Unique # 15502)
Course Description:
This course covers the techniques used to model, analyze, and understand the internet and the web, especially the web graph and hypertext data.
Textbook:
Modeling the Internet and the Web – Probabilistic Methods and Algorithms, by P. Baldi, P. Frasconi, and P. Smyth, John Wiley & Sons, 2003
References:
Mining the Web: Analysis of Hypertext and Semi Structured Data, by Soumen Chakrabarti, Morgan Kaufmann, 2002
Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison-Wesley, 1999
Machine Learning, by Tom Mitchell, McGraw Hill, 1997
Instructor:
Dr. Shi Zhong, Assistant Professor of Computer Science and Engineering
Contact:
zhong@cse.fau.edu, 7-3168, S&E 366
Goals:
· To provide a comprehensive introduction to the modeling and analysis of the internet and the web.
· To introduce the probabilistic methods and algorithms underlying web search engines and web crawling.
· To design data mining and machine learning algorithms for the analysis of web hypertext documents, web logs, and web link structure.
· To understand and model human web browsing behavior.
· To understand collaborative filtering and its e-commerce applications.
Time and Place:
Mondays and Wednesdays,
Office Hours:
Mondays and Wednesdays,
Prerequisites:
STA 4821 – Stochastic Models in CS or equivalent. An understanding of basic probability, calculus, and matrix/linear algebra, as well as elementary concepts in data structures and algorithms, is required. Additional knowledge of graph theory and optimization is helpful but not required.
Topics:
Grading:
40% homework (4), 30% exam (1), 30% group project (1).
· There will be four homework assignments, worth 10% each. Homework problems might include: programming for web page extraction and web crawling; solving mathematical questions related to probability and optimization; running and analyzing data mining experiments using publicly available software packages; conceptual analysis and design of algorithms.
· You may replace one of the four assignments with a 15-minute in-class presentation. The presentation can be a critical analysis of a research paper that interests you or a status report on your project. You are required to prepare slides for the presentation, which consists of a 12-minute talk and a 3-minute Q&A session. You will be graded on clarity and on your handling of questions.
· One mid-term exam will be held on
· A project group must consist of two or three students enrolled in the class. Projects will be graded on a 100-point scale (counting for 30% of the final grade) with the following point distribution: 5 points for format, 5 points for the proposal, 30 points for report writing (organization, smoothness, and structure), and 60 points for report content (originality, comprehensiveness, completeness, etc.).
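As an illustration of how the weights above combine, the following minimal Python sketch computes a final percentage from a set of scores. The function names and the sample scores are hypothetical and are not part of the course materials; the sketch assumes every component is scored on a 0 to 100 scale and that a presentation, if chosen, simply takes the place of one homework score.

```python
# Weights from the grading breakdown (percent of the final grade).
HOMEWORK_WEIGHT_EACH = 10  # four assignments, 10% each
EXAM_WEIGHT = 30
PROJECT_WEIGHT = 30

# Maximum project points per rubric item (these sum to 100).
PROJECT_RUBRIC = {"format": 5, "proposal": 5, "writing": 30, "content": 60}


def project_score(points):
    """Add up the rubric points, checking each item against its maximum."""
    for item, earned in points.items():
        if not 0 <= earned <= PROJECT_RUBRIC[item]:
            raise ValueError(f"{item} must be between 0 and {PROJECT_RUBRIC[item]}")
    return sum(points.values())


def final_grade(homework, exam, project):
    """Combine four homework scores, one exam score, and one project score."""
    if len(homework) != 4:
        raise ValueError("expected exactly four homework scores")
    weighted = (HOMEWORK_WEIGHT_EACH * sum(homework)
                + EXAM_WEIGHT * exam
                + PROJECT_WEIGHT * project)
    return weighted / 100


# Hypothetical scores, for illustration only.
project = project_score({"format": 5, "proposal": 4, "writing": 25, "content": 50})
print(project)                                        # 84
print(final_grade([90, 80, 100, 70], 85, project))    # 84.7
```

In this example the homework contributes 34 points, the exam 25.5, and the project 25.2, for a total of 84.7.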
Project requirements:
Your course project must address a well-defined problem, which can be (a) a theoretical improvement and analysis of an existing algorithm; (b) a modification and improvement of an existing algorithm, with empirical validation; (c) the use of an existing data mining algorithm to solve a practical problem; (d) a comparative study (empirical or theoretical) of several existing methods for a specific problem; or (e) an implementation of a web mining tool. If you are unsure about your idea, discuss it with me before turning in your proposal. Some example projects and resources are available at:
http://polaris.cse.fau.edu/~zhong/web/projects.htm
A one-page (two pages maximum, single-spaced) project proposal, containing the problem description, motivation, proposed work, and references, is due on Wednesday, September 28.
Project reports are due in the final week (Wednesday, December 7). All reports should be double-spaced, with an 11pt or 12pt font and one-inch margins on all sides. Page limit: 10 pages maximum (excluding source code). The report structure should resemble that of a technical conference or journal paper.
Policy on cheating:
No form of cheating will be tolerated. You may discuss your homework assignments with others, but you may not copy programs or writing from them. In your project report, pay special attention to describing in detail what is existing work and what is your own work. (It is acceptable to quote a few sentences or a paragraph from published sources.)
Reminder: