Instructor: ChengXiang Zhai
Full-Time Teaching Assistants: Qihao Shao (Head TA), Bingjie Jiang
Part-Time Teaching Assistants: Chase Geigle, Eddie Huang, Dominic Seyler, Sheng Wang
Time & Place: 11:00am--12:15pm Tuesdays & Thursdays, 1404 Siebel Center
Note: This page provides basic information to help students decide whether they would be interested in taking the course. More up-to-date information about the course is available on the Course Piazza Forum.
This course covers general computational techniques for managing and analyzing large amounts of text data, so that users can make use of such data in all kinds of applications. Text data include all data in the form of natural language text (e.g., English text or Chinese text): web pages, social media data such as tweets, news, scientific literature, emails, government documents, and many other kinds of enterprise data. Text data play an essential role in our lives. Since we communicate using natural languages, we produce and consume a large amount of text data (i.e., "big text data") every day on all kinds of topics. The explosive growth of text data makes it very difficult, or impossible, for people to consume all the relevant text data in a timely manner. Thus, there is an urgent need to develop intelligent text information systems that help people access, digest, and make use of all the relevant information they need quickly and accurately at any time.
Logically, to harness big text data, we first need to identify the text data relevant to a particular application problem (i.e., perform information retrieval) and then analyze the identified relevant text data in more depth to extract any knowledge needed for a task (i.e., perform text analysis/mining). The first step is usually supported by a search engine, while the second step is supported by various text analysis tools.
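The two steps described above can be illustrated with a minimal Python sketch. It is a toy example under stated assumptions, not the course's toolkit or any particular search engine: the function names (tokenize, retrieve, analyze), the term-overlap scoring, and the word-count "analysis" are all hypothetical stand-ins chosen for brevity.

```python
from collections import Counter

PUNCT = ".,;:!?\"'()"


def tokenize(text):
    """Lowercase the text, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]


def retrieve(query, documents, k=2):
    """Step 1 (information retrieval): return up to k documents that share
    at least one term with the query, ranked by the number of shared terms.
    Term overlap is an illustrative scoring rule, not a real retrieval model."""
    query_terms = set(tokenize(query))
    scored = []
    for position, doc in enumerate(documents):
        overlap = len(query_terms & set(tokenize(doc)))
        if overlap > 0:
            scored.append((overlap, position))
    # Highest overlap first; ties keep the original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [documents[position] for _, position in scored[:k]]


def analyze(documents, n=3):
    """Step 2 (text analysis): report the n most frequent words in the
    retrieved documents, as a stand-in for deeper text mining."""
    counts = Counter(word for doc in documents for word in tokenize(doc))
    return counts.most_common(n)


if __name__ == "__main__":
    docs = [
        "Search engines rank web pages.",
        "Cats sleep all day.",
        "Web search engines index pages.",
    ]
    hits = retrieve("web search", docs)
    print(hits)
    print(analyze(hits, 4))
```

The point of the sketch is the division of labor: retrieval narrows a large collection down to the relevant subset, and analysis then runs only on that subset.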
In this course, you will learn the underlying technologies of both search engines and text analysis tools. You will learn the basic concepts, principles, and major algorithms for managing, analyzing, and mining text data, and you will gain hands-on experience using information retrieval and text mining toolkits to experiment with algorithms and develop your own text information system applications. You will also have an opportunity to work on a course project on a topic of your choice related to the course materials. The course emphasizes basic principles and practically useful algorithms, especially general and robust algorithms that can be applied to any natural language text data. Topics to be covered include, among others, text analysis, information retrieval models, recommender systems, text categorization and clustering, topic mining and analysis, search engine evaluation, search engine design and implementation, and applications in Web search and mining.
Leveraging the lecture videos of two MOOCs on Coursera (i.e., Text Retrieval and Search Engines and Text Mining and Analytics), the course will be offered with a "blended classroom model." The class meetings will not be used for the instructor to deliver lectures, but instead to help students digest the content they learn by watching lecture videos before a class meeting. The class meetings will also be used to help students finish assignments and course projects and for other interactive activities that facilitate learning. There will be weekly quizzes given at class meetings and several assignments that involve a small amount of programming and experimentation with data sets. Grading is based on the quizzes, assignments, and a course project. Those who registered for the course for 4 credit hours are required to finish a literature survey on a frontier topic.
ChengXiang Zhai, Sean Massung, Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, ACM and Morgan & Claypool Publishers, 2016. (click here to read the book online)
Students should come with good programming skills. CS225 or CS400 or an equivalent course is required. Knowledge of basic probability and statistics is a plus. If you are not sure whether you have the right background, please contact the instructor.
The course is lecture-based, but all the lectures are delivered via online videos. The lecture videos are available through two MOOCs on Coursera: 1) Text Retrieval and Search Engines: https://www.coursera.org/learn/text-retrieval 2) Text Mining and Analytics: https://www.coursera.org/learn/text-mining. For convenience, the lecture videos are also all available via Compass.
The class will meet only once each week, with the other class meeting slot used by the students to watch videos. Specifically, each week the students watch lecture videos in the middle of the week and submit a brief summary, along with questions or topics that they want to discuss, before the first class meeting of the subsequent week. At that meeting, the instructor will answer questions that haven't been answered on Piazza, review any difficult topics suggested by the students, or help students in other ways, such as helping them complete assignments or projects. Every week, at the first class meeting, there will be a short quiz on the material watched by the students nearly two weeks before (e.g., the quiz given in Week X would cover content watched by students in Week X-2). Once the material from a week has been covered in a quiz, it will not be covered again in any later quiz. There will be no exam.
The quizzes ensure that the students have a good mastery of all the essential concepts, principles, and algorithms. There are individual assignments (and possibly group assignments as appropriate), which often involve using a software toolkit to implement an algorithm and/or experiment with real text data. The assignments ensure that the students acquire practical skills in using existing toolkits to run experiments and build application tools. There will also be a course project, on which students can work in teams. The project ensures that the students have an opportunity to synthesize multiple pieces of knowledge learned in the course and apply the learned knowledge and skills to solve an interesting real-world problem. Students taking the course for 4 credit hours also need to finish a literature review.
A+: [95, 100]
A: [90, 94]
A-: [85, 89]
B+: [80, 84]
B: [75, 79]
B-: [70, 74]
C: [60, 69]
D: [55, 59]
F: < 55

Students are strongly encouraged to help each other by actively answering each other's questions on Piazza. The most active contributors on Piazza will receive up to 5 points of extra credit, which could help move your grade up by one bracket.
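As a rough illustration of how the scale and the extra credit interact, here is a small Python sketch. It rests on several assumptions that the page does not state: each bracket is treated as "at or above its lower bound" (so a fractional score such as 94.5 falls in the A bracket), the extra credit is simply added to the course score, and the total is capped at 100. The function name letter_grade and its parameters are hypothetical, and the actual grading policy is the instructor's to decide.

```python
# Lower bound of each bracket, taken from the grading scale above.
# Assumption: a score belongs to the highest bracket whose lower bound it meets.
GRADE_FLOORS = [
    (95, "A+"),
    (90, "A"),
    (85, "A-"),
    (80, "B+"),
    (75, "B"),
    (70, "B-"),
    (60, "C"),
    (55, "D"),
]

MAX_PIAZZA_BONUS = 5.0  # "up to 5 points extra credit"


def letter_grade(score, piazza_bonus=0.0):
    """Map a 0-100 course score plus optional Piazza extra credit to a letter.

    Assumptions (not stated on the course page): the bonus is additive,
    is clamped to the range [0, 5], and the total cannot exceed 100.
    """
    bonus = max(0.0, min(piazza_bonus, MAX_PIAZZA_BONUS))
    total = min(score + bonus, 100.0)
    for floor, letter in GRADE_FLOORS:
        if total >= floor:
            return letter
    return "F"


if __name__ == "__main__":
    print(letter_grade(83))       # score alone
    print(letter_grade(83, 5))    # same score with the full extra credit
```

Because most brackets are 5 points wide, the full 5 points of extra credit moves a score up exactly one bracket under these assumptions, except in the 10-point C bracket, where it may not.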