Instructor: ChengXiang Zhai
(Email: czhai AT illinois DOT edu; Phone: (217) 244-4943)
Teaching Assistants:
Note: This page provides basic information to help students decide whether they would be interested in taking the course. More up-to-date information is available on the course's Campuswire and Moodle.
Text data include all kinds of natural language text such as web pages, news articles, scientific literature, emails, enterprise documents, and social media posts. In contrast to non-textual data, which are usually generated by physical devices, text data are generated by humans and meant to be consumed by humans. Due to the rapid growth of text data, we can no longer digest all the relevant information in a timely manner. Thus, there is a pressing need to develop intelligent software tools that help people manage and make use of vast amounts of text data (“big text data”) for various tasks, especially those involving complex decision-making.
Logically, to harness big text data, we first need to identify the text data relevant to a particular application problem (i.e., perform text data retrieval) and then analyze the retrieved data in more depth to extract the knowledge needed for a task (i.e., perform text data analysis or question answering). The most effective approaches to text retrieval and text analysis are often based on statistical language models, which provide a general and robust representation of text data and enable probabilistic and statistical inferences about their content. These approaches can be applied to text data in any natural language and about any topic. In particular, large language models (LLMs), notably ChatGPT, have recently emerged as a highly successful family of language models that have shown human-level performance on many natural language processing (NLP) tasks. They enable the development of many interesting novel technologies for text retrieval and analysis, especially conversational information access, question answering, and summarization.
This course provides an introduction to the statistical language models that have been applied to text data retrieval and analysis, including both traditional probabilistic language models and the newly developed large (neural) language models, and explores how language models can be used to build intelligent agents for supporting user tasks. There will be regular assignments to help students master the material taught in the instructor's lectures. A closed-book midterm exam, mostly covering the problems in those assignments, will be given in the middle of the semester to ensure that students have a solid mastery of the most important basic concepts and main techniques.
The course will also provide students opportunities to learn and discuss frontier research topics via literature survey and class presentations in the second half of the semester.
Finally, students will have an opportunity to complete a course project, which lets them explore a frontier research topic or an innovative application tailored to their individual interests. Group projects are encouraged, but not required. To accommodate different interests, students can choose between two project tracks: 1) Research Track, where students work on a new research problem, carry out the research, and write a research paper; and 2) Development Track, where students develop a new software tool or application system and make an open source contribution. There is no final exam for this course, but we will use the time the university allocates for the final exam to hold a student project presentation conference so that students can learn from the projects of their peers.
Prerequisite: Students are expected to have a good knowledge of basic probability and statistics in addition to programming skills at the level of CS225 or a similar programming course. Some background in one or more of the following areas would be a plus, but is not required: information retrieval, machine learning, natural language processing, data mining, or databases. If you are not sure whether you have the right background, please contact the instructor.