Syllabus

Note

This site is a work-in-progress and is actively being developed. Please check back frequently for updates.

Description

This seminar prepares students for original research contributions at evaluation-focused venues like CLEF. In a dual-track format, participants will first critically analyze the AI/ML/IR applied research landscape (Kaggle, KDD, NeurIPS, TREC, CLEF) to identify viable shared tasks, foster team formation, and initiate research proposals. Simultaneously, a hands-on track develops essential skills for using Georgia Tech's PACE supercomputing cluster, including SLURM, Apptainer, and building ML/IR pipelines with PyTorch and Hugging Face.

  • Track A (Applied Research Competition Discussion): Analyze major research competition platforms (e.g., Kaggle, CLEF, KDD Cup, NeurIPS Competitions, TREC), dissect methodologies and evaluation strategies from competition papers and reports, and identify research gaps, culminating in team formation and a preliminary proposal for participation in a CLEF 2026 shared task.
  • Track B (PACE & ML/IR Pipeline Development): Gain practical experience using the Georgia Tech PACE HPC environment (OnDemand, SLURM, Apptainer), build and evaluate a core ML/IR pipeline involving embeddings, transfer learning, fine-tuning, and semantic search, and utilize essential Python libraries (e.g., PyTorch, Hugging Face, scikit-learn, FAISS) and workflow tools.
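The Track B pipeline components above (embeddings and semantic search over them) can be sketched in miniature. This is a toy illustration, not course material: the `embed` and `search` functions are hypothetical names, and the "embedding" here is just a deterministic random vector seeded from the text's CRC32 — a stand-in for a real encoder such as a Hugging Face model — with cosine-similarity ranking standing in for what a FAISS index would do at scale.

```python
import zlib
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    """Toy deterministic 'embedding': seed an RNG from the text's CRC32.

    A real pipeline would call a trained encoder (e.g., a Hugging Face model).
    """
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)  # unit-normalize so dot product = cosine similarity

def search(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank corpus documents by cosine similarity to the query embedding."""
    q = embed(query)
    scores = {doc: float(embed(doc) @ q) for doc in corpus}
    return sorted(corpus, key=lambda d: scores[d], reverse=True)[:k]

docs = ["neural ranking models", "apptainer containers", "pasta recipes"]
print(search("neural ranking models", docs))
```

Swapping the toy `embed` for a sentence-transformer and the brute-force ranking for an ANN index is essentially the upgrade path Track B walks through.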

By the end of the seminar, students will be equipped to propose and execute original research for competitive academic workshops.

Learning Outcomes

Upon successful completion of this seminar, students will be able to:

  1. Critically Evaluate Research and Platforms: Analyze the structure, methodologies, and evaluation paradigms of applied AI/ML platforms (e.g., CLEF, Kaggle, NeurIPS), and critique diverse research outputs to identify strengths, weaknesses, and research opportunities.
  2. Design Research Proposals: Develop structured research proposals for shared tasks, including problem framing, methodology, evaluation plans, and collaboration strategies.
  3. Apply Core ML/IR Concepts: Understand key components of ML and information retrieval pipelines, such as embeddings, transfer learning, and evaluation metrics like MAP/NDCG.
  4. Leverage HPC and Engineering Tools: Utilize the PACE HPC environment and foundational tools (SLURM, Apptainer, MLflow/WandB) for efficient experimentation and reproducibility.
  5. Collaborate and Communicate Effectively: Use Git/GitHub for project collaboration and present technical findings clearly in both written and oral formats.
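Outcome 3 names NDCG among the key IR evaluation metrics. As a concrete reference point, the standard formulation — DCG with a log2 position discount, normalized by the ideal ordering of the same relevance grades — can be sketched as follows. This is a minimal illustration under standard definitions, not a graded implementation from the course.

```python
import numpy as np

def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain: sum of rel_i / log2(i + 1), positions i from 1."""
    rels = np.asarray(relevances, dtype=float)
    discounts = np.log2(np.arange(2, rels.size + 2))  # log2(2), log2(3), ...
    return float(np.sum(rels / discounts))

def ndcg(relevances: list[float]) -> float:
    """DCG normalized by the best achievable DCG for the same relevance grades."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(round(ndcg([3, 2, 3, 0, 1, 2]), 3))  # prints 0.961
```

In practice, Track B uses library implementations (e.g., scikit-learn's `ndcg_score`), but seeing the arithmetic once makes the metric's behavior at the top ranks easier to reason about.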

Prerequisites and Expectations

  • Background: Familiarity with machine learning and information retrieval concepts, typically from courses such as Machine Learning, Deep Learning, Natural Language Processing, or Computer Vision.
  • Programming: Intermediate proficiency in Python programming is required, including experience with libraries like NumPy, Pandas, and ideally some exposure to PyTorch or TensorFlow. Familiarity with basic command-line operations in a Linux environment is expected for PACE usage.
  • Time Commitment: This is a seminar-style course requiring active participation. Expect to spend approximately 3-4 hours per week, including a 1-hour synchronous online meeting and 2-3 hours of asynchronous hands-on work, readings, and assignments. This aligns with typical OMSCS course expectations.

Required Materials & Technology

  • Hardware: A reliable laptop or desktop computer meeting Georgia Tech's minimum requirements for online programs. Access to a stable, high-speed internet connection.
  • Software:
      • Modern web browser (Chrome or Firefox recommended).
      • VSCode with the Remote - SSH extension.
      • Access to Georgia Tech's PACE HPC environment (provided).
      • A GitHub account.
  • Readings: Course materials will primarily consist of online documentation, research papers (provided or accessed via GT Library), competition descriptions, and solution write-ups. No mandatory textbook purchase is required.

Schedule

Track A: Applied Research Competition Discussion

| Date       | Week | Track A Topic                                              | Notes        |
|------------|------|------------------------------------------------------------|--------------|
| 2025-08-18 | 1    | The "Why" of Applied Research & Initial Exploration        |              |
| 2025-08-25 | 2    | Deeper Dive into Research Platforms & Task Analysis        |              |
| 2025-09-01 | 3    | Analyzing Research Papers from Competitions                | Labor Day    |
| 2025-09-08 | 4    | Kaggle Solution Deconstruction & Strategy                  | CLEF Madrid  |
| 2025-09-15 | 5    | CLEF & Academic Competition Methodology Review             |              |
| 2025-09-22 | 6    | Identifying Research Gaps & Opportunities Across Platforms |              |
| 2025-09-29 | 7    | Initial CLEF Task Brainstorming & Focus                    |              |
| 2025-10-06 | 8    |                                                            | Fall Break   |
| 2025-10-13 | 9    | CLEF Task Shortlisting & Focused Literature Review         |              |
| 2025-10-20 | 10   | CLEF Team Formation Dynamics & Roles                       |              |
| 2025-10-27 | 11   | CLEF Proposal Structuring & Methodology Brainstorming      |              |
| 2025-11-03 | 12   | CLEF Proposal Peer Review Workshop & Refinement            |              |
| 2025-11-10 | 13   | CLEF Proposal Intensive & Finalization                     |              |
| 2025-11-17 | 14   | CLEF Team Proposal Presentations                           |              |
| 2025-11-24 | 15   |                                                            | Thanksgiving |
| 2025-12-01 | 16   | ARC Spring Team Formation                                  |              |
| 2025-12-08 | 17   | End of Term                                                |              |

Track B: PACE & ML/IR Pipeline Development

| Date       | Week | Track B Topic                                              | Notes        |
|------------|------|------------------------------------------------------------|--------------|
| 2025-08-18 | 1    | Git/GitHub & Initial PACE Onboarding                       |              |
| 2025-08-25 | 2    | VSCode Remote to PACE & Scientific Python Essentials       |              |
| 2025-09-01 | 3    | Embeddings/Representations & Introduction to SLURM         | Labor Day    |
| 2025-09-08 | 4    | EDA on Embeddings & Advanced SLURM Usage                   | CLEF Madrid  |
| 2025-09-15 | 5    |                                                            |              |
| 2025-09-22 | 6    | Transfer Learning with PyTorch & Hugging Face Trainer      |              |
| 2025-09-29 | 7    |                                                            |              |
| 2025-10-06 | 8    |                                                            | Fall Break   |
| 2025-10-13 | 9    | Parameter-Efficient Fine-Tuning (PEFT) in Practice         |              |
| 2025-10-20 | 10   | Semantic Search, IR Metrics (MAP/NDCG), ANN & Reranking    |              |
| 2025-10-27 | 11   | Apptainer for Advanced & Multimodal Workloads              |              |
| 2025-11-03 | 12   | Experiment Tracking (WandB/MLflow) & Workflow Management   |              |
| 2025-11-10 | 13   | HPC Job Monitoring (GPU), Debugging & PyTorch Memory       |              |
| 2025-11-17 | 14   | Compiling Module-wise Report & Presentation Preparation    |              |
| 2025-11-24 | 15   |                                                            | Thanksgiving |
| 2025-12-01 | 16   | ARC Spring Team Formation                                  |              |
| 2025-12-08 | 17   | End of Term                                                |              |