How These Notes Are Organized

Note

This site is a work in progress under active development. Check back regularly for updates.

These notes are designed to help you contribute to original research at applied venues like CLEF and Kaggle. They are structured in a dual-track format to help you develop both the strategic and the hands-on skills needed for success in research competitions.

Two Tracks

The content is split into two main themes that run concurrently. You can think of them as the "why" and the "how" of competition research.

  • Competition Strategy and Concepts: This track focuses on the research process. It covers how to analyze different competition platforms (Kaggle, CLEF, KDD Cup, NeurIPS), dissect past solutions to find research gaps, form effective teams, and develop a strong research proposal.

  • Applied Methods and Tooling: This track is all about hands-on, practical skills. It provides guides for using Georgia Tech's PACE high-performance computing (HPC) environment, including tools like SLURM and Apptainer. You'll also find walkthroughs for building ML/IR pipelines that use embeddings, transfer learning, and semantic search with libraries like PyTorch, Hugging Face, and Faiss.
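
To make the semantic search idea in the second track concrete, here is a minimal sketch in plain Python. It is not one of the site's walkthroughs: the three-dimensional vectors, the document names, and the `search` helper are all hypothetical stand-ins. In practice the embeddings would come from a pretrained model and the nearest-neighbor lookup would be delegated to a library such as Faiss.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def search(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings"; real ones have hundreds of dimensions.
DOC_VECS = {
    "doc_birds": [0.9, 0.1, 0.0],
    "doc_plants": [0.1, 0.9, 0.0],
    "doc_fungi": [0.0, 0.2, 0.9],
}

results = search([0.8, 0.2, 0.0], DOC_VECS, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

This brute-force loop compares the query against every document, which is fine for a toy corpus. An index library becomes worthwhile once the collection is too large to scan on every query.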

By the end, you'll be well equipped to propose and execute original research for competitive academic workshops.

Learning Outcomes

By working through these notes and participating in club activities, you'll learn to:

  • Critically evaluate research from platforms like CLEF and Kaggle to identify strengths and opportunities.
  • Design a structured research proposal that frames a problem and outlines a clear methodology.
  • Apply core ML/IR concepts like embeddings, transfer learning, and evaluation metrics (MAP/NDCG).
  • Use HPC resources such as PACE, with tools like SLURM and Apptainer, to run efficient and reproducible experiments.
  • Collaborate effectively using Git/GitHub and communicate your technical findings.
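
As a rough illustration of the evaluation metrics named above, the sketch below computes average precision (the per-query quantity that MAP averages) and NDCG in plain Python. The function names and toy inputs are illustrative assumptions, not code from these notes. Two simplifications are worth flagging: the ranked list is assumed to contain no duplicate IDs, and the ideal ordering for NDCG is built only from the gains in the ranked list itself, whereas a full evaluation would use every judged document for the query.

```python
import math


def average_precision(ranked_ids, relevant_ids):
    """Average precision for one query, with binary relevance."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """MAP over a list of (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(ranked_gains, k=None):
    """NDCG@k; the ideal ranking is the same gains sorted descending."""
    k = len(ranked_gains) if k is None else k
    ideal_dcg = dcg(sorted(ranked_gains, reverse=True)[:k])
    if ideal_dcg == 0:
        return 0.0
    return dcg(ranked_gains[:k]) / ideal_dcg


# Toy query: relevant documents "a" and "c" are retrieved at ranks 1 and 3.
ap = average_precision(["a", "b", "c", "d"], {"a", "c"})
print(f"AP = {ap:.4f}")
print(f"NDCG = {ndcg([1, 3]):.4f}")
```

Competition organizers usually publish an official scorer, so a hand-rolled version like this is best used for sanity checks rather than reported results.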

Prerequisites

To get the most out of these notes, it helps to have some background knowledge and be aware of the expected time commitment.

  • Background: These notes assume some familiarity with machine learning and information retrieval concepts, perhaps from courses like Machine Learning, Deep Learning, or NLP. Intermediate proficiency in Python and experience with libraries like NumPy and Pandas are required. Some exposure to the Linux command line will be necessary for using PACE.

  • Time Commitment: This is a hands-on group that requires active participation. You should expect to spend approximately 3-4 hours per week engaging with the material, which includes a 1-hour synchronous online meeting and 2-3 hours of asynchronous work on your own.

Required Tools

You'll need a reliable computer with a stable internet connection and a few key pieces of software:

All readings will consist of online documentation, research papers, and competition write-ups. No textbook purchase is required.