DS@GT ARC Seminar Notes
Repository: https://github.com/dsgt-arc/arc-seminar-notes
Introduction
DS@GT ARC Seminar Notes
Note
This site is a work-in-progress and is actively being developed. Please check back frequently for updates.
This seminar prepares students for original research contributions at evaluation-focused venues like CLEF. In a dual-track format, participants will first critically analyze the AI/ML/IR applied research landscape (Kaggle, KDD, NeurIPS, TREC, CLEF) to identify viable shared tasks, foster team formation, and initiate research proposals. Simultaneously, a hands-on track develops essential skills for using Georgia Tech's PACE supercomputing cluster, including SLURM, Apptainer, and building ML/IR pipelines with PyTorch and Hugging Face.
See the DS@GT ARC homepage for more information about the club.
Syllabus
Description
This seminar prepares students for original research contributions at evaluation-focused venues like CLEF. In a dual-track format, participants will first critically analyze the AI/ML/IR applied research landscape (Kaggle, KDD, NeurIPS, TREC, CLEF) to identify viable shared tasks, foster team formation, and initiate research proposals. Simultaneously, a hands-on track develops essential skills for using Georgia Tech's PACE supercomputing cluster, including SLURM, Apptainer, and building ML/IR pipelines with PyTorch and Hugging Face.
- Track A (Applied Research Competition Discussion): Analyze various research competition platforms (e.g., Kaggle, CLEF, KDD Cup, NeurIPS Competitions, TREC), dissect methodologies and evaluation strategies from competition papers and reports, identify research gaps, and culminate in the formation of teams and development of a preliminary proposal for participation in a CLEF 2026 shared task.
- Track B (PACE & ML/IR Pipeline Development): Gain practical experience using the Georgia Tech PACE HPC environment (OnDemand, SLURM, Apptainer), build and evaluate a core ML/IR pipeline involving embeddings, transfer learning, fine-tuning, and semantic search, and utilize essential Python libraries (e.g., PyTorch, Hugging Face, scikit-learn, FAISS) and workflow tools.
The seminar culminates in students being equipped to propose and execute original research for competitive academic workshops.
Learning Outcomes
Upon successful completion of this seminar, students will be able to:
- Critically Evaluate Research and Platforms: Analyze the structure, methodologies, and evaluation paradigms of applied AI/ML platforms (e.g., CLEF, Kaggle, NeurIPS), and critique diverse research outputs to identify strengths, weaknesses, and research opportunities.
- Design Research Proposals: Develop structured research proposals for shared tasks, including problem framing, methodology, evaluation plans, and collaboration strategies.
- Apply Core ML/IR Concepts: Understand key components of ML and information retrieval pipelines, such as embeddings, transfer learning, and evaluation metrics like MAP/NDCG.
- Leverage HPC and Engineering Tools: Utilize the PACE HPC environment and foundational tools (SLURM, Apptainer, MLflow/WandB) for efficient experimentation and reproducibility.
- Collaborate and Communicate Effectively: Use Git/GitHub for project collaboration and present technical findings clearly in both written and oral formats.
Prerequisites and Expectations
- Background: Familiarity with machine learning and information retrieval concepts, typically gained from courses such as Machine Learning, Deep Learning, Natural Language Processing, or Computer Vision.
- Programming: Intermediate proficiency in Python programming is required, including experience with libraries like NumPy, Pandas, and ideally some exposure to PyTorch or TensorFlow. Familiarity with basic command-line operations in a Linux environment is expected for PACE usage.
- Time Commitment: This is a seminar-style course requiring active participation. Expect to spend approximately 3-4 hours per week, including a 1-hour synchronous online meeting and 2-3 hours of asynchronous hands-on work, readings, and assignments. This aligns with typical OMSCS course expectations.
Required Materials & Technology
- Hardware: A reliable laptop or desktop computer meeting Georgia Tech's minimum requirements for online programs. Access to a stable, high-speed internet connection.
- Software:
- Modern web browser (Chrome, Firefox recommended).
- VSCode with Remote SSH extension.
- Access to Georgia Tech's PACE HPC environment (provided).
- GitHub account
- Readings: Course materials will primarily consist of online documentation, research papers (provided or accessed via GT Library), competition descriptions, and solution write-ups. No mandatory textbook purchase is required.
Schedule
Track A: Applied Research Competition Discussion
Date | Week # | Track A Topic | Deliverables / Events |
---|---|---|---|
2025-08-18 | 1 | The "Why" of Applied Research & Initial Exploration | |
2025-08-25 | 2 | Deeper Dive into Research Platforms & Task Analysis | |
2025-09-01 | 3 | Analyzing Research Papers from Competitions | Labor Day |
2025-09-08 | 4 | Kaggle Solution Deconstruction & Strategy | CLEF Madrid |
2025-09-15 | 5 | CLEF & Academic Competition Methodology Review | |
2025-09-22 | 6 | Identifying Research Gaps & Opportunities Across Platforms | |
2025-09-29 | 7 | Initial CLEF Task Brainstorming & Focus | |
2025-10-06 | 8 | Fall Break | |
2025-10-13 | 9 | CLEF Task Shortlisting & Focused Literature Reviewing | |
2025-10-20 | 10 | CLEF Team Formation Dynamics & Roles | |
2025-10-27 | 11 | CLEF Proposal Structuring & Methodology Brainstorming | |
2025-11-03 | 12 | CLEF Proposal Peer Review Workshop & Refinement | |
2025-11-10 | 13 | CLEF Proposal Intensive & Finalization | |
2025-11-17 | 14 | CLEF Team Proposal Presentations | |
2025-11-24 | 15 | Thanksgiving | |
2025-12-01 | 16 | ARC spring team formation | |
2025-12-08 | 17 | End of term |
Track B: PACE & ML/IR Pipeline Development
Date | Week # | Track B Topic | Deliverables / Events |
---|---|---|---|
2025-08-18 | 1 | Git/GitHub & Initial PACE Onboarding | |
2025-08-25 | 2 | VSCode Remote to PACE & Scientific Python Essentials | |
2025-09-01 | 3 | Embeddings/Representations & Introduction to SLURM | Labor Day |
2025-09-08 | 4 | EDA on Embeddings & Advanced SLURM Usage | CLEF Madrid |
2025-09-15 | 5 | ||
2025-09-22 | 6 | Transfer Learning with PyTorch & Hugging Face Trainer | |
2025-09-29 | 7 | ||
2025-10-06 | 8 | Fall Break | |
2025-10-13 | 9 | Parameter-Efficient Fine-Tuning (PEFT) in Practice | |
2025-10-20 | 10 | Semantic Search, IR Metrics (MAP/NDCG), ANN & Reranking | |
2025-10-27 | 11 | Apptainer for Advanced & Multimodal Workloads | |
2025-11-03 | 12 | Experiment Tracking (WandB/MLflow) & Workflow Management | |
2025-11-10 | 13 | HPC Job Monitoring (GPU), Debugging & PyTorch Memory | |
2025-11-17 | 14 | Compiling Module-wise Report & Presentation Preparation | |
2025-11-24 | 15 | Thanksgiving | |
2025-12-01 | 16 | ARC spring team formation | |
2025-12-08 | 17 | End of term |
Competition
An Overview of Competition Research
This section of the seminar focuses on the process of conducting research and is aimed at individuals who have experience with machine learning or data science projects but may be new to formal research. Many students in the OMSCS program possess strong analytical and technical skills from their professional careers that are directly transferable to applied research competitions. This guide is designed to bridge the gap between that practical experience and the structured process of academic and competition-based research.
The following chapters will walk through the essential soft skills required to navigate the research landscape. We will begin by outlining useful background knowledge before discussing how to choose a research venue, focusing on Kaggle, the CLEF Conference, and workshops associated with conferences like TREC, NTCIR, KDD, and NeurIPS. We will then cover how to perform a literature review to understand the current state of the art for a given problem.
Subsequently, we will detail how to develop a research proposal, a crucial step for anyone looking to lead or recruit for a team. A significant portion is dedicated to team formation, as assembling a group with the right mix of interests, skills, and time commitment is often the most challenging aspect of the process. While this guide will describe the paper writing process, it assumes some familiarity from a paper-heavy course. Please note that a deep dive into the specifics of conducting experiments is considered beyond the scope of this introductory seminar and is reserved for the active competition teams. This chapter provides the framework for identifying where you are on the research frontier and what opportunities are available to you.
Useful Background
This section outlines the skills and experience that are beneficial for contributing effectively to a research competition team. While formal research experience is not a prerequisite, a strong foundation in related areas is essential.
Foundational Experience and Eligibility
Official eligibility extends to all Georgia Tech students (undergraduate, graduate, online) and alumni. Beyond this, we look for individuals who have demonstrated experience with complex projects, either through project-heavy coursework or full-time software engineering roles. The ability to work with large codebases and navigate complex systems is crucial.
Equally important are transferable organizational skills. Experience in roles similar to program management, where you are responsible for scheduling meetings, tracking progress, and communicating requirements and timelines, is highly valuable. These skills are fundamental to the successful coordination of a research team.
Core Technical Skills
A broad set of technical skills underpins success in applied research competitions. While no single person is expected to be an expert in all areas, proficiency in several is expected.
Mathematical Foundations
A working knowledge of certain mathematical concepts is frequently required. An understanding of linear algebra is essential for working with the embedding spaces common in modern machine learning, including concepts like dimensionality reduction. Probability and statistics are critical for designing experiments and determining if the results are statistically significant. A basic understanding of calculus is also beneficial.
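As a concrete toy illustration of both ideas, the sketch below uses NumPy to compute the cosine similarity between two made-up embedding vectors and SciPy to run a paired t-test on hypothetical per-query scores from a baseline and a proposed system; every number here is a placeholder.

import numpy as np
from scipy import stats

# Cosine similarity between two (made-up) embedding vectors.
a = np.array([0.1, 0.7, 0.2])
b = np.array([0.2, 0.6, 0.1])
cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine similarity: {cos_sim:.3f}")

# Paired t-test on per-query scores from two systems (illustrative values).
baseline = np.array([0.61, 0.55, 0.70, 0.48, 0.66])
proposed = np.array([0.64, 0.59, 0.71, 0.50, 0.70])
t_stat, p_value = stats.ttest_rel(proposed, baseline)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")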
Machine Learning and Data Engineering
You should be familiar with fundamental machine learning concepts such as the distinction between classification and regression, and the purpose of data splits. Proficiency with the modern machine learning stack is key, including PyTorch and the Hugging Face ecosystem. It is helpful to understand core concepts behind large language models, such as the Transformer architecture, attention mechanisms, and fine-tuning strategies like parameter-efficient fine-tuning (PEFT). Strong data engineering skills are also highly transferable. This includes the ability to build data pipelines (e.g., converting data from XML to Parquet), parallelize jobs for distributed systems, and work with datasets that are larger than memory.
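For example, a minimal XML-to-Parquet conversion might look like the sketch below. It assumes a flat XML file with one record per element, pandas 1.3+ with lxml for read_xml, and pyarrow for Parquet output; the file names and columns are placeholders.

import pandas as pd

# Read a flat XML file (one record per element) into a DataFrame,
# then write it out as Parquet for efficient columnar access.
df = pd.read_xml("records.xml")                 # requires lxml (or parser="etree")
df.to_parquet("records.parquet", index=False)   # requires pyarrow or fastparquet

# Downstream jobs can read back only the columns they need.
subset = pd.read_parquet("records.parquet", columns=list(df.columns[:2]))
print(subset.head())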
Information Retrieval and Systems
Many competitions involve information retrieval. Experience with search concepts like BM25 and cosine similarity, as well as search systems like Faiss, Anserini, or Elasticsearch, is a significant advantage. General software and systems engineering proficiency is non-negotiable. You must be comfortable with the Linux terminal, version control with Git (and platforms like GitHub/GitLab), and containerization with tools like Docker. The ability to quickly learn and integrate new tools into a workflow is essential.
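As an illustration of the embedding side of this, the sketch below runs an exact cosine-similarity search with Faiss by building an inner-product index over L2-normalized vectors; the random data, dimensionality, and number of neighbors are arbitrary placeholders.

import numpy as np
import faiss

d = 128                                      # embedding dimension (arbitrary)
corpus = np.random.rand(1000, d).astype("float32")
queries = np.random.rand(5, d).astype("float32")

# Normalize so that inner product equals cosine similarity.
faiss.normalize_L2(corpus)
faiss.normalize_L2(queries)

index = faiss.IndexFlatIP(d)                 # exact (brute-force) inner-product index
index.add(corpus)
scores, ids = index.search(queries, 10)      # top-10 neighbors per query
print(ids[0], scores[0])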
Research Methodology
Finally, familiarity with the fundamentals of the research process is beneficial. This includes knowing how to conduct a literature review, how to structure a research proposal, and how to effectively read and analyze academic papers. Resources like the "Mining of Massive Datasets" and "Introduction to Information Retrieval" textbooks, along with tutorials like "the missing semester of your CS education," can help build this foundation.
It is important to note that this group is not intended for individuals undertaking their first major technical project. The expected workload is approximately 150 hours over a semester, equivalent to a 3-unit course. If you do not yet have experience with foundational data analysis tools like Pandas or NumPy, you are encouraged to take a project-heavy course and join the group in a future semester.
Choosing a Venue
Selecting an appropriate venue is a foundational step in the research process. Our group has historically focused on two main types of venues that cater to different goals and interests. The first is Kaggle, a data science competition platform that provides an excellent environment for learning, often with the added incentive of prize money. The second, and the primary focus for our group's publication efforts, is CLEF (the Conference and Labs of the Evaluation Forum). We have a significant publication history at this European conference, with our contributions published as peer-reviewed working notes in the CEUR proceedings.
When selecting a competition, whether at CLEF, Kaggle, or another workshop, your decision should be driven by genuine interest. This interest typically stems from one of two motivations. The first is a passion for a specific domain. For example, a personal interest in a topic like herpetology can be a powerful motivator to contribute to a task like SnakeCLEF. The second is a desire to apply a particular technique. You may want to implement a method learned in a course or a research paper, such as applying network analysis principles to a citation dataset. Choosing a project that aligns with your intrinsic interests is critical for maintaining motivation throughout the semester.
Finally, it is essential to understand the requirements of your chosen venue and to be realistic about your own commitment. The bar for publication at CLEF requires submitting a functional system and a well-written, reproducible paper that details an interesting aspect of your work. Other academic venues may have a much higher bar for novelty, while a Kaggle competition might only require functional code. This commitment is not trivial; historically, only 50-75% of members who begin a project see it through to completion. Before committing to a team and a venue, ensure you have both a legitimate interest in the topic and the time required to contribute meaningfully.
Kaggle
Kaggle has established itself as a central platform for the global data science and machine learning community, providing a multifaceted environment for learning, competition, and collaboration. Acquired by Google in 2017, it has grown to host over 15 million registered users from 194 countries as of October 2023.
Platform Structure
Kaggle's ecosystem is built around several key components:
- Competitions: This is arguably Kaggle's most well-known feature. Competitions are diverse, ranging from "Featured" competitions, which are high-profile challenges often sponsored by companies with substantial monetary prizes, to "Research" competitions that focus on novel scientific problems. "Playground" competitions offer a less intense environment for learning and experimentation, often with swag as prizes, while "Community" competitions are created by users themselves. A significant development is the prevalence of "Code Competitions," where participants submit their solutions as code within Kaggle Notebooks, ensuring a consistent hardware environment and often restricting external data access or internet connectivity during execution to promote fairness and reproducibility. Some competitions adopt a "Two-Stage" structure, where an initial phase is followed by a second phase with a new test dataset, adding a layer of complexity and testing model robustness. Examples of ongoing competitions include the "ARC Prize 2025" (Featured, $725,000 prize) and "BirdCLEF+ 2025" (Research, $50,000 prize).
- Datasets: Kaggle hosts a vast repository of datasets, contributed by both competition organizers and the wider user community. This resource is invaluable for independent projects, research, and learning beyond the scope of formal competitions.
- Notebooks (formerly Kernels): This web-based data science environment allows users to write and execute code (primarily Python and R), share their analyses, and collaborate on projects. Notebooks are integral to "Code Competitions" and facilitate learning from publicly shared code, enhancing reproducibility.
- Discussion Forums: Each competition, dataset, and notebook has associated discussion forums, which are vibrant spaces for asking questions, sharing insights, providing feedback, and fostering collaboration among users. Kaggle maintains community guidelines to ensure these interactions remain productive and respectful.
- Learn: Kaggle provides a curated set of tutorials and courses covering fundamental machine learning concepts and practical data science skills, serving as an accessible entry point for beginners.
The evolution of competition formats on Kaggle, particularly the rise of Code Competitions, reflects broader trends in the ML field. As models become more complex and resource-intensive, and as the community places greater emphasis on reproducibility and the entire analytical pipeline, these formats provide a more controlled and equitable environment. This contrasts with earlier "Simple Competitions" that relied solely on the upload of prediction files.
Common Task Types
Kaggle competitions span a wide array of machine learning tasks. These include, but are not limited to:
- Predictive Modeling: Classification and regression tasks are foundational, such as predicting survival on the Titanic (a classic beginner competition) or forecasting house prices.
- Computer Vision: Tasks like image classification, object detection, and facial keypoints detection are common. The "Image Matching Challenge 2025" aims to reconstruct 3D scenes from image collections.
- Natural Language Processing (NLP): Sentiment analysis, text classification, and question answering appear regularly.
- Time Series Forecasting: Predicting future values based on historical data, exemplified by the "Jane Street Real-Time Market Data Forecasting" competition.
- Specialized & Research-Oriented Tasks: Kaggle also hosts challenges on more domain-specific or frontier problems, such as predicting RNA 3D folding, isolated sign language recognition, developing physics-guided ML models for geophysical waveform inversion, or even building AI to generate SVG images using Large Language Models (LLMs).
The Kaggle Community
The Kaggle community is a defining feature of the platform. Its large and global user base actively engages in collaboration through team formation, public code sharing in Notebooks, and extensive discussions in the forums. A key element fostering this engagement is the Progression System. Users can advance through five tiers—Novice, Contributor, Expert, Master, and Grandmaster—based on their achievements in Competitions, Datasets, Notebooks, and Discussions. Performance in competitions is recognized with Bronze, Silver, and Gold medals, awarded based on a team's rank relative to the number of participants. This gamified system incentivizes active participation, continuous learning, and the production of high-quality work, contributing to a dynamic and competitive ecosystem. Beyond the competitive aspect, Kaggle serves as a platform for individuals to showcase their skills to potential employers and network with other professionals, potentially leading to career opportunities.
The open nature of Kaggle, with its emphasis on shared notebooks and active discussion forums, has a profound impact on how ML solutions are developed and disseminated. It democratizes access to state-of-the-art techniques, allowing individuals worldwide to learn from top performers and rapidly iterate on existing solutions. This can accelerate learning and lead to a convergence on effective approaches for particular problem types. However, this same openness can sometimes contribute to a degree of homogenization in solutions, where popular architectures or pre-processing pipelines become dominant. The progression system, while motivating, can also incentivize building upon successful public solutions. Consequently, achieving true innovation on Kaggle often requires not only mastering established best practices but also identifying unique insights or developing novel approaches that diverge from the prevailing high-scoring strategies.
Furthermore, Kaggle's role has expanded beyond being just a competition platform. Google's acquisition and the hosting of "Recruiting Competitions" underscore its significance as a major talent incubator. The types of "Featured" competitions, frequently sponsored by leading technology companies and other organizations, often reflect pressing industry challenges and the kinds of complex problems for which businesses are actively seeking ML-driven solutions. Success in these high-stakes competitions can directly enhance career prospects and visibility within the field.
Analyzing a Kaggle Competition
To effectively participate in a Kaggle competition, a thorough understanding of its components is essential. Key elements typically found on a competition page (e.g., the "BirdCLEF+ 2025" competition) include:
- Overview/Description: This section outlines the problem statement, its real-world context or motivation, and the specific goals of the competition.
- Data: Provides details about the dataset(s) used, including their structure, format, how to download them, and often, exploratory data analysis notebooks.
- Evaluation: Crucially, this section specifies the metric used to score submissions and rank participants, along with the required submission file format.
- Rules: Outlines eligibility criteria, rules for team formation and mergers, limits on daily submissions, policies regarding the use of external data, and any specific constraints for code competitions (e.g., time limits, hardware).
- Leaderboard: Displays the rankings of participants based on their submission scores, often split into public (based on a subset of the test data) and private (based on the full test data, revealed at the end) leaderboards.
- Discussion Forum: The central hub for participants to ask questions, share insights, discuss approaches, and report issues.
- Notebooks: A collection of public notebooks shared by organizers and participants, which can include starter code, data exploration, and example solutions.
CLEF
The Conference and Labs of the Evaluation Forum (CLEF) is a prominent European-based initiative dedicated to promoting research, innovation, and development in information access systems. A distinguishing feature of CLEF is its strong emphasis on multilinguality and multimodality, addressing the complexities of accessing and processing information across different languages and data types (e.g., text, image, video). CLEF also places significant importance on the advancement of evaluation methodologies, seeking to refine and extend traditional evaluation paradigms like the Cranfield model and explore innovative uses of experimental data.
CLEF's origins trace back to a track within the Text REtrieval Conference (TREC) focused on cross-language information retrieval (IR) for European languages. It became an independent initiative to expand coverage to more languages and a broader array of IR issues. This evolution reflects a broadening understanding of "information access," moving beyond traditional text retrieval to encompass diverse data types and user needs, as evidenced by its inclusion of tasks like species identification from media or health information access, and the integration of the INEX workshop on structured text retrieval.
CLEF's research agenda covers a wide spectrum of information retrieval challenges. While initially focused on monolingual, bilingual, and multilingual text retrieval, its scope has expanded considerably. The initiative now supports investigations into areas such as:
- Information retrieval for various European and non-European languages.
- Multimodal information access, integrating data from different sources like text, images, and audio.
- Specific application domains such as cultural heritage, digital libraries, social media, legal documents, and biomedical information.
- Evaluation of interactive and conversational information retrieval systems.
- Analysis of IR test collections and evaluation measures, including reproducibility and replicability issues.
Labs
The core operational structure of CLEF revolves around its Labs. These are essentially evaluation campaigns or tracks where specific research challenges are proposed, and participating research groups from academia and industry develop and test systems to address them. Each lab typically focuses on a particular theme or set of tasks. For instance, CLEF 2024 hosted a variety of labs, including:
- BioASQ: Large-scale biomedical semantic indexing and question answering.
- CheckThat!: Tasks related to fact-checking, such as check-worthiness estimation, subjectivity detection, and persuasion technique identification.
- ImageCLEF: A multimodal challenge involving image annotation, retrieval, and analysis across various domains (e.g., medical, social media).
- LifeCLEF: Species identification and prediction using various data types (e.g., images, audio), often with a conservation focus.
- PAN: Lab on stylometry, authorship analysis, and digital text forensics.
- Touché: Focus on argumentation systems, including argument retrieval and generation.
These labs provide the necessary infrastructure for system testing, tuning, and evaluation. A key contribution of many labs is the creation of reusable test collections (datasets and ground truth) that benefit the wider research community. Lab organizers define the tasks, provide the data, and specify the evaluation protocols. Participants then submit experimental "runs" (system outputs) and often follow up with "Working Notes" that detail their methodologies and findings. This lab-centric structure fosters the development of highly specialized research communities around specific IR challenges. The sustained focus on "evaluation methodologies", including experiments with novel review processes like result-less review (where papers are initially assessed on methodology and research questions before results are presented), indicates that CLEF actively contributes to shaping how research is conducted and assessed within these specialized domains, representing a meta-level contribution to the research landscape.
Working Notes
A significant output of participation in CLEF labs is the "Working Notes." These are technical reports authored by participating teams, describing the systems they developed and the experiments they conducted for the lab tasks. Key characteristics include:
- Publication: CLEF Working Notes are published as part of the CEUR Workshop Proceedings (CEUR-WS.org), making them citable and accessible to the research community. A list of past volumes is available on the CLEF Initiative website.
- Content: According to the guidelines, working notes should typically cover the tasks performed, main experimental objectives, the approach(es) used (including progress beyond the state-of-the-art), resources employed (datasets, tools), results obtained, a thorough analysis of these results, and perspectives for future work.
- Format and Submission: There is generally no strict upper page limit, though conciseness and effectiveness are encouraged. Submission is handled electronically, often via EasyChair, and specific formatting templates (e.g., CEUR-WS templates) are provided.
- Purpose: Working notes serve as a means for rapid dissemination of detailed experimental findings, methodologies, and even negative results, which are valuable for the scientific community. They represent a less formal but often more detailed account of experimental work compared to traditional conference papers. Some labs, even those run on external platforms like Kaggle but affiliated with CLEF (e.g., BirdCLEF+), encourage participants to submit working notes to the main CLEF conference, sometimes with awards for the best contributions. This system acts as a bridge, facilitating the quick sharing of experimental insights while still allowing for more polished, archival publications later.
Analyzing a Lab Task
When approaching a CLEF lab task, the analysis should focus on:
- Specific Research Questions: Identify the precise questions the lab and its constituent tasks aim to answer.
- Data Characteristics: Understand the nature of the provided datasets, paying close attention to multilingual aspects, multimodal features, and any specific annotations.
- Evaluation Methodology: Scrutinize the evaluation metrics and protocols, as these are often carefully designed by experts to assess particular system capabilities or nuances of the problem.
An illustrative example is the CLEF 2024 CheckThat! Lab, Task 1: Check-worthiness estimation.
- Main Problem: The core objective is to determine if a given piece of text—sourced from diverse genres like tweets or political debates—is "check-worthy." This involves assessing whether the text contains a verifiable factual claim and evaluating its potential for causing harm if the claim is false, thereby prioritizing it for fact-checking. For the 2024 edition, this task was offered in Arabic, Dutch, and English.
- Evaluation Metric: Performance in this task is measured using the macro-averaged F1-score. This metric calculates the F1-score (harmonic mean of precision and recall) for each class (check-worthy and not check-worthy) independently and then averages these scores. This approach ensures that performance on potentially less frequent but critical check-worthy claims is given due weight.
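As a quick toy illustration (not drawn from the lab data) of why the macro average matters on imbalanced classes, compare accuracy against macro F1 with scikit-learn:

from sklearn.metrics import accuracy_score, f1_score

# Toy labels: 1 = check-worthy, 0 = not check-worthy (illustrative only).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

print("accuracy:", accuracy_score(y_true, y_pred))             # 0.90
print("macro F1:", f1_score(y_true, y_pred, average="macro"))  # ~0.80

The missed check-worthy claim barely moves accuracy but noticeably lowers the macro F1, which is exactly the behavior the metric is chosen to penalize.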
Other Venues
Beyond Kaggle and CLEF, several other platforms and major academic conferences play vital roles in the applied AI/ML research landscape.
TREC (Text REtrieval Conference)
- Focus: TREC is a long-standing series of workshops, initiated in 1992 and run by the U.S. National Institute of Standards and Technology (NIST). Its core mission is to support and encourage research within the information retrieval (IR) community by providing the necessary infrastructure for large-scale evaluation of text retrieval methodologies and to accelerate the transfer of technology from research labs to commercial products.
- Characteristics: TREC is organized into "tracks," each focusing on a particular subproblem or variant of the retrieval task. Over the years, tracks have covered diverse areas such as ad-hoc retrieval, question answering, cross-language IR, genomics IR, legal IR, and web search. For TREC 2024, continuing tracks included AToMiC (Authoring Tools for Multimedia Content) and NeuCLIR (Neural Cross-Language Information Retrieval), while new tracks included RAG (Retrieval-Augmented Generation). NIST typically provides large text collections and a set of questions (topics). Participating groups run their retrieval systems on this data and submit their results (e.g., ranked lists of documents). NIST then performs uniform scoring, often using evaluation techniques like pooling, where relevance judgments are made on a subset of documents retrieved by multiple systems.
- Outputs: Participants submit ranked lists of documents or other outputs specific to the track's task. The results, methodologies, and experiences are shared and discussed at the annual TREC workshop, and overview papers for each track are published in the TREC proceedings.
- TREC has been instrumental in advancing IR research by creating valuable, large-scale test collections and fostering a collaborative evaluation environment. Its track-based structure allows for focused research on a wide array of IR challenges. The introduction of the RAG track in 2024 is a clear indication of TREC's responsiveness to current trends in AI, particularly the integration of LLMs with retrieval systems.
KDD Cup
- Focus: The KDD Cup is the premier annual competition in data mining and knowledge discovery, organized by the Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining (ACM SIGKDD). Its primary aim is to stimulate research and development in these fields by presenting challenging problems derived from diverse domains.
- Characteristics: KDD Cup challenges are renowned for often involving large, complex datasets and tasks that push the boundaries of current data mining techniques. Historical examples include the 2010 KDD Cup, which focused on predicting student answer correctness using one of the largest educational datasets at the time. More recently, the KDD Cup 2024 featured the "Open Academic Graph Challenge (OAG-Challenge)" for academic graph mining and the "Multi-Task Online Shopping Challenge for LLMs" hosted by Amazon.
- Outputs: Participants develop solutions to the posed problems, and the results and winning approaches are typically presented at a dedicated workshop during the annual KDD conference.
- The KDD Cup holds significant prestige and its challenges often set new directions in data mining research. The 2024 LLM challenge for online shopping, for instance, highlights the platform's alignment with contemporary advancements in AI.
NeurIPS Competitions
- Focus: Hosted as part of the Neural Information Processing Systems (NeurIPS) conference, one of the top-tier venues for machine learning research, these competitions aim to advance modern AI and ML algorithms. There is a strong encouragement for proposals that address clear scientific questions and have a positive societal impact, particularly those leveraging AI to support disadvantaged communities or to advance other scientific, technological, or business domains relevant to the NeurIPS community.
- Characteristics: NeurIPS features a dedicated Competition Track, with each accepted competition typically having an associated workshop where results are presented and discussed by organizers and participants. The tasks are often novel, cutting-edge, and interdisciplinary. Examples include the 2024 challenge on predicting high-resolution rain radar movies from multi-band satellite sensors, requiring data fusion and video frame prediction, and past competitions on causal structure learning, multi-agent reinforcement learning (e.g., the "Melting Pot Contest"), and foundation model prompting for medical image classification.
- Outputs: Competition results are presented at the NeurIPS workshops. Organizers and participants also have the option to submit post-competition analysis papers to the NeurIPS Datasets and Benchmarks (D&B) track in the subsequent year.
- NeurIPS competitions are situated at the forefront of ML research, frequently exploring emerging areas and placing a strong emphasis on scientific rigor, methodological innovation, and potential societal benefits.
CVPR/ICCV Challenges
- Focus: The Conference on Computer Vision and Pattern Recognition (CVPR) and the International Conference on Computer Vision (ICCV) are the premier international conferences in the field of computer vision. Both conferences host a multitude of workshops, many of which include associated challenges.
- Characteristics: These challenges cover an extensive range of computer vision tasks. Examples from CVPR 2024 workshops include challenges on 3D scene understanding (e.g., the ScanNet++ Novel View Synthesis and 3D Semantic Understanding Challenge), efficient large vision models, human modeling and motion generation, multimodal learning, and various application-driven challenges in domains like agriculture (Agriculture-Vision), sports (CVsports), retail (RetailVision), autonomous driving (WAD workshops often feature multiple challenges, e.g., End-To-End Driving at Scale, Occupancy and Flow), and medical imaging. These challenges typically involve large-scale, highly specialized image or video datasets.
- Outputs: Participants submit their solutions, which are evaluated based on task-specific metrics. Winners and notable solutions are often announced and presented at the corresponding workshops, and results may be summarized in workshop proceedings or overview papers.
- Challenges at CVPR and ICCV are central to driving progress in computer vision, pushing the state-of-the-art in specific sub-fields, and providing crucial benchmarks for new algorithms and techniques. The sheer breadth of topics covered in the CVPR 2024 workshop list attests to the dynamism and scope of research in this area.
Our Research Process
Literature Review
How to conduct a literature review.
Work in Progress
Research Proposal
Proposing a Research Project
Purpose of a Proposal
A research proposal is the initial checkpoint for any competition project within DS@GT ARC. Its purpose is to demonstrate that the project is well-conceived and warrants the allocation of time and resources. A completed proposal is required for mentors to approve enrollment for credit in CS8903. It also serves as a foundational document for recruiting team members and establishing a clear plan for the semester. A standard proposal should be approximately two pages in length.
Required Content
A proposal must contain several key sections. It should begin with a concise overview of the competition, including the organizing body, the primary task, and all relevant deadlines. This is followed by the project motivation, which states the rationale for selecting this problem, such as its potential impact or technical novelty. The document must also describe the provided dataset, detailing its size, format, and structure, and specify the official metric used for evaluation. It is essential to summarize any preliminary research, including baselines or prior work, to provide context for the proposed approach. The core of the proposal is the detailed technical methodology, which should cite foundational research papers or software libraries. Finally, the proposal must include a high-level project timeline with monthly milestones for key phases like data processing, model development, final submission, and report writing.
Feasibility Assessment
This section evaluates the practical viability of the project. It should include an analysis of the project's feasibility relative to the team's current skills, available computational resources, and the competition timeline. It is also necessary to identify potential risks, such as data quality issues or computational constraints, and propose a clear mitigation strategy for each. The assessment should conclude by acknowledging any areas where the team will need to acquire new knowledge or seek mentorship.
Proposal Utilization
Once approved, the proposal serves several functions throughout the project lifecycle. A well-structured proposal enables prospective teammates to understand the project's scope and objectives, which facilitates recruitment. It also acts as a roadmap for project management and should be referenced during weekly updates, reviews, and the preparation of the final report. As a model for balancing data preparation, modeling goals, and a timeline, teams should consult the GeoLifeCLEF 2024 proposal located in the group repository. The approval of the proposal marks the official commencement of the project.
Research Proposal Example: GeoLifeCLEF 2024
CS8903: Special Problems Project Proposal
Name: Anthony Miyaguchi <[email protected]>
Student ID: amiyaguchi3 Date: 2023-11-08
- Main Proposal Idea: Lead a DS@GT team on the GeoLifeCLEF 2024 challenge and submit a working note paper at the CLEF 2024 conference.
- The GeoLifeCLEF 2023 Dataset to evaluate plant species distribution models at high spatial resolution across Europe
- SPECIAL PROBLEMS (8903) PERMIT
Spatio-temporal species distribution estimation for GeoLifeCLEF 2024 with unsupervised representation learning of remote sensing data
Objective
The objective of the special problems is to solve the GeoLifeCLEF 2024 challenge and publish a working notes paper to the CLEF 2024 conference detailing the implemented system, in collaboration with student peers at Georgia Tech. The resulting scope of work is estimated at 3 credit hours, or 150 hours of work, by the primary author.
Background and Motivation
CLEF is the Conference and Labs of the Evaluation Forum (originally the Cross-Language Evaluation Forum), an information retrieval conference with a heavy emphasis on experimentation through shared tasks. GeoLifeCLEF is a challenge hosted by the LifeCLEF lab within CLEF.
GeoLifeCLEF combines five million heterogeneous presence-only records and six thousand exhaustive presence-absence surveys collected from 2017 to 2021. Models are trained with environmental data like 10-meter resolution RGB and Near-Infra-Red satellite images and climatic variables.
Data Preprocessing
We transform domain-specific geospatial rasters (GeoTIFF) into a format optimized for distributed, parallel data access patterns (Parquet). We convert an area of interest (AOI) into a regular lattice of square tiles and store relevant features cropped by the bounding box of its tile. We store all data in a Parquet dataset to load in bulk to Spark or Torch.
We create two development datasets with a maximum partition size of 1GB. The first is a subset of the data that covers a small geographic area encompassing a city, forest, and mountain. The second is a label dataset that contains the minimum features for density estimation, e.g., latitude, longitude, date, and positive indicator of species.
Modeling
Our system is composed of four models. We use Tile2Vec to embed geo-rasters and a linear operator estimator to embed high-dimensional time series. These models aim to learn a low-dimensional representation of the data that preserves certain geometric properties, such as the triangle inequality. We fit an ordinal regression to learn the relative frequency of biodiversity across a regular lattice of features. We do this by converting positive examples into ranked lists generated by nearest-neighbor labels in feature space and fitting a learning-to-rank model. Finally, we learn a generative model of the data to generate biodiversity rasters and images using priors from the ordinal regression.
Our baseline model is a species model derived from geolocation and date. We measure improvement upon the baseline by adding learned geo and time series embeddings via ablation study.
End-to-end Task
We submit the results of our system, intending to reach first place on the leaderboard. We intend to see significant improvements between baseline models and more complex models. In addition to submitting to the leaderboard, we generate detailed rasters/images of various species for visualization.
Timeline
LifeCLEF 2024 | ImageCLEF / LifeCLEF - Multimedia Retrieval in CLEF
- Jan 2024: registration opens for all LifeCLEF challenges
- Jan-March 2024: training and test data release
- 6 May 2024: deadline for submission of runs by participants
- 13 May 2024: release of processed results by the task organizers
- 31 May 2024: deadline for submission of working note papers by participants [CEUR-WS proceedings]
- 24 June 2024: notification of acceptance of participant's working note papers [CEUR-WS proceedings]
- 8 July 2024: camera ready copy of participant's working note papers and extended lab overviews by organizers
- 9-12 Sept 2024: CLEF 2024 Grenoble - France
Date | Week | Task/Topic | Deliverable/Events |
---|---|---|---|
2024-01-08 | 1 | Engineering - Download training and testing dataset from 2023/2024 | Competition start |
2024-01-15 | 2 | Exploratory Data Analysis | |
2024-01-22 | 3 | Engineering - Schema and Parquet | |
2024-01-29 | 4 | Engineering - Schema and Parquet | Parquet datasets in GCS, dev set of data (<1GB single partition) available for exploratory modeling |
2024-02-05 | 5 | Modeling - Learning to Rank | |
2024-02-12 | 6 | Modeling - Gaussian Mixture Models and Stochastic Variational Inference | |
2024-02-19 | 7 | Modeling - Tile2Vec | |
2024-02-26 | 8 | Modeling - Tile2Vec | |
2024-03-04 | 9 | Modeling - Tile2Vec | |
2024-03-11 | 10 | Modeling - Koopman Operator, SVD, Dynamic Mode Decomposition | Working notes of dataset and model description |
2024-03-18 | 11 | Spring Break | |
2024-03-25 | 12 | Engineering - Embedding cache, indexing, and search; Modeling - Ordinal regression | |
2024-04-01 | 13 | Engineering - Model pipeline | First submission to the competition, screenshot of leaderboard |
2024-04-08 | 14 | Engineering - Model pipeline | |
2024-04-15 | 15 | Ablation Study, Hyperparameter Tuning | |
2024-04-22 | 16 | Ablation Study, Hyperparameter Tuning | |
2024-04-29 | 17 | Finals, Working notes | Submission deadline for competition, first draft of working notes, screenshot of leaderboard, parquet dataset in GCS |
2024-05-06 | 18 | Summer, Working notes revision |
Infrastructure
Code is hosted on GitHub at https://github.com/dsgt-kaggle-clef/geolifeclef-2024. Cloud compute and storage are on Google Cloud Platform under a personal billing account.
Collaboration and Supervision
This project stems from collaboration within the Data Science at Georgia Tech (DS@GT) student group. Prior submissions from the DS@GT team to the CLEF conference have won $5,000 worth of prizes across two best working note competitions.
As the DS@GT GeoLifeCLEF 2024 team lead, I would be collaborating with two fellow OMSCS students. The time-commitment estimate (3 credit hours) is for independent work that I carry out in the context of shared responsibilities in the team.
The supervising faculty member for the project is responsible for administration, such as registration and grading, with no expectation of advising the research process (although pointers are greatly appreciated). The supervisor will grade based on an article for publication that is in a state ready for early review.
References
Botella, C., Deneu, B., Marcos, D., Servajean, M., Estopinan, J., Larcher, T., ... & Joly, A. (2023). The GeoLifeCLEF 2023 Dataset to evaluate plant species distribution models at high spatial resolution across Europe. arXiv preprint arXiv:2308.05121., https://arxiv.org/abs/2308.05121
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., & Ermon, S. (2018). Tile2Vec: unsupervised representation learning for spatially distributed data. arXiv., https://arxiv.org/abs/1805.02855
Brunton, S. L., & Kutz, J. N. (2019). Data-driven science and engineering: Machine learning, dynamical systems, and control. Cambridge University Press., https://www.cambridge.org/core/books/datadriven-science-and-engineering/77D52B171B60A496EAFE4DB662ADC36E
Cao, Z., Qin, T., Liu, T. Y., Tsai, M. F., & Li, H. (2007, June). Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th international conference on Machine learning (pp. 129-136)., https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2007-40.pdf
Hoffman, M. D., Blei, D. M., Wang, C., & Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research., https://jmlr.org/papers/volume14/hoffman13a/hoffman13a.pdf
Forming Teams
Finding a Team
Before seeking a team, it is essential to have a clear idea of your research interests. A shared interest in the competition's subject matter is the foundation of a motivated team. For example, you should first identify a competition or dataset that genuinely interests you, such as BirdCLEF for avian bioacoustics or CheckThat! for fact-checking language models.
You can find teammates through the primary ARC Interest Group, within your academic courses (e.g., Deep Learning, Machine Learning), or on community platforms like the OMSCS Slack and the Ed Research Board.
Team Structure and Roles
Once a group with a shared research interest is formed, the first step is to establish a clear structure and define roles.
Team Lead
The Team Lead acts as the primary facilitator and organizer. This role is not necessarily the lead researcher but is responsible for the team's operational integrity. Key responsibilities include coordinating regular team meetings at a cadence that works for all members, ensuring all members are aware of competition deadlines and rules, serving as the main liaison between the team and the broader ARC organization, and taking ultimate responsibility for ensuring that final submissions are completed correctly and on time.
Team Member
Team members are the core contributors to the research project. All members are expected to be active participants. Key responsibilities include actively contributing ideas, code, and text for the final paper, allocating a significant amount of time to the project (estimated at 100-150 hours over the semester, roughly equivalent to a 3-unit course), engaging with the team on a regular basis with a minimum of once per week recommended, and being transparent about capacity while communicating early if unable to continue.
Communication and Collaboration Tools
For group communication, teams will use the main Data Science at Georgia Tech Slack for organization-wide news and can create private channels or use Microsoft Teams or Discord for internal discussion. Collaborative writing should be done using Overleaf for LaTeX or Google Docs for other documents. All code must be version-controlled using GitHub, with repositories hosted in the official Data Science at Georgia Tech ARC GitHub organization.
Conflict Resolution and Team Dynamics
Proactively manage team dynamics by setting clear expectations at the project's start regarding communication, workload, and standards. It is important to maintain professional empathy, recognizing that all participants are managing other commitments. Grant grace for unforeseen circumstances, but also hold team members accountable for their commitments. Should conflicts arise that cannot be resolved internally, teams should utilize available resources by seeking guidance from experienced members or ARC group leadership. Ultimately, consistent, early, and clear communication is the most effective tool for preventing and resolving team conflicts.
Conducting Experiments
Once a team is formed and a proposal is in place, the core of the research work begins with conducting experiments. This phase is heavily reliant on the ability to implement and manage large-scale systems. The evaluation-focused competitions we participate in often involve datasets ranging from tens to hundreds of gigabytes, demanding efficient data processing and robust code.
The Experimental Workflow
The process begins with downloading the dataset and performing a thorough exploratory data analysis (EDA). The goal of EDA is to develop a deep understanding of the data's characteristics, including its schema, the size and nature of the train/test splits, and the statistical properties of its main features. Following this analysis, you must design your experiments, starting with a clear and simple baseline. For an information retrieval task, this might involve running a BM25 keyword search. You then implement your novel methodology, which is intended to improve upon this baseline. A key part of this stage is conducting ablation studies, where you systematically remove components from your system to isolate and quantify their individual contribution to the overall performance.
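For instance, a keyword baseline for a retrieval task can be only a few lines; the sketch below uses the rank_bm25 package on a toy corpus, with the documents, query, and whitespace tokenization standing in for the real competition data and preprocessing.

from rank_bm25 import BM25Okapi

# Toy corpus and query; in practice these come from the competition data.
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "a fast auburn fox leapt over a sleeping hound",
    "information retrieval evaluates ranked lists of documents",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "fox jumping over dog".lower().split()
scores = bm25.get_scores(query)              # one BM25 score per document
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
print(ranking, scores)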
Organization and Reproducibility
As you conduct these experiments, meticulous organization of both code and data is paramount to ensure your results are reproducible. While specific organizational strategies are left to individual teams, it is essential to keep a detailed log of all experiments, their parameters, and their outcomes. This can be managed in a spreadsheet or integrated directly into your paper draft. Furthermore, you must be familiar with the standard evaluation formats required by the venue, such as the TREC-style format common in information retrieval conferences.
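For reference, a TREC-style run file has one whitespace-separated line per retrieved document: query ID, the literal Q0, document ID, rank, score, and a run tag. Here is a minimal sketch for writing one from an in-memory result dictionary (all names and values below are placeholders):

# results maps a query ID to a list of (doc_id, score) pairs,
# already sorted by descending score.
results = {
    "q1": [("doc42", 12.3), ("doc7", 9.8)],
    "q2": [("doc3", 4.1)],
}

with open("myrun.trec", "w") as f:
    for qid, ranked in results.items():
        for rank, (doc_id, score) in enumerate(ranked, start=1):
            # Format: qid Q0 doc_id rank score run_tag
            f.write(f"{qid} Q0 {doc_id} {rank} {score} my_run\n")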
Team Collaboration and Project Management
Beyond the technical execution, conducting experiments successfully is a significant project management challenge. A critical task for the team is to strategically break down the work into smaller, independent components that members can tackle in parallel. This process of decomposing the problem requires constant and clear communication among all team members. Effective communication is the most crucial tool for navigating the complexities of collaborative research, ensuring that workloads are distributed effectively and the project remains on track.
Writing a Paper
A primary goal of the ARC group is to publish research. Because applied research competitions provide a well-defined evaluation task, dataset, and metric, our focus in writing is to clearly document the system we build and any novel contributions within our workflow.
The Structure of a Competition Paper
A typical competition paper is organized into several key sections. The introduction provides an overview of the paper, context for the dataset and your solution, and the central thesis of your work. This is followed by the background and related work section, which situates your research by discussing past work in the competition, related systems, and any technical context necessary to understand your solution.
The core of the paper is the methodology, where you detail the unique aspects of your system. This includes the tools used, data transformations performed, and any ablation studies conducted to determine the contribution of different components. It should begin with simple baselines that your work improves upon. Crucially, this section must contain all necessary details to ensure your work is reproducible. After the methodology, the results section presents quantitative findings from applying your methods to the data, including pipeline runtimes and performance scores for all systems tested.
The discussion section offers your interpretation of the results and their implications. This is where you analyze why the system behaved as it did and discuss ideas that arose from the experimental outcomes. This section should also include a discussion of future work, outlining potential research directions if you had more time. Finally, the conclusion briefly summarizes the paper's main contributions and closes out the work.
The Writing Process and Timeline
While the intensive coding phase often precedes focused writing, preliminary sections like the literature review can be drafted early in the semester. The main writing effort typically requires 20 to 40 hours over a period of two to four weeks. It is advisable to begin this process as early as possible.
All papers should be written in a collaborative LaTeX environment like Overleaf, using the templates provided by the group or the conference venue. The quality of writing should be high, similar to that expected in a graduate-level course like Machine Learning. The goal is to clearly document the results of the hard work already completed during the research phase.
Using Generative AI Tools
Generative AI tools can be leveraged responsibly as part of the writing process. They are effective for assistive tasks like formatting data into tables or helping to find related works for a literature review. However, you are strongly discouraged from using these tools to generate large portions of your paper, especially the methodology, results, or discussion. Doing the analysis and writing by hand is a critical part of understanding the research domain and demonstrating your comprehension of the work. Using AI to automate the core analysis is a form of academic dishonesty and cheats you of a key learning experience. Always be transparent about your use of these tools and ensure you are representing your school and lab responsibly.
Submission and Peer Review
Once submitted, your paper will undergo peer review. Acceptance rates vary significantly by venue. For workshops like those at CLEF, our group has a high likelihood of acceptance, though you may be required to make revisions based on reviewer feedback. For more selective conferences, the bar for acceptance is much higher. If a paper is not accepted at a particular venue, the work can always be shared publicly by uploading it as a preprint to a server like arXiv.
Applied Methods
Work in Progress
Environment Setup
SSH and Git Setup
Authenticate to GitHub using GitHub CLI
This section streamlines the authentication process to GitHub using the GitHub CLI (`gh`), which simplifies the SSH setup.
You can find the GitHub CLI installation instructions here.
- Run `gh auth login` to begin the authentication process.
- When prompted, select SSH as the preferred protocol for Git operations.
- If you don't already have an SSH key, `gh` will prompt you to generate one. Follow the on-screen instructions to create a new SSH key.
- `gh` will automatically add your SSH key to your GitHub account. Follow any additional prompts to complete the process.
- After completing the setup, run `gh auth status` to check that you're successfully authenticated.
If you want to do it manually, check the GitHub page: Generating a new SSH key and adding it to the ssh-agent
Verify GitHub User Information (Optional)
It's good practice to ensure your Git identity is correctly set:
- Check Git configuration: run `git config --list` to see your Git settings, including user name and email.
- Set Git user information if not set:
git config --global user.email "[email protected]"
git config --global user.name "Your Name"
Replace these with your GitHub email and name.
Configuring SSH Host Aliases
It is useful to set up your `~/.ssh/config` on your host as follows:
Host pace
HostName login-phoenix.pace.gatech.edu
User your_username
Host pace-interactive
HostName atl1-1-02-007-30-1.pace.gatech.edu
User your_username
ProxyJump pace
This adds the host aliases `pace` and `pace-interactive`.
Make sure to add your public SSH key to `~/.ssh/authorized_keys` on PACE after logging in via `ssh your_username@login-phoenix.pace.gatech.edu`.
You can now access PACE using `ssh pace`, and it will log you in automatically. The `pace-interactive` alias uses the login node as a jump host, allowing you to run VS Code sessions on interactive compute nodes.
Read more about SSH config files: ssh_config(5) manual page
Add Authorized Keys to PACE
# Log into PACE
ssh pace
# Create .ssh directory if it doesn't exist
mkdir -p ~/.ssh
# Create or append to authorized_keys file
nano ~/.ssh/authorized_keys
# Paste your public key, save and exit (Ctrl+O, Enter, Ctrl+X)
# Set correct permissions
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
# Test your connection from your local machine
ssh pace
Updating pace-interactive Alias
Allocate a new interactive session on PACE. For example:
salloc --account=paceship-dsgt_clef2025 --nodes=1 --ntasks=1 --cpus-per-task=8 --mem-per-cpu=4G --time=2:00:00 --qos=inferno
Make sure to keep this terminal around. Get the hostname from the session:
$ hostname
atl1-1-02-007-30-1.pace.gatech.edu
Copy the hostname and update your `~/.ssh/config` file:
Host pace-interactive
HostName atl1-1-02-007-30-1.pace.gatech.edu
User your_username
ProxyJump pace
Then you can SSH via `ssh pace-interactive` from your host machine, either through the terminal or VS Code.
Note that this will also allow you to port forward any services running on these nodes.
Advanced SSH Configuration
Port Forwarding for Development
Common port forwarding scenarios for research work:
# Forward Jupyter notebook (local 8888 -> PACE 8888)
ssh -L 8888:localhost:8888 pace-interactive
Working with Git on PACE
Basic Git Setup on PACE
# SSH to PACE
ssh pace
# Load Git module (if using module system)
module load git
# Configure Git if not done already
git config --global user.name "Your Name"
git config --global user.email "[email protected]"
# Set VS Code as default editor if available
git config --global core.editor "code --wait"
# Verify configuration
git config --list
Clone and Work with Repositories
# Clone your research repository
git clone git@github.com:username/your-research-project.git
# Or clone using the GitHub CLI
gh repo clone username/your-research-project
Python Setup
Using UV for Python management
UV is a modern, fast Python package installer and resolver written in Rust. It's designed to be a drop-in replacement for pip and pip-tools, with significantly faster dependency resolution and installation.
Installing UV
# On Linux/macOS
curl -LsSf https://astral.sh/uv/install.sh | sh
# On Windows (PowerShell)
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
# Via pip (if you already have Python)
pip install uv
Basic UV Usage
# Install packages
uv pip install numpy pandas
UV vs Traditional Tools
- Speed: UV is 10-100x faster than pip for dependency resolution
- Lock files: Built-in support for lock files with `uv.lock`
- Resolution: More reliable dependency resolution
- Compatibility: Drop-in replacement for most pip commands
Package Management and Dependencies
Use `pyproject.toml` for managing project dependencies. This file allows you to specify your project's metadata and dependencies in a standardized way, as defined by PEP 621.
Modern Dependency Management with UV
Instead of manually editing requirements files, use `uv add` to add dependencies to your `pyproject.toml`:
# Add core dependencies
uv add numpy pandas matplotlib scikit-learn
# Add development dependencies
uv add --dev jupyter pytest black
# Add optional dependencies for specific features
uv add --optional ir "pyterrier>=0.9.0" "python-terrier>=0.4.0"
Example pyproject.toml
Use `uv init` to create a `pyproject.toml` file with the necessary structure. It should look something like this:
[project]
name = "arc-seminar-project"
version = "0.1.0"
description = "Research project for ARC seminar"
authors = [{name = "Your Name", email = "[email protected]"}]
dependencies = [
"numpy>=1.24.0", # https://numpy.org/
"pandas>=2.0.0", # https://pandas.pydata.org/
"matplotlib>=3.7.0", # https://matplotlib.org/
"scikit-learn>=1.3.0", # https://scikit-learn.org/
"torch>=2.0.0", # https://pytorch.org/
"transformers>=4.30.0", # https://huggingface.co/transformers/
]

[project.optional-dependencies]
# extras used with `uv sync --extra ir` (same specs as the `uv add --optional ir` example above)
ir = ["pyterrier>=0.9.0", "python-terrier>=0.4.0"]
Installing Dependencies
# Install base dependencies
uv sync
# Install with optional IR dependencies
uv sync --extra ir
# Install with development dependencies
uv sync --extra dev
# Install everything
uv sync --all-extras
Virtual Environments
Use `uv venv` to create and manage virtual environments easily. This will create a `.venv` directory in your project folder, which isolates your Python environment.
uv venv
source .venv/bin/activate # Linux/macOS
Essential Libraries
Core Data Science Stack
Package | Description |
---|---|
numpy | Fundamental package for numerical computations in Python. |
pandas | Data manipulation and analysis library, providing data structures like DataFrames. |
matplotlib | Plotting library for creating static, animated, and interactive visualizations in Python. |
scikit-learn | Machine learning library for Python, providing simple and efficient tools for data mining and data analysis. |
scipy | Library for scientific and technical computing, building on NumPy. |
Machine Learning and Deep Learning
Package | Description |
---|---|
torch | PyTorch library for deep learning, providing tensor computations and neural network capabilities. |
transformers | Hugging Face library for working with transformer models and datasets, particularly in NLP. |
datasets | Hugging Face library for accessing and processing datasets. |
tokenizers | Fast tokenizers for NLP preprocessing. |
Information Retrieval and Search
Package | Description |
---|---|
pyterrier | Python framework for information retrieval experimentation and research. |
pyserini | Lucene-based toolkit for reproducible information retrieval research. |
faiss-cpu | Facebook AI Similarity Search library for efficient similarity search and clustering. |
sentence-transformers | Library for sentence, text and image embeddings using transformer models. |
Workflow and Pipeline Management
Package | Description |
---|---|
luigi | Workflow management system for building complex data pipelines. |
wandb | Weights & Biases for experiment tracking and model management. |
Development and Productivity
Package | Description |
---|---|
jupyter | Interactive computing environment for notebooks. |
tqdm | Progress bars for Python loops and iterables. |
rich | Library for rich text and beautiful formatting in the terminal. |
Jupyter Setup
Installing Jupyter Lab/Notebook
Local Installation
Install Jupyter using UV (recommended for modern Python projects):
# Add Jupyter to your project
uv add jupyter
# Or install globally
uv tool install jupyter
# Alternative: Install JupyterLab (https://jupyterlab.readthedocs.io/), a more modern interface
uv add jupyterlab
# Or install both
uv add jupyter jupyterlab
Verify Installation
# Check Jupyter installation
jupyter --version
# Check JupyterLab installation
jupyter lab --version
# List available kernels
jupyter kernelspec list
Running Jupyter on PACE
Basic Setup on PACE
# SSH into PACE
ssh pace
# Load Python module (if using module system)
module load python/3.11
# Create or activate your virtual environment
source ~/.venvs/research-env/bin/activate
# Install Jupyter in your environment
uv add jupyterlab
# Alternative: using pip if UV not available
pip install jupyterlab
Running Jupyter on Login Node (Limited Use)
Only use login nodes for light testing. For actual work, use interactive or batch jobs.
# Quick test on login node (use sparingly)
jupyter lab --no-browser --port=8888
# Better: bind to all interfaces so the server is reachable from other machines/nodes
jupyter lab --no-browser --ip=0.0.0.0 --port=8888
Running Jupyter on Interactive Nodes (Recommended)
Method 1: Interactive Session + Port Forwarding
# 1. Allocate interactive session
salloc --account=paceship-dsgt_clef2025 --nodes=1 --ntasks=1 --cpus-per-task=8 --mem-per-cpu=4G --time=4:00:00 --qos=inferno
# 2. Note the allocated node hostname
hostname
# Example output: atl1-1-02-007-30-1.pace.gatech.edu
# 3. Update your SSH config (from your local machine), either by editing
#    ~/.ssh/config by hand as described above or with a small helper script, e.g.:
./update-pace-interactive.sh atl1-1-02-007-30-1.pace.gatech.edu
# 4. Start Jupyter on the interactive node
jupyter lab --no-browser --ip=0.0.0.0 --port=8888
# 5. From another terminal on your local machine, forward the port
ssh -L 8888:localhost:8888 pace-interactive
Method 2: SLURM Batch Job for Long-Running Notebooks
Create a SLURM script `jupyter_job.slurm`:
#!/bin/bash
#SBATCH --job-name=jupyter-server
#SBATCH --account=paceship-dsgt_clef2025
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=4G
#SBATCH --time=8:00:00
#SBATCH --qos=inferno
#SBATCH --output=jupyter-%j.out
#SBATCH --error=jupyter-%j.err
# Load modules
module load python/3.11
# Activate environment
source ~/.venvs/research-env/bin/activate
# Get the node hostname
NODE=$(hostname)
echo "Jupyter server running on node: $NODE"
echo "Use this command to connect:"
echo "ssh -L 8888:$NODE:8888 pace"
# Start Jupyter
jupyter lab --no-browser --ip=0.0.0.0 --port=8888
Submit and monitor the job:
# Submit the job
sbatch jupyter_job.slurm
# Check job status
squeue -u $USER
# View output (contains connection instructions)
cat jupyter-JOBID.out
Port Forwarding for Remote Access
Simple Port Forwarding
# Forward port 8888 from PACE to your local machine
ssh -L 8888:localhost:8888 pace
# If using interactive node
ssh -L 8888:localhost:8888 pace-interactive
# Multiple ports (Jupyter + MLflow + TensorBoard)
ssh -L 8888:localhost:8888 -L 5000:localhost:5000 -L 6006:localhost:6006 pace-interactive
VS Code Integration
If using VS Code with Remote SSH:
- Connect to PACE via Remote SSH
- Open terminal in VS Code
- Start Jupyter:
jupyter lab --no-browser --port=8888
- VS Code will automatically offer to forward the port
- Click the notification or go to Ports tab
Best Practices for Notebook Organization
Project Structure
research-project/
├── notebooks/
│ ├── 01-data-exploration.ipynb
│ ├── 02-preprocessing.ipynb
│ ├── 03-model-training.ipynb
│ ├── 04-evaluation.ipynb
│ └── 99-final-results.ipynb
├── src/
│ ├── __init__.py
│ ├── data/
│ ├── models/
│ └── utils/
├── data/
│ ├── raw/
│ ├── processed/
│ └── external/
├── pyproject.toml
└── README.md
Notebook Naming Conventions
# Use numbered prefixes for workflow order
01-data-exploration.ipynb
02-feature-engineering.ipynb
03-model-training.ipynb
04-evaluation.ipynb
# Use descriptive names with dates for experiments
2025-01-15-bert-fine-tuning.ipynb
2025-01-16-ensemble-methods.ipynb
# Separate exploration from production
exploratory/
├── data-analysis-jan-15.ipynb
└── model-experiments.ipynb
production/
├── final-model-training.ipynb
└── evaluation-metrics.ipynb
VS Code Setup
Installing VS Code
Download and Install
- Download VS Code: Go to Visual Studio Code website
- Choose your platform:
  - Windows: download the `.exe` installer
  - macOS: download the `.dmg` file
  - Linux: download the `.deb` (Ubuntu/Debian) or `.rpm` (Red Hat/Fedora) package
Recommended Extensions
Install these extensions for a complete data science setup:
- Python: Official Python extension with IntelliSense, debugging, and linting
- Ruff: Fast Python linter and formatter
- Jupyter: Native notebook support in VS Code
- Remote - SSH: Connect to remote machines via SSH
Remote SSH Extension Setup
Initial Configuration
Follow the SSH and Git Setup guide to configure your SSH connection to PACE.
- Install the Remote - SSH extension: `ms-vscode-remote.remote-ssh`
- Open the Command Palette: `Ctrl+Shift+P` (Windows/Linux) or `Cmd+Shift+P` (macOS)
- Type: "Remote-SSH: Connect to Host"
- Enter host: use your PACE SSH configuration (e.g., `pace` or `pace-interactive`)
Remote Development Tips
Port Forwarding in VS Code
- Automatic Detection: VS Code detects running services and offers to forward ports
- Manual Forwarding:
  - Open the Command Palette (`Ctrl+Shift+P`)
  - Type "Ports: Focus on Ports View"
  - Click "Forward a Port"
  - Enter the port number (e.g., 8888 for Jupyter)
Configuring Python Environment
Python Interpreter Selection
- Open the Command Palette: `Ctrl+Shift+P`
- Type: "Python: Select Interpreter"
- Choose from:
  - System Python
  - Virtual environments
Working with Jupyter Notebooks in VS Code
Native Jupyter Support
VS Code provides native Jupyter notebook support:
- Open `.ipynb` files directly in VS Code
- Create new notebooks: `Ctrl+Shift+P` → "Jupyter: Create New Jupyter Notebook"
- Select a kernel: click the kernel name in the top-right corner
As long as Jupyter is installed in your Python environment (ideally a virtual environment), you can run notebooks seamlessly.
Jupyter Server Configuration
Connect to Remote Jupyter Server
- Start Jupyter on PACE:
ssh pace-interactive
jupyter lab --no-browser --ip=0.0.0.0 --port=8888
- Connect VS Code:
  - Open the Command Palette (`Ctrl+Shift+P`)
  - Type "Jupyter: Specify Jupyter Server for Connections"
  - Enter the server URL: `http://localhost:8888`
  - Enter the token from the Jupyter output
PACE Setup
Work in Progress
Getting Access to PACE
Account Types and Limits
- Student Accounts: Free tier with limited compute hours
- Research Allocations: Group allocations with shared compute time
- Storage: Home directory (50GB) + group storage allocation
Connecting to PACE via SSH
SSH Configuration (Recommended)
See the SSH and Git Setup guide for detailed instructions on configuring your SSH connection.
First Login Setup
# After first successful login
ssh pace
# Check your environment
hostname
whoami
pwd
df -h $HOME
# Check available modules
module avail
Understanding the PACE Environment
Cluster Architecture
PACE consists of multiple clusters:
- Phoenix: Primary cluster with modern hardware
  - Login nodes: General access, file management, job submission
  - Compute nodes: CPU and GPU nodes for actual computation
  - Storage: High-performance parallel file systems
- ICE: Specialized cluster for certain workloads
  - Different hardware configurations
  - May have different software availability
Node Types
Login Nodes
- Purpose: File management, job submission, light development
- Limitations:
- No intensive computation (long-running processes are killed after about 30 minutes)
- Shared among all users
- Limited memory and CPU
- Use for: Editing files, submitting jobs, basic testing
Compute Nodes
- CPU Nodes: Various configurations (8-64 cores, 32GB-1TB RAM)
- GPU Nodes: NVIDIA GPUs (V100, A100, RTX series)
- Access via: SLURM job scheduler only
- Use for: Training models, running experiments, intensive computation
Software Environment
Module System
PACE uses environment modules to manage software:
# List available modules
module avail
# Search for specific software
module avail python
module avail cuda
module avail torch
# Load modules
module load python/3.11
module load cuda/11.8
# List loaded modules
module list
# Unload modules
module unload python/3.11
module purge # unload all
# Show module details
module show python/3.11
Resource Allocation System
Quality of Service (QoS) Levels
- inferno: Default queue; higher priority, suitable for long-running jobs
- embers: Low-priority, preemptible jobs with 1 hour of guaranteed runtime
Account Structure
# Check your allocations
pace-quota
File System and Storage
TODO: Add details about file systems, storage options, and best practices for data management.
Basic SLURM Commands
Job Submission
Interactive Jobs
# Request interactive session
salloc --account=paceship-dsgt_clef2025 --nodes=1 --ntasks=1 --cpus-per-task=8 --mem-per-cpu=4G --time=2:00:00 --qos=inferno
# Request GPU node
salloc --account=paceship-dsgt_clef2025 --nodes=1 --ntasks=1 --cpus-per-task=4 --mem-per-cpu=4G --gres=gpu:1 --time=1:00:00 --qos=inferno
# Exit interactive session
exit
Batch Jobs
Create a SLURM script `job.slurm`:
#!/bin/bash
#SBATCH --job-name=my-experiment
#SBATCH --account=paceship-dsgt_clef2025
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=4G
#SBATCH --time=4:00:00
#SBATCH --qos=inferno
#SBATCH --output=job-%j.out
#SBATCH --error=job-%j.err
# Load modules
module load python/3.11
module load cuda/11.8
# Activate environment
source ~/.venvs/research-env/bin/activate
# Run your script
python train_model.py --config configs/bert.yaml
Submit the job:
sbatch job.slurm
Job Management
# Check job queue
squeue -u $USER
# Check all jobs for your account
squeue -A paceship-dsgt_clef2025
# Check job details
scontrol show job JOBID
# Cancel job
scancel JOBID
# Cancel all your jobs
scancel -u $USER
Job Monitoring
# Check running jobs
squeue -u $USER -t RUNNING
# Monitor resource usage of a running job: SSH to the compute node it is
# running on (get the node name from squeue), then inspect usage
ssh <compute-node-hostname>
htop
nvidia-smi # for GPU usage
Common SLURM Parameters
Resource Requests
# CPU jobs
--nodes=1 # Number of nodes
--ntasks=1 # Number of tasks (usually 1 for Python)
--cpus-per-task=8 # CPU cores per task
--mem-per-cpu=4G # Memory per CPU core
--time=4:00:00 # Wall time (HH:MM:SS)
# GPU jobs
--gres=gpu:1 # Request 1 GPU
--gres=gpu:rtx_6000:1 # Request specific GPU type
--gres=gpu:2 # Request 2 GPUs
# Memory options
--mem=32G # Total memory for job
--mem-per-cpu=4G # Memory per CPU core
Job Control
--job-name=my-job # Job name
--output=job-%j.out # Output file (%j = job ID)
--error=job-%j.err # Error file
--mail-type=ALL # Email notifications
[email protected] # Email address
Best Practices
Resource Management
- Start small: Test with short jobs first
- Request only what you need: Don't waste resources
- Use checkpointing: Save progress for long jobs so a resubmitted job can resume (see the sketch below)
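A minimal sketch of what checkpointing might look like in a PyTorch training script; the file path, save frequency, and the `model`/`optimizer` names are illustrative assumptions, not a PACE requirement:

import torch

CHECKPOINT_PATH = "checkpoint.pt"  # assumed location; prefer project/scratch storage in practice

def save_checkpoint(model, optimizer, epoch, path=CHECKPOINT_PATH):
    """Save enough state to resume training after a job times out or is preempted."""
    torch.save(
        {
            "epoch": epoch,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
        },
        path,
    )

def load_checkpoint(model, optimizer, path=CHECKPOINT_PATH):
    """Restore model/optimizer state and return the epoch to resume from."""
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return checkpoint["epoch"] + 1

# Call save_checkpoint(...) every few epochs inside the training loop so a
# resubmitted SLURM job can pick up where the previous one stopped.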
Concepts
Exploratory Data Analysis
Work in Progress
Data Understanding and Profiling
Statistical Analysis and Visualization
Data Quality Assessment
Feature Engineering Techniques
EDA Best Practices for ML/IR
Introduction to Embeddings
An embedding is a technique used to represent high-dimensional data, like text or images, as a fixed-size vector of numbers in a lower-dimensional space. The key idea is that this new representation captures the semantic meaning of the original data, so items with similar meanings will have vectors that are close to each other. This is incredibly useful because it's much easier to work with vectors than with raw text or pixels.
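As a minimal illustration of what "close" means here, a small NumPy sketch of cosine similarity, the most common closeness measure for embeddings (the toy vectors are made up; real embeddings come from a model, as shown in the next section):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 means identical direction; values near 0 mean unrelated vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real embeddings
cat = np.array([0.9, 0.8, 0.1])
kitten = np.array([0.85, 0.75, 0.2])
car = np.array([0.1, 0.2, 0.95])

print(cosine_similarity(cat, kitten))  # high score: semantically similar
print(cosine_similarity(cat, car))     # lower score: semantically different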
Generating and Visualizing Embeddings
You'll typically use a pre-trained model from a library like Sentence-Transformers (built on Hugging Face) to generate embeddings. These models have been trained on vast amounts of data and have learned to create meaningful vector representations.
from sentence_transformers import SentenceTransformer
# Load a pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')
# A list of sentences to embed
sentences = [
"This is an example sentence.",
"Each sentence is converted to a vector.",
"Semantic search is a common application."
]
# Generate embeddings
embeddings = model.encode(sentences)
print(embeddings.shape)
# Expected output: (3, 384), where 3 is the number of sentences
# and 384 is the dimension of the embedding vector.
Once you have your data embedded into a matrix (e.g., an `n_documents x d_dimensions` matrix), it's hard to understand what those numbers mean directly.
The best way to get an intuitive feel is to visualize them.
You can use a dimensionality reduction technique like PCA or a manifold learning algorithm like UMAP or t-SNE to project your high-dimensional vectors down to 2D, which can then be plotted.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
# Assume 'embeddings' is your N x D matrix from the previous step
# Assume 'labels' is an array of integer class labels for each document
# Reduce dimensions to 2D for plotting
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)
# Create a scatter plot
plt.figure(figsize=(10, 8))
scatter = plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], c=labels, cmap='viridis')
plt.title("2D Visualization of Document Embeddings")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
# Derive legend entries from the unique label values
unique_labels = sorted(set(labels))
plt.legend(handles=scatter.legend_elements()[0], labels=[str(l) for l in unique_labels])
plt.show()
Common Applications
Embeddings are not just for visualization; they are the foundation for many powerful techniques used in modern machine learning and information retrieval.
Semantic Search
Instead of matching keywords, semantic search finds documents based on their conceptual meaning. This is done by embedding a search query and then finding the document vectors that are closest to it in the embedding space, typically using cosine similarity. For large-scale search, Approximate Nearest Neighbor (ANN) libraries like Faiss are used to find the "good enough" nearest neighbors very quickly.
import faiss
import numpy as np
# Assume 'doc_embeddings' is your N x D matrix of document embeddings
dimension = doc_embeddings.shape[1]
# 1. Build a Faiss index
#    (IndexFlatL2 uses Euclidean distance; for cosine similarity, L2-normalize the
#    embeddings or use faiss.IndexFlatIP on normalized vectors)
index = faiss.IndexFlatL2(dimension)
index.add(doc_embeddings.astype('float32')) # Faiss requires float32
# 2. Embed a query (reusing the SentenceTransformer 'model' loaded earlier)
query_text = ["Find me news about new technology"]
query_embedding = model.encode(query_text).astype('float32')
# 3. Search the index
k = 5 # Number of nearest neighbors to retrieve
distances, indices = index.search(query_embedding, k)
print(f"Top {k} most similar document indices: {indices}")
Transfer Learning and Re-ranking
Embeddings are a form of transfer learning. The knowledge learned by a large foundation model is "transferred" to your task through its vector representations. You can use these vectors as features to train simpler models for tasks like classification or regression.
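For example, a minimal sketch of this idea, assuming you already have the `embeddings` matrix from above, a matching array of integer `labels`, and scikit-learn installed:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 'embeddings' is an N x D matrix of document vectors; 'labels' holds their class ids
X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=42
)

# A linear classifier on top of frozen embeddings is often a strong, cheap baseline
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))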
Re-ranking is a more advanced two-stage search technique.
- Retrieval: Use a fast method (like BM25 keyword search or a Faiss index) to retrieve an initial set of candidate documents (e.g., the top 100).
- Re-ranking: Use a more powerful, but slower, model like a cross-encoder to re-evaluate and re-order just this small set of candidates to get a more accurate final ranking. The cross-encoder takes a (query, document) pair and outputs a relevance score.
from sentence_transformers.cross_encoder import CrossEncoder
# Load a pre-trained cross-encoder
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
# The query and the documents retrieved from the first stage
query = "Find me news about new technology"
retrieved_docs = [
"A new AI chip was announced today.",
"Global stock markets are down.",
"The latest smartphone features a foldable screen."
]
# Create pairs of (query, document)
sentence_pairs = [[query, doc] for doc in retrieved_docs]
# The cross-encoder predicts a relevance score for each pair
scores = cross_encoder.predict(sentence_pairs)
# Sort documents by the new scores
sorted_docs = sorted(zip(scores, retrieved_docs), key=lambda x: x[0], reverse=True)
print("Re-ranked Documents:", sorted_docs)
Why This Matters for Competitions
Understanding and using embeddings is critical for success in many competitions. They allow you to leverage the power of massive foundation models efficiently. Whether you're building a search system, a classifier, or a recommendation engine, being able to generate, visualize, and apply embeddings will give you a significant advantage. Be sure to practice with these tools, but also be mindful that generating embeddings for very large corpora can be computationally expensive.
Information Retrieval Basics
Work in Progress
Core IR Concepts and Terminology
Indexing and Document Representation
Ranking and Scoring Functions
Evaluation Metrics (MAP, NDCG, Precision/Recall)
Modern IR with Neural Networks
Large Language Models
Work in Progress
Understanding Transformer Architecture
Pre-training vs Fine-tuning
Hugging Face Transformers Library
Parameter-Efficient Fine-Tuning (PEFT)
LLM Inference and Deployment Considerations
PACE Containers
Work in Progress
Introduction to Apptainer/Singularity
Building Custom Containers
Running Containers on PACE
GPU Support in Containers
Container Best Practices for Reproducibility
Workflow Management
Work in Progress
Experiment Tracking with MLflow and WandB
Version Control for Data Science
SLURM Job Management and Monitoring
Reproducible Environments
Pipeline Orchestration Tools
Cookbook
This section contains a collection of guides for common tasks.