DS@GT ARC Seminar Notes
Repository: https://github.com/dsgt-arc/arc-seminar-notes
Introduction
DS@GT ARC Seminar Notes
Note
This site is a work-in-progress and is actively being developed. Please check back frequently for updates.
This seminar prepares students for original research contributions at evaluation-focused venues like CLEF. In a dual-track format, participants will first critically analyze the AI/ML/IR applied research landscape (Kaggle, KDD, NeurIPS, TREC, CLEF) to identify viable shared tasks, foster team formation, and initiate research proposals. Simultaneously, a hands-on track develops essential skills for using Georgia Tech's PACE supercomputing cluster, including SLURM, Apptainer, and building ML/IR pipelines with PyTorch and Hugging Face.
See the DS@GT ARC homepage for more information about the club.
Syllabus
Description
This seminar prepares students for original research contributions at evaluation-focused venues like CLEF. In a dual-track format, participants will first critically analyze the AI/ML/IR applied research landscape (Kaggle, KDD, NeurIPS, TREC, CLEF) to identify viable shared tasks, foster team formation, and initiate research proposals. Simultaneously, a hands-on track develops essential skills for using Georgia Tech's PACE supercomputing cluster, including SLURM, Apptainer, and building ML/IR pipelines with PyTorch and Hugging Face.
- Track A (Applied Research Competition Discussion): Analyze various research competition platforms (e.g., Kaggle, CLEF, KDD Cup, NeurIPS Competitions, TREC), dissect methodologies and evaluation strategies from competition papers and reports, identify research gaps, and culminate in the formation of teams and development of a preliminary proposal for participation in a CLEF 2026 shared task.
- Track B (PACE & ML/IR Pipeline Development): Gain practical experience using the Georgia Tech PACE HPC environment (OnDemand, SLURM, Apptainer), build and evaluate a core ML/IR pipeline involving embeddings, transfer learning, fine-tuning, and semantic search, and utilize essential Python libraries (e.g., PyTorch, Hugging Face, scikit-learn, FAISS) and workflow tools.
The seminar culminates in students being equipped to propose and execute original research for competitive academic workshops.
Learning Outcomes
Upon successful completion of this seminar, students will be able to:
- Critically Evaluate Research and Platforms: Analyze the structure, methodologies, and evaluation paradigms of applied AI/ML platforms (e.g., CLEF, Kaggle, NeurIPS), and critique diverse research outputs to identify strengths, weaknesses, and research opportunities.
- Design Research Proposals: Develop structured research proposals for shared tasks, including problem framing, methodology, evaluation plans, and collaboration strategies.
- Apply Core ML/IR Concepts: Understand key components of ML and information retrieval pipelines, such as embeddings, transfer learning, and evaluation metrics like MAP/NDCG.
- Leverage HPC and Engineering Tools: Utilize the PACE HPC environment and foundational tools (SLURM, Apptainer, MLflow/WandB) for efficient experimentation and reproducibility.
- Collaborate and Communicate Effectively: Use Git/GitHub for project collaboration and present technical findings clearly in both written and oral formats.
Prerequisites and Expectations
- Background: Familiarity with machine learning and information retrieval concepts, typically gained from courses such as Machine Learning, Deep Learning, Natural Language Processing, or Computer Vision.
- Programming: Intermediate proficiency in Python programming is required, including experience with libraries like NumPy, Pandas, and ideally some exposure to PyTorch or TensorFlow. Familiarity with basic command-line operations in a Linux environment is expected for PACE usage.
- Time Commitment: This is a seminar-style course requiring active participation. Expect to spend approximately 3-4 hours per week, including a 1-hour synchronous online meeting and 2-3 hours of asynchronous hands-on work, readings, and assignments. This aligns with typical OMSCS course expectations.
Required Materials & Technology
- Hardware: A reliable laptop or desktop computer meeting Georgia Tech's minimum requirements for online programs. Access to a stable, high-speed internet connection.
- Software:
- Modern web browser (Chrome, Firefox recommended).
- VSCode with Remote SSH extension.
- Access to Georgia Tech's PACE HPC environment (provided).
- GitHub account
- Readings: Course materials will primarily consist of online documentation, research papers (provided or accessed via GT Library), competition descriptions, and solution write-ups. No mandatory textbook purchase is required.
Schedule
Track A: Applied Research Competition Discussion
Date | Week # | Track A Topic | Deliverables / Events |
---|---|---|---|
2025-08-18 | 1 | The "Why" of Applied Research & Initial Exploration | |
2025-08-25 | 2 | Deeper Dive into Research Platforms & Task Analysis | |
2025-09-01 | 3 | Analyzing Research Papers from Competitions | Labor Day |
2025-09-08 | 4 | Kaggle Solution Deconstruction & Strategy | CLEF Madrid |
2025-09-15 | 5 | CLEF & Academic Competition Methodology Review | |
2025-09-22 | 6 | Identifying Research Gaps & Opportunities Across Platforms | |
2025-09-29 | 7 | Initial CLEF Task Brainstorming & Focus | |
2025-10-06 | 8 | Fall Break | |
2025-10-13 | 9 | CLEF Task Shortlisting & Focused Literature Reviewing | |
2025-10-20 | 10 | CLEF Team Formation Dynamics & Roles | |
2025-10-27 | 11 | CLEF Proposal Structuring & Methodology Brainstorming | |
2025-11-03 | 12 | CLEF Proposal Peer Review Workshop & Refinement | |
2025-11-10 | 13 | CLEF Proposal Intensive & Finalization | |
2025-11-17 | 14 | CLEF Team Proposal Presentations | |
2025-11-24 | 15 | Thanksgiving | |
2025-12-01 | 16 | ARC spring team formation | |
2025-12-08 | 17 | End of term |
Track B: PACE & ML/IR Pipeline Development
Date | Week # | Track B Topic | Deliverables / Events |
---|---|---|---|
2025-08-18 | 1 | Git/GitHub & Initial PACE Onboarding | |
2025-08-25 | 2 | VSCode Remote to PACE & Scientific Python Essentials | |
2025-09-01 | 3 | Embeddings/Representations & Introduction to SLURM | Labor Day |
2025-09-08 | 4 | EDA on Embeddings & Advanced SLURM Usage | CLEF Madrid |
2025-09-15 | 5 | ||
2025-09-22 | 6 | Transfer Learning with PyTorch & Hugging Face Trainer | |
2025-09-29 | 7 | ||
2025-10-06 | 8 | Fall Break | |
2025-10-13 | 9 | Parameter-Efficient Fine-Tuning (PEFT) in Practice | |
2025-10-20 | 10 | Semantic Search, IR Metrics (MAP/NDCG), ANN & Reranking | |
2025-10-27 | 11 | Apptainer for Advanced & Multimodal Workloads | |
2025-11-03 | 12 | Experiment Tracking (WandB/MLflow) & Workflow Management | |
2025-11-10 | 13 | HPC Job Monitoring (GPU), Debugging & PyTorch Memory | |
2025-11-17 | 14 | Compiling Module-wise Report & Presentation Preparation | |
2025-11-24 | 15 | Thanksgiving | |
2025-12-01 | 16 | ARC spring team formation | |
2025-12-08 | 17 | End of term |
Competition
An Overview of Competition Research
This section of the seminar focuses on the process of conducting research and is aimed at individuals who have experience with machine learning or data science projects but may be new to formal research. Many students in the OMSCS program possess strong analytical and technical skills from their professional careers that are directly transferable to applied research competitions. This guide is designed to bridge the gap between that practical experience and the structured process of academic and competition-based research.
The following chapters will walk through the essential soft skills required to navigate the research landscape. We will begin by outlining useful background knowledge before discussing how to choose a research venue, focusing on Kaggle, the CLEF Conference, and workshops associated with conferences like TREC, NTCIR, KDD, and NeurIPS. We will then cover how to perform a literature review to understand the current state of the art for a given problem.
Subsequently, we will detail how to develop a research proposal, a crucial step for anyone looking to lead or recruit for a team. A significant portion is dedicated to team formation, as assembling a group with the right mix of interests, skills, and time commitment is often the most challenging aspect of the process. While this guide will describe the paper writing process, it assumes some familiarity from a paper-heavy course. Please note that a deep dive into the specifics of conducting experiments is considered beyond the scope of this introductory seminar and is reserved for the active competition teams. This chapter provides the framework for identifying where you are on the research frontier and what opportunities are available to you.
Useful Background
This section outlines the skills and experience that are beneficial for contributing effectively to a research competition team. While formal research experience is not a prerequisite, a strong foundation in related areas is essential.
Foundational Experience and Eligibility
Official eligibility extends to all Georgia Tech students (undergraduate, graduate, online) and alumni. Beyond this, we look for individuals who have demonstrated experience with complex projects, either through project-heavy coursework or full-time software engineering roles. The ability to work with large codebases and navigate complex systems is crucial.
Equally important are transferable organizational skills. Experience in roles similar to program management, where you are responsible for scheduling meetings, tracking progress, and communicating requirements and timelines, is highly valuable. These skills are fundamental to the successful coordination of a research team.
Core Technical Skills
A broad set of technical skills underpins success in applied research competitions. While no single person is expected to be an expert in all areas, proficiency in several is expected.
Mathematical Foundations
A working knowledge of certain mathematical concepts is frequently required. An understanding of linear algebra is essential for working with the embedding spaces common in modern machine learning, including concepts like dimensionality reduction. Probability and statistics are critical for designing experiments and determining if the results are statistically significant. A basic understanding of calculus is also beneficial.
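As a concrete toy illustration of both ideas, the sketch below uses NumPy to compute the cosine similarity between two made-up embedding vectors and SciPy to run a paired t-test on hypothetical per-query scores from a baseline and a proposed system; every number here is a placeholder.

import numpy as np
from scipy import stats

# Cosine similarity between two (made-up) embedding vectors.
a = np.array([0.1, 0.7, 0.2])
b = np.array([0.2, 0.6, 0.1])
cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine similarity: {cos_sim:.3f}")

# Paired t-test on per-query scores from two systems (illustrative values).
baseline = np.array([0.61, 0.55, 0.70, 0.48, 0.66])
proposed = np.array([0.64, 0.59, 0.71, 0.50, 0.70])
t_stat, p_value = stats.ttest_rel(proposed, baseline)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")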
Machine Learning and Data Engineering
You should be familiar with fundamental machine learning concepts such as the distinction between classification and regression, and the purpose of data splits. Proficiency with the modern machine learning stack is key, including PyTorch and the Hugging Face ecosystem. It is helpful to understand core concepts behind large language models, such as the Transformer architecture, attention mechanisms, and fine-tuning strategies like parameter-efficient fine-tuning (PEFT). Strong data engineering skills are also highly transferable. This includes the ability to build data pipelines (e.g., converting data from XML to Parquet), parallelize jobs for distributed systems, and work with datasets that are larger than memory.
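For example, a minimal XML-to-Parquet conversion might look like the sketch below. It assumes a flat XML file with one record per element, pandas 1.3+ with lxml for read_xml, and pyarrow for Parquet output; the file names and columns are placeholders.

import pandas as pd

# Read a flat XML file (one record per element) into a DataFrame,
# then write it out as Parquet for efficient columnar access.
df = pd.read_xml("records.xml")                 # requires lxml (or parser="etree")
df.to_parquet("records.parquet", index=False)   # requires pyarrow or fastparquet

# Downstream jobs can read back only the columns they need.
subset = pd.read_parquet("records.parquet", columns=list(df.columns[:2]))
print(subset.head())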
Information Retrieval and Systems
Many competitions involve information retrieval. Experience with search concepts like BM25 and cosine similarity, as well as search systems like Faiss, Anserini, or Elasticsearch, is a significant advantage. General software and systems engineering proficiency is non-negotiable. You must be comfortable with the Linux terminal, version control with Git (and platforms like GitHub/GitLab), and containerization with tools like Docker. The ability to quickly learn and integrate new tools into a workflow is essential.
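As an illustration of the embedding side of this, the sketch below runs an exact cosine-similarity search with Faiss by building an inner-product index over L2-normalized vectors; the random data, dimensionality, and number of neighbors are arbitrary placeholders.

import numpy as np
import faiss

d = 128                                      # embedding dimension (arbitrary)
corpus = np.random.rand(1000, d).astype("float32")
queries = np.random.rand(5, d).astype("float32")

# Normalize so that inner product equals cosine similarity.
faiss.normalize_L2(corpus)
faiss.normalize_L2(queries)

index = faiss.IndexFlatIP(d)                 # exact (brute-force) inner-product index
index.add(corpus)
scores, ids = index.search(queries, 10)      # top-10 neighbors per query
print(ids[0], scores[0])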
Research Methodology
Finally, familiarity with the fundamentals of the research process is beneficial. This includes knowing how to conduct a literature review, how to structure a research proposal, and how to effectively read and analyze academic papers. Resources like the "Mining of Massive Datasets" and "Introduction to Information Retrieval" textbooks, along with tutorials like "the missing semester of your CS education," can help build this foundation.
It is important to note that this group is not intended for individuals undertaking their first major technical project. The expected workload is approximately 150 hours over a semester, equivalent to a 3-unit course. If you do not yet have experience with foundational data analysis tools like Pandas or NumPy, you are encouraged to take a project-heavy course and join the group in a future semester.
Choosing a Venue
Selecting an appropriate venue is a foundational step in the research process. Our group has historically focused on two main types of venues that cater to different goals and interests. The first is Kaggle, a data science competition platform that provides an excellent environment for learning, often with the added incentive of prize money. The second, and the primary focus for our group's publication efforts, is CLEF (the Conference and Labs of the Evaluation Forum). We have a significant publication history at this European conference, with our contributions published as peer-reviewed working notes in the CEUR proceedings.
When selecting a competition, whether at CLEF, Kaggle, or another workshop, your decision should be driven by genuine interest. This interest typically stems from one of two motivations. The first is a passion for a specific domain. For example, a personal interest in a topic like herpetology can be a powerful motivator to contribute to a task like SnakeCLEF. The second is a desire to apply a particular technique. You may want to implement a method learned in a course or a research paper, such as applying network analysis principles to a citation dataset. Choosing a project that aligns with your intrinsic interests is critical for maintaining motivation throughout the semester.
Finally, it is essential to understand the requirements of your chosen venue and to be realistic about your own commitment. The bar for publication at CLEF requires submitting a functional system and a well-written, reproducible paper that details an interesting aspect of your work. Other academic venues may have a much higher bar for novelty, while a Kaggle competition might only require functional code. This commitment is not trivial; historically, only 50-75% of members who begin a project see it through to completion. Before committing to a team and a venue, ensure you have both a legitimate interest in the topic and the time required to contribute meaningfully.
Kaggle
Kaggle has established itself as a central platform for the global data science and machine learning community, providing a multifaceted environment for learning, competition, and collaboration. Acquired by Google in 2017, it has grown to host over 15 million registered users from 194 countries as of October 2023.
Platform Structure
Kaggle's ecosystem is built around several key components:
- Competitions: This is arguably Kaggle's most well-known feature. Competitions are diverse, ranging from "Featured" competitions, which are high-profile challenges often sponsored by companies with substantial monetary prizes, to "Research" competitions that focus on novel scientific problems. "Playground" competitions offer a less intense environment for learning and experimentation, often with swag as prizes, while "Community" competitions are created by users themselves. A significant development is the prevalence of "Code Competitions," where participants submit their solutions as code within Kaggle Notebooks, ensuring a consistent hardware environment and often restricting external data access or internet connectivity during execution to promote fairness and reproducibility. Some competitions adopt a "Two-Stage" structure, where an initial phase is followed by a second phase with a new test dataset, adding a layer of complexity and testing model robustness. Examples of ongoing competitions include the "ARC Prize 2025" (Featured, $725,000 prize) and "BirdCLEF+ 2025" (Research, $50,000 prize).
- Datasets: Kaggle hosts a vast repository of datasets, contributed by both competition organizers and the wider user community. This resource is invaluable for independent projects, research, and learning beyond the scope of formal competitions.
- Notebooks (formerly Kernels): This web-based data science environment allows users to write and execute code (primarily Python and R), share their analyses, and collaborate on projects. Notebooks are integral to "Code Competitions" and facilitate learning from publicly shared code, enhancing reproducibility.
- Discussion Forums: Each competition, dataset, and notebook has associated discussion forums, which are vibrant spaces for asking questions, sharing insights, providing feedback, and fostering collaboration among users. Kaggle maintains community guidelines to ensure these interactions remain productive and respectful.
- Learn: Kaggle provides a curated set of tutorials and courses covering fundamental machine learning concepts and practical data science skills, serving as an accessible entry point for beginners.
The evolution of competition formats on Kaggle, particularly the rise of Code Competitions, reflects broader trends in the ML field. As models become more complex and resource-intensive, and as the community places greater emphasis on reproducibility and the entire analytical pipeline, these formats provide a more controlled and equitable environment. This contrasts with earlier "Simple Competitions" that relied solely on the upload of prediction files.
Common Task Types
Kaggle competitions span a wide array of machine learning tasks. These include, but are not limited to:
- Predictive Modeling: Classification and regression tasks are foundational, such as predicting survival on the Titanic (a classic beginner competition) or forecasting house prices.
- Computer Vision: Tasks like image classification, object detection, and facial keypoints detection are common. The "Image Matching Challenge 2025" aims to reconstruct 3D scenes from image collections.
- Natural Language Processing (NLP): Sentiment analysis, text classification, and question answering appear regularly.
- Time Series Forecasting: Predicting future values based on historical data, exemplified by the "Jane Street Real-Time Market Data Forecasting" competition.
- Specialized & Research-Oriented Tasks: Kaggle also hosts challenges on more domain-specific or frontier problems, such as predicting RNA 3D folding, isolated sign language recognition, developing physics-guided ML models for geophysical waveform inversion, or even building AI to generate SVG images using Large Language Models (LLMs).
The Kaggle Community
The Kaggle community is a defining feature of the platform. Its large and global user base actively engages in collaboration through team formation, public code sharing in Notebooks, and extensive discussions in the forums. A key element fostering this engagement is the Progression System. Users can advance through five tiers—Novice, Contributor, Expert, Master, and Grandmaster—based on their achievements in Competitions, Datasets, Notebooks, and Discussions. Performance in competitions is recognized with Bronze, Silver, and Gold medals, awarded based on a team's rank relative to the number of participants. This gamified system incentivizes active participation, continuous learning, and the production of high-quality work, contributing to a dynamic and competitive ecosystem. Beyond the competitive aspect, Kaggle serves as a platform for individuals to showcase their skills to potential employers and network with other professionals, potentially leading to career opportunities.
The open nature of Kaggle, with its emphasis on shared notebooks and active discussion forums, has a profound impact on how ML solutions are developed and disseminated. It democratizes access to state-of-the-art techniques, allowing individuals worldwide to learn from top performers and rapidly iterate on existing solutions. This can accelerate learning and lead to a convergence on effective approaches for particular problem types. However, this same openness can sometimes contribute to a degree of homogenization in solutions, where popular architectures or pre-processing pipelines become dominant. The progression system, while motivating, can also incentivize building upon successful public solutions. Consequently, achieving true innovation on Kaggle often requires not only mastering established best practices but also identifying unique insights or developing novel approaches that diverge from the prevailing high-scoring strategies.
Furthermore, Kaggle's role has expanded beyond being just a competition platform. Google's acquisition and the hosting of "Recruiting Competitions" underscore its significance as a major talent incubator. The types of "Featured" competitions, frequently sponsored by leading technology companies and other organizations, often reflect pressing industry challenges and the kinds of complex problems for which businesses are actively seeking ML-driven solutions. Success in these high-stakes competitions can directly enhance career prospects and visibility within the field.
Analyzing a Kaggle Competition
To effectively participate in a Kaggle competition, a thorough understanding of its components is essential. Key elements typically found on a competition page (e.g., the "BirdCLEF+ 2025" competition) include:
- Overview/Description: This section outlines the problem statement, its real-world context or motivation, and the specific goals of the competition.
- Data: Provides details about the dataset(s) used, including their structure, format, how to download them, and often, exploratory data analysis notebooks.
- Evaluation: Crucially, this section specifies the metric used to score submissions and rank participants, along with the required submission file format.
- Rules: Outlines eligibility criteria, rules for team formation and mergers, limits on daily submissions, policies regarding the use of external data, and any specific constraints for code competitions (e.g., time limits, hardware).
- Leaderboard: Displays the rankings of participants based on their submission scores, often split into public (based on a subset of the test data) and private (based on the full test data, revealed at the end) leaderboards.
- Discussion Forum: The central hub for participants to ask questions, share insights, discuss approaches, and report issues.
- Notebooks: A collection of public notebooks shared by organizers and participants, which can include starter code, data exploration, and example solutions.
CLEF
The Conference and Labs of the Evaluation Forum (CLEF) is a prominent European-based initiative dedicated to promoting research, innovation, and development in information access systems. A distinguishing feature of CLEF is its strong emphasis on multilinguality and multimodality, addressing the complexities of accessing and processing information across different languages and data types (e.g., text, image, video). CLEF also places significant importance on the advancement of evaluation methodologies, seeking to refine and extend traditional evaluation paradigms like the Cranfield model and explore innovative uses of experimental data.
CLEF's origins trace back to a track within the Text REtrieval Conference (TREC) focused on cross-language information retrieval (IR) for European languages. It became an independent initiative to expand coverage to more languages and a broader array of IR issues. This evolution reflects a broadening understanding of "information access," moving beyond traditional text retrieval to encompass diverse data types and user needs, as evidenced by its inclusion of tasks like species identification from media or health information access, and the integration of the INEX workshop on structured text retrieval.
CLEF's research agenda covers a wide spectrum of information retrieval challenges. While initially focused on monolingual, bilingual, and multilingual text retrieval, its scope has expanded considerably. The initiative now supports investigations into areas such as:
- Information retrieval for various European and non-European languages.
- Multimodal information access, integrating data from different sources like text, images, and audio.
- Specific application domains such as cultural heritage, digital libraries, social media, legal documents, and biomedical information.
- Evaluation of interactive and conversational information retrieval systems.
- Analysis of IR test collections and evaluation measures, including reproducibility and replicability issues.
Labs
The core operational structure of CLEF revolves around its Labs. These are essentially evaluation campaigns or tracks where specific research challenges are proposed, and participating research groups from academia and industry develop and test systems to address them. Each lab typically focuses on a particular theme or set of tasks. For instance, CLEF 2024 hosted a variety of labs, including:
- BioASQ: Large-scale biomedical semantic indexing and question answering.
- CheckThat!: Tasks related to fact-checking, such as check-worthiness estimation, subjectivity detection, and persuasion technique identification.
- ImageCLEF: A multimodal challenge involving image annotation, retrieval, and analysis across various domains (e.g., medical, social media).
- LifeCLEF: Species identification and prediction using various data types (e.g., images, audio), often with a conservation focus.
- PAN: Lab on stylometry, authorship analysis, and digital text forensics.
- Touché: Focus on argumentation systems, including argument retrieval and generation.
These labs provide the necessary infrastructure for system testing, tuning, and evaluation. A key contribution of many labs is the creation of reusable test collections (datasets and ground truth) that benefit the wider research community. Lab organizers define the tasks, provide the data, and specify the evaluation protocols. Participants then submit experimental "runs" (system outputs) and often follow up with "Working Notes" that detail their methodologies and findings. This lab-centric structure fosters the development of highly specialized research communities around specific IR challenges. The sustained focus on "evaluation methodologies", including experiments with novel review processes like result-less review (where papers are initially assessed on methodology and research questions before results are presented), indicates that CLEF actively contributes to shaping how research is conducted and assessed within these specialized domains, representing a meta-level contribution to the research landscape.
Working Notes
A significant output of participation in CLEF labs is the "Working Notes." These are technical reports authored by participating teams, describing the systems they developed and the experiments they conducted for the lab tasks. Key characteristics include:
- Publication: CLEF Working Notes are published as part of the CEUR Workshop Proceedings (CEUR-WS.org), making them citable and accessible to the research community. A list of past volumes is available on the CLEF Initiative website.
- Content: According to the guidelines, working notes should typically cover the tasks performed, main experimental objectives, the approach(es) used (including progress beyond the state-of-the-art), resources employed (datasets, tools), results obtained, a thorough analysis of these results, and perspectives for future work.
- Format and Submission: There is generally no strict upper page limit, though conciseness and effectiveness are encouraged. Submission is handled electronically, often via EasyChair, and specific formatting templates (e.g., CEUR-WS templates) are provided.
- Purpose: Working notes serve as a means for rapid dissemination of detailed experimental findings, methodologies, and even negative results, which are valuable for the scientific community. They represent a less formal but often more detailed account of experimental work compared to traditional conference papers. Some labs, even those run on external platforms like Kaggle but affiliated with CLEF (e.g., BirdCLEF+), encourage participants to submit working notes to the main CLEF conference, sometimes with awards for the best contributions. This system acts as a bridge, facilitating the quick sharing of experimental insights while still allowing for more polished, archival publications later.
Analyzing a Lab Task
When approaching a CLEF lab task, the analysis should focus on:
- Specific Research Questions: Identify the precise questions the lab and its constituent tasks aim to answer.
- Data Characteristics: Understand the nature of the provided datasets, paying close attention to multilingual aspects, multimodal features, and any specific annotations.
- Evaluation Methodology: Scrutinize the evaluation metrics and protocols, as these are often carefully designed by experts to assess particular system capabilities or nuances of the problem.
An illustrative example is the CLEF 2024 CheckThat! Lab, Task 1: Check-worthiness estimation.
- Main Problem: The core objective is to determine if a given piece of text—sourced from diverse genres like tweets or political debates—is "check-worthy." This involves assessing whether the text contains a verifiable factual claim and evaluating its potential for causing harm if the claim is false, thereby prioritizing it for fact-checking. For the 2024 edition, this task was offered in Arabic, Dutch, and English.
- Evaluation Metric: Performance in this task is measured using the macro-averaged F1-score. This metric calculates the F1-score (harmonic mean of precision and recall) for each class (check-worthy and not check-worthy) independently and then averages these scores. This approach ensures that performance on potentially less frequent but critical check-worthy claims is given due weight.
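As a quick toy illustration (not drawn from the lab data) of why the macro average matters on imbalanced classes, compare accuracy against macro F1 with scikit-learn:

from sklearn.metrics import accuracy_score, f1_score

# Toy labels: 1 = check-worthy, 0 = not check-worthy (illustrative only).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

print("accuracy:", accuracy_score(y_true, y_pred))             # 0.90
print("macro F1:", f1_score(y_true, y_pred, average="macro"))  # ~0.80

The missed check-worthy claim barely moves accuracy but noticeably lowers the macro F1, which is exactly the behavior the metric is chosen to penalize.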
Other Venues
Beyond Kaggle and CLEF, several other platforms and major academic conferences play vital roles in the applied AI/ML research landscape.
TREC (Text REtrieval Conference)
- Focus: TREC is a long-standing series of workshops, initiated in 1992 and run by the U.S. National Institute of Standards and Technology (NIST). Its core mission is to support and encourage research within the information retrieval (IR) community by providing the necessary infrastructure for large-scale evaluation of text retrieval methodologies and to accelerate the transfer of technology from research labs to commercial products.
- Characteristics: TREC is organized into "tracks," each focusing on a particular subproblem or variant of the retrieval task. Over the years, tracks have covered diverse areas such as ad-hoc retrieval, question answering, cross-language IR, genomics IR, legal IR, and web search. For TREC 2024, continuing tracks included AToMiC (Authoring Tools for Multimedia Content) and NeuCLIR (Neural Cross-Language Information Retrieval), while new tracks included RAG (Retrieval-Augmented Generation). NIST typically provides large text collections and a set of questions (topics). Participating groups run their retrieval systems on this data and submit their results (e.g., ranked lists of documents). NIST then performs uniform scoring, often using evaluation techniques like pooling, where relevance judgments are made on a subset of documents retrieved by multiple systems.
- Outputs: Participants submit ranked lists of documents or other outputs specific to the track's task. The results, methodologies, and experiences are shared and discussed at the annual TREC workshop, and overview papers for each track are published in the TREC proceedings.
- TREC has been instrumental in advancing IR research by creating valuable, large-scale test collections and fostering a collaborative evaluation environment. Its track-based structure allows for focused research on a wide array of IR challenges. The introduction of the RAG track in 2024 is a clear indication of TREC's responsiveness to current trends in AI, particularly the integration of LLMs with retrieval systems.
KDD Cup
- Focus: The KDD Cup is the premier annual competition in data mining and knowledge discovery, organized by the Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining (ACM SIGKDD). Its primary aim is to stimulate research and development in these fields by presenting challenging problems derived from diverse domains.
- Characteristics: KDD Cup challenges are renowned for often involving large, complex datasets and tasks that push the boundaries of current data mining techniques. Historical examples include the 2010 KDD Cup, which focused on predicting student answer correctness using one of the largest educational datasets at the time. More recently, the KDD Cup 2024 featured the "Open Academic Graph Challenge (OAG-Challenge)" for academic graph mining and the "Multi-Task Online Shopping Challenge for LLMs" hosted by Amazon.
- Outputs: Participants develop solutions to the posed problems, and the results and winning approaches are typically presented at a dedicated workshop during the annual KDD conference.
- The KDD Cup holds significant prestige and its challenges often set new directions in data mining research. The 2024 LLM challenge for online shopping, for instance, highlights the platform's alignment with contemporary advancements in AI.
NeurIPS Competitions
- Focus: Hosted as part of the Neural Information Processing Systems (NeurIPS) conference, one of the top-tier venues for machine learning research, these competitions aim to advance modern AI and ML algorithms. There is a strong encouragement for proposals that address clear scientific questions and have a positive societal impact, particularly those leveraging AI to support disadvantaged communities or to advance other scientific, technological, or business domains relevant to the NeurIPS community.
- Characteristics: NeurIPS features a dedicated Competition Track, with each accepted competition typically having an associated workshop where results are presented and discussed by organizers and participants. The tasks are often novel, cutting-edge, and interdisciplinary. Examples include the 2024 challenge on predicting high-resolution rain radar movies from multi-band satellite sensors, requiring data fusion and video frame prediction, and past competitions on causal structure learning, multi-agent reinforcement learning (e.g., the "Melting Pot Contest"), and foundation model prompting for medical image classification.
- Outputs: Competition results are presented at the NeurIPS workshops. Organizers and participants also have the option to submit post-competition analysis papers to the NeurIPS Datasets and Benchmarks (D&B) track in the subsequent year.
- NeurIPS competitions are situated at the forefront of ML research, frequently exploring emerging areas and placing a strong emphasis on scientific rigor, methodological innovation, and potential societal benefits.
CVPR/ICCV Challenges
- Focus: The Conference on Computer Vision and Pattern Recognition (CVPR) and the International Conference on Computer Vision (ICCV) are the premier international conferences in the field of computer vision. Both conferences host a multitude of workshops, many of which include associated challenges.
- Characteristics: These challenges cover an extensive range of computer vision tasks. Examples from CVPR 2024 workshops include challenges on 3D scene understanding (e.g., the ScanNet++ Novel View Synthesis and 3D Semantic Understanding Challenge), efficient large vision models, human modeling and motion generation, multimodal learning, and various application-driven challenges in domains like agriculture (Agriculture-Vision), sports (CVsports), retail (RetailVision), autonomous driving (WAD workshops often feature multiple challenges, e.g., End-To-End Driving at Scale, Occupancy and Flow), and medical imaging. These challenges typically involve large-scale, highly specialized image or video datasets.
- Outputs: Participants submit their solutions, which are evaluated based on task-specific metrics. Winners and notable solutions are often announced and presented at the corresponding workshops, and results may be summarized in workshop proceedings or overview papers.
- Challenges at CVPR and ICCV are central to driving progress in computer vision, pushing the state-of-the-art in specific sub-fields, and providing crucial benchmarks for new algorithms and techniques. The sheer breadth of topics covered in the CVPR 2024 workshop list attests to the dynamism and scope of research in this area.
Our Research Process
Literature Review
How to conduct a literature review.
Work in Progress
Research Proposal
Proposing a Research Project
Purpose of a Proposal
A research proposal is the initial checkpoint for any competition project within DS@GT ARC. Its purpose is to demonstrate that the project is well-conceived and warrants the allocation of time and resources. A completed proposal is required for mentors to approve enrollment for credit in CS8903. It also serves as a foundational document for recruiting team members and establishing a clear plan for the semester. A standard proposal should be approximately two pages in length.
Required Content
A proposal must contain several key sections. It should begin with a concise overview of the competition, including the organizing body, the primary task, and all relevant deadlines. This is followed by the project motivation, which states the rationale for selecting this problem, such as its potential impact or technical novelty. The document must also describe the provided dataset, detailing its size, format, and structure, and specify the official metric used for evaluation. It is essential to summarize any preliminary research, including baselines or prior work, to provide context for the proposed approach. The core of the proposal is the detailed technical methodology, which should cite foundational research papers or software libraries. Finally, the proposal must include a high-level project timeline with monthly milestones for key phases like data processing, model development, final submission, and report writing.
Feasibility Assessment
This section evaluates the practical viability of the project. It should include an analysis of the project's feasibility relative to the team's current skills, available computational resources, and the competition timeline. It is also necessary to identify potential risks, such as data quality issues or computational constraints, and propose a clear mitigation strategy for each. The assessment should conclude by acknowledging any areas where the team will need to acquire new knowledge or seek mentorship.
Proposal Utilization
Once approved, the proposal serves several functions throughout the project lifecycle. A well-structured proposal enables prospective teammates to understand the project's scope and objectives, which facilitates recruitment. It also acts as a roadmap for project management and should be referenced during weekly updates, reviews, and the preparation of the final report. As a model for balancing data preparation, modeling goals, and a timeline, teams should consult the GeoLifeCLEF 2024 proposal located in the group repository. The approval of the proposal marks the official commencement of the project.
Research Proposal Example: GeoLifeCLEF 2024
CS8903: Special Problems Project Proposal
Name: Anthony Miyaguchi <[email protected]>
Student ID: amiyaguchi3 Date: 2023-11-08
- Main Proposal Idea: Lead a DS@GT team on the GeoLifeCLEF 2024 challenge and submit a working note paper at the CLEF 2024 conference.
- The GeoLifeCLEF 2023 Dataset to evaluate plant species distribution models at high spatial resolution across Europe
- SPECIAL PROBLEMS (8903) PERMIT
Spatio-temporal species distribution estimation for GeoLifeCLEF 2024 with unsupervised representation learning of remote sensing data
Objective
The objective of the special problems is to solve the GeoLifeCLEF 2024 challenge and publish a working notes paper to the CLEF 2024 conference detailing the implemented system, in collaboration with student peers at Georgia Tech. The resulting scope of work is estimated at 3 credit hours, or 150 hours of work, by the primary author.
Background and Motivation
CLEF is the Conference and Labs of the Evaluation Forum (originally the Cross-Language Evaluation Forum), an information retrieval conference with a heavy emphasis on experimentation through shared tasks. GeoLifeCLEF is a challenge hosted by the LifeCLEF lab within CLEF.
GeoLifeCLEF combines five million heterogeneous presence-only records and six thousand exhaustive presence-absence surveys collected from 2017 to 2021. Models are trained with environmental data like 10-meter resolution RGB and Near-Infra-Red satellite images and climatic variables.
Data Preprocessing
We transform domain-specific geospatial rasters (GeoTIFF) into a format optimized for distributed, parallel data access patterns (Parquet). We convert an area of interest (AOI) into a regular lattice of square tiles and store relevant features cropped by the bounding box of its tile. We store all data in a Parquet dataset to load in bulk to Spark or Torch.
We create two development datasets with a maximum partition size of 1GB. The first is a subset of the data that covers a small geographic area encompassing a city, forest, and mountain. The second is a label dataset that contains the minimum features for density estimation, e.g., latitude, longitude, date, and positive indicator of species.
Modeling
Our system is composed of four models. We use Tile2Vec to embed geo-rasters and a linear operator estimator to embed high-dimensional time series. These models aim to learn a low-dimensional representation of the data that preserves certain geometric properties, such as the triangle inequality. We fit an ordinal regression to learn the relative frequency of biodiversity across a regular lattice of features. We do this by converting positive examples into ranked lists generated by nearest-neighbor labels in feature space and fitting a learning-to-rank model. Finally, we learn a generative model of the data to generate biodiversity rasters and images using priors from the ordinal regression.
Our baseline model is a species model derived from geolocation and date. We measure improvement upon the baseline by adding learned geo and time series embeddings via ablation study.
End-to-end Task
We submit the results of our system, intending to reach first place on the leaderboard. We intend to see significant improvements between baseline models and more complex models. In addition to submitting to the leaderboard, we generate detailed rasters/images of various species for visualization.
Timeline
LifeCLEF 2024 | ImageCLEF / LifeCLEF - Multimedia Retrieval in CLEF
- Jan 2024: registration opens for all LifeCLEF challenges
- Jan-March 2024: training and test data release
- 6 May 2024: deadline for submission of runs by participants
- 13 May 2024: release of processed results by the task organizers
- 31 May 2024: deadline for submission of working note papers by participants [CEUR-WS proceedings]
- 24 June 2024: notification of acceptance of participant's working note papers [CEUR-WS proceedings]
- 8 July 2024: camera ready copy of participant's working note papers and extended lab overviews by organizers
- 9-12 Sept 2024: CLEF 2024 Grenoble - France
Date | Week | Task/Topic | Deliverable/Events |
---|---|---|---|
2024-01-08 | 1 | Engineering - Download training and testing dataset from 2023/2024 | Competition start |
2024-01-15 | 2 | Exploratory Data Analysis | |
2024-01-22 | 3 | Engineering - Schema and Parquet | |
2024-01-29 | 4 | Engineering - Schema and Parquet | Parquet datasets in GCS, dev set of data (<1GB single partition) available for exploratory modeling |
2024-02-05 | 5 | Modeling - Learning to Rank | |
2024-02-12 | 6 | Modeling - Gaussian Mixture Models and Stochastic Variational Inference | |
2024-02-19 | 7 | Modeling - Tile2Vec | |
2024-02-26 | 8 | Modeling - Tile2Vec | |
2024-03-04 | 9 | Modeling - Tile2Vec | |
2024-03-11 | 10 | Modeling - Koopman Operator, SVD, Dynamic Mode Decomposition | Working notes of dataset and model description |
2024-03-18 | 11 | Spring Break | |
2024-03-25 | 12 | Engineering - Embedding cache, indexing, and search; Modeling - Ordinal regression | |
2024-04-01 | 13 | Engineering - Model pipeline | First submission to the competition, screenshot of leaderboard |
2024-04-08 | 14 | Engineering - Model pipeline | |
2024-04-15 | 15 | Ablation Study, Hyperparameter Tuning | |
2024-04-22 | 16 | Ablation Study, Hyperparameter Tuning | |
2024-04-29 | 17 | Finals, Working notes | Submission deadline for competition, first draft of working notes, screenshot of leaderboard, parquet dataset in GCS |
2024-05-06 | 18 | Summer, Working notes revision |
Infrastructure
Code is hosted on GitHub at https://github.com/dsgt-kaggle-clef/geolifeclef-2024. Cloud compute and storage are on Google Cloud Platform under a personal billing account.
Collaboration and Supervision
This project stems from collaboration within the Data Science at Georgia Tech (DS@GT) student group. Prior submissions from the DS@GT team to the CLEF conference have won $5,000 worth of prizes across two best working note competitions.
As the DS@GT GeoLifeCLEF 2024 team lead, I would be collaborating with two fellow OMSCS students. The time-commitment estimate (3 credit hours) is for independent work that I carry out in the context of shared responsibilities in the team.
The supervising faculty member for the project is responsible for administration, such as registration and grading, with no expectation of advising the research process (although pointers are greatly appreciated). The supervisor will grade based on an article for publication that is in a state ready for early review.
References
Botella, C., Deneu, B., Marcos, D., Servajean, M., Estopinan, J., Larcher, T., ... & Joly, A. (2023). The GeoLifeCLEF 2023 Dataset to evaluate plant species distribution models at high spatial resolution across Europe. arXiv preprint arXiv:2308.05121., https://arxiv.org/abs/2308.05121
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., & Ermon, S. (2018). Tile2Vec: unsupervised representation learning for spatially distributed data. arXiv., https://arxiv.org/abs/1805.02855
Brunton, S. L., & Kutz, J. N. (2019). Data-driven science and engineering: Machine learning, dynamical systems, and control. Cambridge University Press., https://www.cambridge.org/core/books/datadriven-science-and-engineering/77D52B171B60A496EAFE4DB662ADC36E
Cao, Z., Qin, T., Liu, T. Y., Tsai, M. F., & Li, H. (2007, June). Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th international conference on Machine learning (pp. 129-136)., https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2007-40.pdf
Hoffman, M. D., Blei, D. M., Wang, C., & Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research., https://jmlr.org/papers/volume14/hoffman13a/hoffman13a.pdf
Forming Teams
Finding a Team
Before seeking a team, it is essential to have a clear idea of your research interests. A shared interest in the competition's subject matter is the foundation of a motivated team. For example, you should first identify a competition or dataset that genuinely interests you, such as BirdCLEF for avian bioacoustics or CheckThat! for fact-checking language models.
You can find teammates through the primary ARC Interest Group, within your academic courses (e.g., Deep Learning, Machine Learning), or on community platforms like the OMSCS Slack and the Ed Research Board.
Team Structure and Roles
Once a group with a shared research interest is formed, the first step is to establish a clear structure and define roles.
Team Lead
The Team Lead acts as the primary facilitator and organizer. This role is not necessarily the lead researcher but is responsible for the team's operational integrity. Key responsibilities include coordinating regular team meetings at a cadence that works for all members, ensuring all members are aware of competition deadlines and rules, serving as the main liaison between the team and the broader ARC organization, and taking ultimate responsibility for ensuring that final submissions are completed correctly and on time.
Team Member
Team members are the core contributors to the research project. All members are expected to be active participants. Key responsibilities include actively contributing ideas, code, and text for the final paper, allocating a significant amount of time to the project (estimated at 100-150 hours over the semester, roughly equivalent to a 3-unit course), engaging with the team on a regular basis with a minimum of once per week recommended, and being transparent about capacity while communicating early if unable to continue.
Communication and Collaboration Tools
For group communication, teams will use the main Data Science at Georgia Tech Slack for organization-wide news and can create private channels or use Microsoft Teams or Discord for internal discussion. Collaborative writing should be done using Overleaf for LaTeX or Google Docs for other documents. All code must be version-controlled using GitHub, with repositories hosted in the official Data Science at Georgia Tech ARC GitHub organization.
Conflict Resolution and Team Dynamics
Proactively manage team dynamics by setting clear expectations at the project's start regarding communication, workload, and standards. It is important to maintain professional empathy, recognizing that all participants are managing other commitments. Grant grace for unforeseen circumstances, but also hold team members accountable for their commitments. Should conflicts arise that cannot be resolved internally, teams should utilize available resources by seeking guidance from experienced members or ARC group leadership. Ultimately, consistent, early, and clear communication is the most effective tool for preventing and resolving team conflicts.
Conducting Experiments
Once a team is formed and a proposal is in place, the core of the research work begins with conducting experiments. This phase is heavily reliant on the ability to implement and manage large-scale systems. The evaluation-focused competitions we participate in often involve datasets ranging from tens to hundreds of gigabytes, demanding efficient data processing and robust code.
The Experimental Workflow
The process begins with downloading the dataset and performing a thorough exploratory data analysis (EDA). The goal of EDA is to develop a deep understanding of the data's characteristics, including its schema, the size and nature of the train/test splits, and the statistical properties of its main features. Following this analysis, you must design your experiments, starting with a clear and simple baseline. For an information retrieval task, this might involve running a BM25 keyword search. You then implement your novel methodology, which is intended to improve upon this baseline. A key part of this stage is conducting ablation studies, where you systematically remove components from your system to isolate and quantify their individual contribution to the overall performance.
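For instance, a keyword baseline for a retrieval task can be only a few lines; the sketch below uses the rank_bm25 package on a toy corpus, with the documents, query, and whitespace tokenization standing in for the real competition data and preprocessing.

from rank_bm25 import BM25Okapi

# Toy corpus and query; in practice these come from the competition data.
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "a fast auburn fox leapt over a sleeping hound",
    "information retrieval evaluates ranked lists of documents",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "fox jumping over dog".lower().split()
scores = bm25.get_scores(query)              # one BM25 score per document
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
print(ranking, scores)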
Organization and Reproducibility
As you conduct these experiments, meticulous organization of both code and data is paramount to ensure your results are reproducible. While specific organizational strategies are left to individual teams, it is essential to keep a detailed log of all experiments, their parameters, and their outcomes. This can be managed in a spreadsheet or integrated directly into your paper draft. Furthermore, you must be familiar with the standard evaluation formats required by the venue, such as the TREC-style format common in information retrieval conferences.
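For reference, a TREC-style run file has one whitespace-separated line per retrieved document: query ID, the literal Q0, document ID, rank, score, and a run tag. Here is a minimal sketch for writing one from an in-memory result dictionary (all names and values below are placeholders):

# results maps a query ID to a list of (doc_id, score) pairs,
# already sorted by descending score.
results = {
    "q1": [("doc42", 12.3), ("doc7", 9.8)],
    "q2": [("doc3", 4.1)],
}

with open("myrun.trec", "w") as f:
    for qid, ranked in results.items():
        for rank, (doc_id, score) in enumerate(ranked, start=1):
            # Format: qid Q0 doc_id rank score run_tag
            f.write(f"{qid} Q0 {doc_id} {rank} {score} my_run\n")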
Team Collaboration and Project Management
Beyond the technical execution, conducting experiments successfully is a significant project management challenge. A critical task for the team is to strategically break down the work into smaller, independent components that members can tackle in parallel. This process of decomposing the problem requires constant and clear communication among all team members. Effective communication is the most crucial tool for navigating the complexities of collaborative research, ensuring that workloads are distributed effectively and the project remains on track.
Writing a Paper
A primary goal of the ARC group is to publish research. Because applied research competitions provide a well-defined evaluation task, dataset, and metric, our focus in writing is to clearly document the system we build and any novel contributions within our workflow.
The Structure of a Competition Paper
A typical competition paper is organized into several key sections. The introduction provides an overview of the paper, context for the dataset and your solution, and the central thesis of your work. This is followed by the background and related work section, which situates your research by discussing past work in the competition, related systems, and any technical context necessary to understand your solution.
The core of the paper is the methodology, where you detail the unique aspects of your system. This includes the tools used, data transformations performed, and any ablation studies conducted to determine the contribution of different components. It should begin with simple baselines that your work improves upon. Crucially, this section must contain all necessary details to ensure your work is reproducible. After the methodology, the results section presents quantitative findings from applying your methods to the data, including pipeline runtimes and performance scores for all systems tested.
The discussion section offers your interpretation of the results and their implications. This is where you analyze why the system behaved as it did and discuss ideas that arose from the experimental outcomes. This section should also include a discussion of future work, outlining potential research directions if you had more time. Finally, the conclusion briefly summarizes the paper's main contributions and closes out the work.
The Writing Process and Timeline
While the intensive coding phase often precedes focused writing, preliminary sections like the literature review can be drafted early in the semester. The main writing effort typically requires 20 to 40 hours over a period of two to four weeks. It is advisable to begin this process as early as possible.
All papers should be written in a collaborative LaTeX environment like Overleaf, using the templates provided by the group or the conference venue. The quality of writing should be high, similar to that expected in a graduate-level course like Machine Learning. The goal is to clearly document the results of the hard work already completed during the research phase.
Using Generative AI Tools
Generative AI tools can be leveraged responsibly as part of the writing process. They are effective for assistive tasks like formatting data into tables or helping to find related works for a literature review. However, you are strongly discouraged from using these tools to generate large portions of your paper, especially the methodology, results, or discussion. Doing the analysis and writing by hand is a critical part of understanding the research domain and demonstrating your comprehension of the work. Using AI to automate the core analysis is a form of academic dishonesty and cheats you of a key learning experience. Always be transparent about your use of these tools and ensure you are representing your school and lab responsibly.
Submission and Peer Review
Once submitted, your paper will undergo peer review. Acceptance rates vary significantly by venue. For workshops like those at CLEF, our group has a high likelihood of acceptance, though you may be required to make revisions based on reviewer feedback. For more selective conferences, the bar for acceptance is much higher. If a paper is not accepted at a particular venue, the work can always be shared publicly by uploading it as a preprint to a server like arXiv.
Applied Methods
Work in Progress
Environment Setup
SSH and Git Setup
Authenticate to GitHub using GitHub CLI
This section streamlines the authentication process to GitHub using the GitHub CLI (`gh`), which simplifies the SSH setup.
You can find the GitHub CLI installation instructions here.
- Run `gh auth login` to begin the authentication process.
- When prompted, select SSH as the preferred protocol for Git operations.
- If you don't already have an SSH key, `gh` will prompt you to generate one. Follow the on-screen instructions to create a new SSH key.
- `gh` will automatically add your SSH key to your GitHub account. Follow any additional prompts to complete the process.
- After completing the setup, run `gh auth status` to check that you're successfully authenticated.
If you want to do it manually, check the GitHub page: Generating a new SSH key and adding it to the ssh-agent
Verify GitHub User Information (Optional)
It's good practice to ensure your Git identity is correctly set:
- Check Git configuration: run `git config --list` to see your Git settings, including user name and email.
- Set Git user information if not set:
git config --global user.email "[email protected]"
git config --global user.name "Your Name"
Replace these with your GitHub email and name.
Configuring SSH Host Aliases
It is useful to set up your `~/.ssh/config` on your host as follows:
Host pace
HostName login-phoenix.pace.gatech.edu
User your_username
Host pace-interactive
HostName atl1-1-02-007-30-1.pace.gatech.edu
User your_username
ProxyJump pace
This adds the host aliases `pace` and `pace-interactive`.
Make sure to add your public SSH key to `~/.ssh/authorized_keys` on PACE after logging in via `ssh your_username@login-phoenix.pace.gatech.edu`.
You can now access PACE using `ssh pace`, and it will log you in automatically. The `pace-interactive` alias uses the login node as a jump host, allowing you to run VS Code sessions on interactive compute nodes.
Read more about SSH config files: ssh_config(5) manual page
Add Authorized Keys to PACE
# Log into PACE
ssh pace
# Create .ssh directory if it doesn't exist
mkdir -p ~/.ssh
# Create or append to authorized_keys file
nano ~/.ssh/authorized_keys
# Paste your public key, save and exit (Ctrl+O, Enter, Ctrl+X)
# Set correct permissions
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
# Test your connection from your local machine
ssh pace
Updating pace-interactive Alias
Allocate a new interactive session on PACE. For example:
salloc --account=paceship-dsgt_clef2025 --nodes=1 --ntasks=1 --cpus-per-task=8 --mem-per-cpu=4G --time=2:00:00 --qos=inferno
Make sure to keep this terminal around. Get the hostname from the session:
$ hostname
atl1-1-02-007-30-1.pace.gatech.edu
Copy the hostname and update your `~/.ssh/config` file:
Host pace-interactive
HostName atl1-1-02-007-30-1.pace.gatech.edu
User your_username
ProxyJump pace
Then you can SSH via `ssh pace-interactive` from your host machine, either through the terminal or VS Code.
Note that this will also allow you to port forward any services running on these nodes.
Advanced SSH Configuration
Port Forwarding for Development
Common port forwarding scenarios for research work:
# Forward Jupyter notebook (local 8888 -> PACE 8888)
ssh -L 8888:localhost:8888 pace-interactive
Working with Git on PACE
Basic Git Setup on PACE
# SSH to PACE
ssh pace
# Load Git module (if using module system)
module load git
# Configure Git if not done already
git config --global user.name "Your Name"
git config --global user.email "[email protected]"
# Set VS Code as default editor if available
git config --global core.editor "code --wait"
# Verify configuration
git config --list
Clone and Work with Repositories
# Clone your research repository
git clone git@github.com:username/your-research-project.git
# Or clone using the GitHub CLI
gh repo clone username/your-research-project
Python Setup
Using UV for Python management
UV is a modern, fast Python package installer and resolver written in Rust. It's designed to be a drop-in replacement for pip and pip-tools, with significantly faster dependency resolution and installation.
Installing UV
# On Linux/macOS
curl -LsSf https://astral.sh/uv/install.sh | sh
# On Windows (PowerShell)
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
# Via pip (if you already have Python)
pip install uv
Basic UV Usage
# Install packages
uv pip install numpy pandas
UV vs Traditional Tools
- Speed: UV is 10-100x faster than pip for dependency resolution
- Lock files: Built-in support for lock files with `uv.lock`
- Resolution: More reliable dependency resolution
- Compatibility: Drop-in replacement for most pip commands
Package Management and Dependencies
Use `pyproject.toml` for managing project dependencies. This file allows you to specify your project's metadata and dependencies in a standardized way, as defined by PEP 621.
Modern Dependency Management with UV
Instead of manually editing requirements files, use `uv add` to add dependencies to your `pyproject.toml`:
# Add core dependencies
uv add numpy pandas matplotlib scikit-learn
# Add development dependencies
uv add --dev jupyter pytest black
# Add optional dependencies for specific features
uv add --optional ir "pyterrier>=0.9.0" "python-terrier>=0.4.0"
Example pyproject.toml
Use `uv init` to create a `pyproject.toml` file with the necessary structure. It should look something like this:
[project]
name = "arc-seminar-project"
version = "0.1.0"
description = "Research project for ARC seminar"
authors = [{name = "Your Name", email = "[email protected]"}]
dependencies = [
"numpy>=1.24.0", # https://numpy.org/
"pandas>=2.0.0", # https://pandas.pydata.org/
"matplotlib>=3.7.0", # https://matplotlib.org/
"scikit-learn>=1.3.0", # https://scikit-learn.org/
"torch>=2.0.0", # https://pytorch.org/
"transformers>=4.30.0", # https://huggingface.co/transformers/
]

[project.optional-dependencies]
# extras used with `uv sync --extra ir` (same specs as the `uv add --optional ir` example above)
ir = ["pyterrier>=0.9.0", "python-terrier>=0.4.0"]
Installing Dependencies
# Install base dependencies
uv sync
# Install with optional IR dependencies
uv sync --extra ir
# Install with development dependencies
uv sync --extra dev
# Install everything
uv sync --all-extras
Virtual Environments
Use `uv venv` to create and manage virtual environments easily. This will create a `.venv` directory in your project folder, which isolates your Python environment.
uv venv
source .venv/bin/activate # Linux/macOS
Essential Libraries
Core Data Science Stack
Package | Description |
---|---|
numpy | Fundamental package for numerical computations in Python. |
pandas | Data manipulation and analysis library, providing data structures like DataFrames. |
matplotlib | Plotting library for creating static, animated, and interactive visualizations in Python. |
scikit-learn | Machine learning library for Python, providing simple and efficient tools for data mining and data analysis. |
scipy | Library for scientific and technical computing, building on NumPy. |
Machine Learning and Deep Learning
Package | Description |
---|---|
torch | PyTorch library for deep learning, providing tensor computations and neural network capabilities. |
transformers | Hugging Face library for working with transformer models and datasets, particularly in NLP. |
datasets | Hugging Face library for accessing and processing datasets. |
tokenizers | Fast tokenizers for NLP preprocessing. |
Information Retrieval and Search
Package | Description |
---|---|
pyterrier | Python framework for information retrieval experimentation and research. |
pyserini | Lucene-based toolkit for reproducible information retrieval research. |
faiss-cpu | Facebook AI Similarity Search library for efficient similarity search and clustering. |
sentence-transformers | Library for sentence, text and image embeddings using transformer models. |
Workflow and Pipeline Management
Package | Description |
---|---|
luigi | Workflow management system for building complex data pipelines. |
wandb | Weights & Biases for experiment tracking and model management. |
Development and Productivity
Package | Description |
---|---|
jupyter | Interactive computing environment for notebooks. |
tqdm | Progress bars for Python loops and iterables. |
rich | Library for rich text and beautiful formatting in the terminal. |
Jupyter Setup
Installing Jupyter Lab/Notebook
Local Installation
Install Jupyter using UV (recommended for modern Python projects):
# Add Jupyter to your project
uv add jupyter
# Or install globally
uv tool install jupyter
# Alternative: Install JupyterLab (https://jupyterlab.readthedocs.io/), a more modern interface
uv add jupyterlab
# Or install both
uv add jupyter jupyterlab
Verify Installation
# Check Jupyter installation
jupyter --version
# Check JupyterLab installation
jupyter lab --version
# List available kernels
jupyter kernelspec list
Running Jupyter on PACE
Basic Setup on PACE
# SSH into PACE
ssh pace
# Load Python module (if using module system)
module load python/3.11
# Create or activate your virtual environment
source ~/.venvs/research-env/bin/activate
# Install Jupyter in your environment
uv add jupyterlab
# Alternative: using pip if UV not available
pip install jupyterlab
Running Jupyter on Login Node (Limited Use)
Only use login nodes for light testing. For actual work, use interactive or batch jobs.
# Quick test on login node (use sparingly)
jupyter lab --no-browser --port=8888
# Better: bind to all interfaces so the server is reachable from other machines/nodes
jupyter lab --no-browser --ip=0.0.0.0 --port=8888
Running Jupyter on Interactive Nodes (Recommended)
Method 1: Interactive Session + Port Forwarding
# 1. Allocate interactive session
salloc --account=paceship-dsgt_clef2025 --nodes=1 --ntasks=1 --cpus-per-task=8 --mem-per-cpu=4G --time=4:00:00 --qos=inferno
# 2. Note the allocated node hostname
hostname
# Example output: atl1-1-02-007-30-1.pace.gatech.edu
# 3. Update your SSH config (from your local machine), either by editing
#    ~/.ssh/config by hand as described above or with a small helper script, e.g.:
./update-pace-interactive.sh atl1-1-02-007-30-1.pace.gatech.edu
# 4. Start Jupyter on the interactive node
jupyter lab --no-browser --ip=0.0.0.0 --port=8888
# 5. From another terminal on your local machine, forward the port
ssh -L 8888:localhost:8888 pace-interactive
Method 2: SLURM Batch Job for Long-Running Notebooks
Create a SLURM script `jupyter_job.slurm`:
#!/bin/bash
#SBATCH --job-name=jupyter-server
#SBATCH --account=paceship-dsgt_clef2025
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=4G
#SBATCH --time=8:00:00
#SBATCH --qos=inferno
#SBATCH --output=jupyter-%j.out
#SBATCH --error=jupyter-%j.err
# Load modules
module load python/3.11
# Activate environment
source ~/.venvs/research-env/bin/activate
# Get the node hostname
NODE=$(hostname)
echo "Jupyter server running on node: $NODE"
echo "Use this command to connect:"
echo "ssh -L 8888:$NODE:8888 pace"
# Start Jupyter
jupyter lab --no-browser --ip=0.0.0.0 --port=8888
Submit and monitor the job:
# Submit the job
sbatch jupyter_job.slurm
# Check job status
squeue -u $USER
# View output (contains connection instructions)
cat jupyter-JOBID.out
Port Forwarding for Remote Access
Simple Port Forwarding
# Forward port 8888 from PACE to your local machine
ssh -L 8888:localhost:8888 pace
# If using interactive node
ssh -L 8888:localhost:8888 pace-interactive
# Multiple ports (Jupyter + MLflow + TensorBoard)
ssh -L 8888:localhost:8888 -L 5000:localhost:5000 -L 6006:localhost:6006 pace-interactive
VS Code Integration
If using VS Code with Remote SSH:
- Connect to PACE via Remote SSH
- Open terminal in VS Code
- Start Jupyter:
jupyter lab --no-browser --port=8888
- VS Code will automatically offer to forward the port
- Click the notification or go to Ports tab
Best Practices for Notebook Organization
Project Structure
research-project/
├── notebooks/
│ ├── 01-data-exploration.ipynb
│ ├── 02-preprocessing.ipynb
│ ├── 03-model-training.ipynb
│ ├── 04-evaluation.ipynb
│ └── 99-final-results.ipynb
├── src/
│ ├── __init__.py
│ ├── data/
│ ├── models/
│ └── utils/
├── data/
│ ├── raw/
│ ├── processed/
│ └── external/
├── pyproject.toml
└── README.md
Notebook Naming Conventions
# Use numbered prefixes for workflow order
01-data-exploration.ipynb
02-feature-engineering.ipynb
03-model-training.ipynb
04-evaluation.ipynb
# Use descriptive names with dates for experiments
2025-01-15-bert-fine-tuning.ipynb
2025-01-16-ensemble-methods.ipynb
# Separate exploration from production
exploratory/
├── data-analysis-jan-15.ipynb
└── model-experiments.ipynb
production/
├── final-model-training.ipynb
└── evaluation-metrics.ipynb
VS Code Setup
Installing VS Code
Download and Install
- Download VS Code: Go to Visual Studio Code website
- Choose your platform:
  - Windows: download the `.exe` installer
  - macOS: download the `.dmg` file
  - Linux: download the `.deb` (Ubuntu/Debian) or `.rpm` (Red Hat/Fedora) package
Recommended Extensions
Install these extensions for a complete data science setup:
- Python: Official Python extension with IntelliSense, debugging, and linting
- Ruff: Fast Python linter and formatter
- Jupyter: Native notebook support in VS Code
- Remote - SSH: Connect to remote machines via SSH
Remote SSH Extension Setup
Initial Configuration
Follow the SSH and Git Setup guide to configure your SSH connection to PACE.
- Install the Remote - SSH extension: `ms-vscode-remote.remote-ssh`
- Open the Command Palette: `Ctrl+Shift+P` (Windows/Linux) or `Cmd+Shift+P` (macOS)
- Type: "Remote-SSH: Connect to Host"
- Enter host: use your PACE SSH configuration (e.g., `pace` or `pace-interactive`)
Remote Development Tips
Port Forwarding in VS Code
- Automatic Detection: VS Code detects running services and offers to forward ports
- Manual Forwarding:
  - Open the Command Palette (`Ctrl+Shift+P`)
  - Type "Ports: Focus on Ports View"
  - Click "Forward a Port"
  - Enter the port number (e.g., 8888 for Jupyter)
Configuring Python Environment
Python Interpreter Selection
- Open the Command Palette: `Ctrl+Shift+P`
- Type: "Python: Select Interpreter"
- Choose from:
  - System Python
  - Virtual environments
Working with Jupyter Notebooks in VS Code
Native Jupyter Support
VS Code provides native Jupyter notebook support:
- Open `.ipynb` files directly in VS Code
- Create new notebooks: `Ctrl+Shift+P` → "Jupyter: Create New Jupyter Notebook"
- Select a kernel: click the kernel name in the top-right corner
As long as Jupyter is installed in your Python environment (ideally a virtual environment), you can run notebooks seamlessly.
Jupyter Server Configuration
Connect to Remote Jupyter Server
- Start Jupyter on PACE:
ssh pace-interactive
jupyter lab --no-browser --ip=0.0.0.0 --port=8888
- Connect VS Code:
  - Open the Command Palette (`Ctrl+Shift+P`)
  - Type "Jupyter: Specify Jupyter Server for Connections"
  - Enter the server URL: `http://localhost:8888`
  - Enter the token from the Jupyter output
PACE Setup
Work in Progress
Getting Access to PACE
Account Types and Limits
- Student Accounts: Free tier with limited compute hours
- Research Allocations: Group allocations with shared compute time
- Storage: Home directory (50GB) + group storage allocation
Connecting to PACE via SSH
SSH Configuration (Recommended)
See the SSH and Git Setup guide for detailed instructions on configuring your SSH connection.
First Login Setup
# After first successful login
ssh pace
# Check your environment
hostname
whoami
pwd
df -h $HOME
# Check available modules
module avail
Understanding the PACE Environment
Cluster Architecture
PACE consists of multiple clusters:
- Phoenix: Primary cluster with modern hardware
  - Login nodes: General access, file management, job submission
  - Compute nodes: CPU and GPU nodes for actual computation
  - Storage: High-performance parallel file systems
- ICE: Specialized cluster for certain workloads
  - Different hardware configurations
  - May have different software availability
Node Types
Login Nodes
- Purpose: File management, job submission, light development
- Limitations:
- No intensive computation (long-running processes are killed after about 30 minutes)
- Shared among all users
- Limited memory and CPU
- Use for: Editing files, submitting jobs, basic testing
Compute Nodes
- CPU Nodes: Various configurations (8-64 cores, 32GB-1TB RAM)
- GPU Nodes: NVIDIA GPUs (V100, A100, RTX series)
- Access via: SLURM job scheduler only
- Use for: Training models, running experiments, intensive computation
Software Environment
Module System
PACE uses environment modules to manage software:
# List available modules
module avail
# Search for specific software
module avail python
module avail cuda
module avail torch
# Load modules
module load python/3.11
module load cuda/11.8
# List loaded modules
module list
# Unload modules
module unload python/3.11
module purge # unload all
# Show module details
module show python/3.11
Resource Allocation System
Quality of Service (QoS) Levels
- inferno: Default queue; higher priority, suitable for long-running jobs
- embers: Low-priority, preemptible jobs with 1 hour of guaranteed runtime
Account Structure
# Check your allocations
pace-quota
File System and Storage
TODO: Add details about file systems, storage options, and best practices for data management.
Basic SLURM Commands
Job Submission
Interactive Jobs
# Request interactive session
salloc --account=paceship-dsgt_clef2025 --nodes=1 --ntasks=1 --cpus-per-task=8 --mem-per-cpu=4G --time=2:00:00 --qos=inferno
# Request GPU node
salloc --account=paceship-dsgt_clef2025 --nodes=1 --ntasks=1 --cpus-per-task=4 --mem-per-cpu=4G --gres=gpu:1 --time=1:00:00 --qos=inferno
# Exit interactive session
exit
Batch Jobs
Create a SLURM script `job.slurm`:
#!/bin/bash
#SBATCH --job-name=my-experiment
#SBATCH --account=paceship-dsgt_clef2025
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=4G
#SBATCH --time=4:00:00
#SBATCH --qos=inferno
#SBATCH --output=job-%j.out
#SBATCH --error=job-%j.err
# Load modules
module load python/3.11
module load cuda/11.8
# Activate environment
source ~/.venvs/research-env/bin/activate
# Run your script
python train_model.py --config configs/bert.yaml
Submit the job:
sbatch job.slurm
Job Management
# Check job queue
squeue -u $USER
# Check all jobs for your account
squeue -A paceship-dsgt_clef2025
# Check job details
scontrol show job JOBID
# Cancel job
scancel JOBID
# Cancel all your jobs
scancel -u $USER
Job Monitoring
# Check running jobs
squeue -u $USER -t RUNNING
# Monitor resource usage of a running job: SSH to the compute node it is
# running on (get the node name from squeue), then inspect usage
ssh <compute-node-hostname>
htop
nvidia-smi # for GPU usage
Common SLURM Parameters
Resource Requests
# CPU jobs
--nodes=1 # Number of nodes
--ntasks=1 # Number of tasks (usually 1 for Python)
--cpus-per-task=8 # CPU cores per task
--mem-per-cpu=4G # Memory per CPU core
--time=4:00:00 # Wall time (HH:MM:SS)
# GPU jobs
--gres=gpu:1 # Request 1 GPU
--gres=gpu:rtx_6000:1 # Request specific GPU type
--gres=gpu:2 # Request 2 GPUs
# Memory options
--mem=32G # Total memory for job
--mem-per-cpu=4G # Memory per CPU core
Job Control
--job-name=my-job # Job name
--output=job-%j.out # Output file (%j = job ID)
--error=job-%j.err # Error file
--mail-type=ALL # Email notifications
[email protected] # Email address
Best Practices
Resource Management
- Start small: Test with short jobs first
- Request only what you need: Don't waste resources
- Use checkpointing: Save progress for long jobs so a resubmitted job can resume (see the sketch below)
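A minimal sketch of what checkpointing might look like in a PyTorch training script; the file path, save frequency, and the `model`/`optimizer` names are illustrative assumptions, not a PACE requirement:

import torch

CHECKPOINT_PATH = "checkpoint.pt"  # assumed location; prefer project/scratch storage in practice

def save_checkpoint(model, optimizer, epoch, path=CHECKPOINT_PATH):
    """Save enough state to resume training after a job times out or is preempted."""
    torch.save(
        {
            "epoch": epoch,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
        },
        path,
    )

def load_checkpoint(model, optimizer, path=CHECKPOINT_PATH):
    """Restore model/optimizer state and return the epoch to resume from."""
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return checkpoint["epoch"] + 1

# Call save_checkpoint(...) every few epochs inside the training loop so a
# resubmitted SLURM job can pick up where the previous one stopped.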
Concepts
Exploratory Data Analysis
Work in Progress
Data Understanding and Profiling
Statistical Analysis and Visualization
Data Quality Assessment
Feature Engineering Techniques
EDA Best Practices for ML/IR
Introduction to Embeddings
An embedding is a technique used to represent high-dimensional data, like text or images, as a fixed-size vector of numbers in a lower-dimensional space. The key idea is that this new representation captures the semantic meaning of the original data, so items with similar meanings will have vectors that are close to each other. This is incredibly useful because it's much easier to work with vectors than with raw text or pixels.
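As a minimal illustration of what "close" means here, a small NumPy sketch of cosine similarity, the most common closeness measure for embeddings (the toy vectors are made up; real embeddings come from a model, as shown in the next section):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 means identical direction; values near 0 mean unrelated vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real embeddings
cat = np.array([0.9, 0.8, 0.1])
kitten = np.array([0.85, 0.75, 0.2])
car = np.array([0.1, 0.2, 0.95])

print(cosine_similarity(cat, kitten))  # high score: semantically similar
print(cosine_similarity(cat, car))     # lower score: semantically different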
Generating and Visualizing Embeddings
You'll typically use a pre-trained model from a library like Sentence-Transformers (built on Hugging Face) to generate embeddings. These models have been trained on vast amounts of data and have learned to create meaningful vector representations.
from sentence_transformers import SentenceTransformer
# Load a pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')
# A list of sentences to embed
sentences = [
"This is an example sentence.",
"Each sentence is converted to a vector.",
"Semantic search is a common application."
]
# Generate embeddings
embeddings = model.encode(sentences)
print(embeddings.shape)
# Expected output: (3, 384), where 3 is the number of sentences
# and 384 is the dimension of the embedding vector.
Once you have your data embedded into a matrix (e.g., an `n_documents x d_dimensions` matrix), it's hard to understand what those numbers mean directly.
The best way to get an intuitive feel is to visualize them.
You can use a dimensionality reduction technique like PCA or a manifold learning algorithm like UMAP or t-SNE to project your high-dimensional vectors down to 2D, which can then be plotted.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
# Assume 'embeddings' is your N x D matrix from the previous step
# Assume 'labels' is an array of integer class labels for each document
# Reduce dimensions to 2D for plotting
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)
# Create a scatter plot
plt.figure(figsize=(10, 8))
scatter = plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], c=labels, cmap='viridis')
plt.title("2D Visualization of Document Embeddings")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
# Derive legend entries from the unique label values
unique_labels = sorted(set(labels))
plt.legend(handles=scatter.legend_elements()[0], labels=[str(l) for l in unique_labels])
plt.show()
Common Applications
Embeddings are not just for visualization; they are the foundation for many powerful techniques used in modern machine learning and information retrieval.
Semantic Search
Instead of matching keywords, semantic search finds documents based on their conceptual meaning. This is done by embedding a search query and then finding the document vectors that are closest to it in the embedding space, typically using cosine similarity. For large-scale search, Approximate Nearest Neighbor (ANN) libraries like Faiss are used to find the "good enough" nearest neighbors very quickly.
import faiss
import numpy as np
# Assume 'doc_embeddings' is your N x D matrix of document embeddings
dimension = doc_embeddings.shape[1]
# 1. Build a Faiss index
#    (IndexFlatL2 uses Euclidean distance; for cosine similarity, L2-normalize the
#    embeddings or use faiss.IndexFlatIP on normalized vectors)
index = faiss.IndexFlatL2(dimension)
index.add(doc_embeddings.astype('float32')) # Faiss requires float32
# 2. Embed a query (reusing the SentenceTransformer 'model' loaded earlier)
query_text = ["Find me news about new technology"]
query_embedding = model.encode(query_text).astype('float32')
# 3. Search the index
k = 5 # Number of nearest neighbors to retrieve
distances, indices = index.search(query_embedding, k)
print(f"Top {k} most similar document indices: {indices}")
Transfer Learning and Re-ranking
Embeddings are a form of transfer learning. The knowledge learned by a large foundation model is "transferred" to your task through its vector representations. You can use these vectors as features to train simpler models for tasks like classification or regression.
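For example, a minimal sketch of this idea, assuming you already have the `embeddings` matrix from above, a matching array of integer `labels`, and scikit-learn installed:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 'embeddings' is an N x D matrix of document vectors; 'labels' holds their class ids
X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=42
)

# A linear classifier on top of frozen embeddings is often a strong, cheap baseline
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))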
Re-ranking is a more advanced two-stage search technique.
- Retrieval: Use a fast method (like BM25 keyword search or a Faiss index) to retrieve an initial set of candidate documents (e.g., the top 100).
- Re-ranking: Use a more powerful, but slower, model like a cross-encoder to re-evaluate and re-order just this small set of candidates to get a more accurate final ranking. The cross-encoder takes a (query, document) pair and outputs a relevance score.
from sentence_transformers.cross_encoder import CrossEncoder
# Load a pre-trained cross-encoder
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
# The query and the documents retrieved from the first stage
query = "Find me news about new technology"
retrieved_docs = [
"A new AI chip was announced today.",
"Global stock markets are down.",
"The latest smartphone features a foldable screen."
]
# Create pairs of (query, document)
sentence_pairs = [[query, doc] for doc in retrieved_docs]
# The cross-encoder predicts a relevance score for each pair
scores = cross_encoder.predict(sentence_pairs)
# Sort documents by the new scores
sorted_docs = sorted(zip(scores, retrieved_docs), key=lambda x: x[0], reverse=True)
print("Re-ranked Documents:", sorted_docs)
Why This Matters for Competitions
Understanding and using embeddings is critical for success in many competitions. They allow you to leverage the power of massive foundation models efficiently. Whether you're building a search system, a classifier, or a recommendation engine, being able to generate, visualize, and apply embeddings will give you a significant advantage. Be sure to practice with these tools, but also be mindful that generating embeddings for very large corpora can be computationally expensive.
Information Retrieval Basics
Work in Progress
Core IR Concepts and Terminology
Indexing and Document Representation
Ranking and Scoring Functions
Evaluation Metrics (MAP, NDCG, Precision/Recall)
Modern IR with Neural Networks
Large Language Models
Work in Progress
Understanding Transformer Architecture
Pre-training vs Fine-tuning
Hugging Face Transformers Library
Parameter-Efficient Fine-Tuning (PEFT)
LLM Inference and Deployment Considerations
PACE Containers
Work in Progress
Introduction to Apptainer/Singularity
Building Custom Containers
Running Containers on PACE
GPU Support in Containers
Container Best Practices for Reproducibility
Workflow Management
Work in Progress
Experiment Tracking with MLflow and WandB
Version Control for Data Science
SLURM Job Management and Monitoring
Reproducible Environments
Pipeline Orchestration Tools
Cookbook
This section contains a collection of guides for common tasks.