Project ideas

Read this first

If you contact me and it is obvious from your e-mail that you did not read this page, I will ignore your request and I will not feel bad about it. If you have some idea in mind, read up on the short guide to proposing your own project idea below. If you do not, scroll down to the list of projects I have some interest in supervising. Some are better thought out than others, but I expect you to bring your ideas in them as well. I am not the kind of supervisor who gives weekly to-do lists.

A short guide to proposing your own project idea

Proposing your own idea

Because the world is a dumpster fire, and it does not seem to be going in an upward direction, I am progressively shifting my focus to supervising projects for social good, i.e. projects which have a net positive or at the very minimum a net neutral ethical footprint. What that means in practice for you as a potential student is that I will not supervise the following types of projects:

  • Stock price prediction projects (boring) ;
  • Optimising financial something-something (gross) ;
  • Modelling COVID-19 (leave that to the epidemiologists).

On the other hand, I am particularly interested in projects which bring a net positive to the world, such as tackling disinformation/misinformation, helping mental health professionals, or anything related to text and language. I may be convinced to supervise something outside of my area of expertise if it is incredibly cool (e.g., building a VR game) but keep in mind that lacking any expertise in it, my assistance will be limited to general project management and writing/reading.

Making sure your idea is not terrible

If you want to propose your own idea, which you think fits in my overall themes (or is just so cool you might just convert me), I am giving you an easy 7-step format (1 or 2 more steps if you want to do something more complex) to structure your idea into a decent project. This is not the only way to do things well, but it is a way that does things well consistently, and therefore I am more comfortable sharing it. The goal is to structure your idea in a way that helps you plan your actual project. At the start you should be confident in your understanding of steps 1 and 2, and work your way into answering the rest of the questions. I would expect you to have a good idea for all steps by the first few weeks of the project at the latest.

Research-focused project

Step 1. There is a problem in the world: [what do you want to contribute toward solving?]
Step 2. That problem is important because: [why do you want to contribute toward solving it?]
Step 3. A cursory search shows me other people who tried to fix this problem: [basic literature search]
Step 4. A better way to solve it would be to do: [what you are proposing]
Step 5. This raises the following research questions: [what are the questions you are trying to answer?]
Step 6. I will solve those research questions by performing the following experiments: [plan of work]
Step 7. I will validate my hypothesis based on the following baselines: [competing approaches]

Engineering-focused project

Step 1. There is a problem in the world: [what do you want to contribute toward solving?]
Step 2. That problem is important because: [why do you want to contribute toward solving it?]
Step 3. A cursory search shows me other people who tried to fix this problem: [basic literature search]
Step 4. A better way to solve it would be to do: [what you are proposing]
Step 5. This raises the following feature requirements: [what does your software need to succeed?]
Step 6. I will solve those requirements by implementing the following software: [plan of work]
Step 7. I will validate my requirements as follows: [how you plan to user-test your software]

Hybrid project

Step 1. There is a problem in the world: [what do you want to contribute toward solving?]
Step 2. That problem is important because: [why do you want to contribute toward solving it?]
Step 3. A cursory search shows me other people who tried to fix this problem: [basic literature search]
Step 4. A better way to solve it would be to do: [what you are proposing]
Step 5. This raises the following research questions: [what are the questions you are trying to answer?]
Step 6. I will solve those research questions by performing the following experiments: [plan of work]
Step 7. In order to perform those experiments, I need to build a piece of software with the following requirements: [what does your software need to succeed?]
Step 8a. I will evaluate my software based on the following methodology: [how you plan to user-test your software]
Step 8b. I will validate my hypothesis based on the following baselines: [competing approaches]

About good and bad research questions

In my teaching, I usually refer to a set of characteristics of a good research question which I remember using the SPAIN mnemonic, which stands for:

  • Specific: you need to refer to specific quantities and how they relate to each other. “Is algorithm X better than algorithm Y?” is a bad question because it uses undefined quantities (what is better? is it faster? more accurate?), an undefined context (for which task? sentiment analysis? argument mining? general classification?), and does not have a real criterion for answering (how much better is better? is 0.00001 better really better?).
  • Plausible: the thing you are investigating needs to be plausible. “Does coding on a black keyboard help your machine learning algorithm work better?” is a bad research question, because there is no plausible mechanism for it and therefore regardless of the result it would be a waste of time.
  • Answerable: can you actually answer this question? Do you have access to relevant data? Enough of it? Do you have access to compute to run the models? Do you have enough to run all your models in time for you to write up your project?
  • Interesting (or alternatively Impactful): this is a tricky one, but the question needs to be interesting. What is the impact of answering this question? Does knowing the answer help researchers?
  • Novel: this one is the most misunderstood by students, because the expectation of novelty is radically different for an undergraduate student, a postgraduate taught student, and a postgraduate research student. A PhD student is expected to output grade A novelty work, meaning that it is novel in an impactful way. UG and PGT students are expected to output grade B novelty work, meaning that it just needs to not be a carbon copy of an existing project. It does not need to change the landscape of research forever (although it can be excellent and publishable work), only to not be a simple copy of some Kaggle notebook you found somewhere.

List of potential projects

Here are a bunch of project ideas I would like to supervise in some form. They are not fixed, insofar as you can come up with a slight variation of them and we can talk them out. They can be done at the undergraduate or the MSc level but might require slight adaptations in some cases to better fit the timeline of your degree. Some of those are research-oriented, and would suit a student aiming for further study. Some are more engineering-focused, and would suit a student who wants to build something cool (hopefully). I classify projects into three wide categories: (1) projects somewhat affiliated to my research group, which might build on one another ; (2) one-shot projects which I think are fun and interesting but completely unrelated to my work ; (3) general lines of investigation that I am interested in, but without a clear direction (you will be expected to bring a lot more of your ideas into this).

1. CHART-affiliated projects

Project 1 – Identifying speakers from sensor array

Current voice assistants are limited in their ability to locate their user: they can only identify them based on the room they are in and by recognising their voice. This limits their usefulness in settings such as smart homes, where being able to locate the speaker exactly and offer contextual help would be necessary. The goal of this project is to overcome this issue by identifying and segmenting the likely speaker from the input of multiple cameras and microphones placed in a room. The final system should be able to do basic tasks such as putting a bounding box around the speaker, and performing additional analysis on their position (“what am I pointing at?”) or gestures (“how many fingers am I holding?”).

Project 2 – Social AI

For embodied AI to be accepted in human society, we need AI-enabled agents to be able to read social cues and anticipate human behaviour from them, such as a human agent being uncomfortable, surprised, interested, angry, etc. This is a large project that will be broken down into multiple smaller projects.

Project 2A – An LLM-powered social bot that anticipates human behaviour from multimodal signals

The goal of this project will be two-fold:
(1) Collect a sample dataset of conversations with a person reading and acting a script in front of a camera and microphone with automated transcription ;
(2) Run a comparative analysis of baseline methods which can take as input video frames, audio signal, and transcribed text in order to make a best estimate of the person’s attitude. We will investigate the most salient features (e.g., audio pitch, facial expression, etc.).

Project 2B – An LLM-powered social bot that knows when to talk (and when to listen)

For embodied AI to be accepted in human society, we need voice interfaces to be able to determine with good accuracy when their human interlocutor is done talking. In this project, we will be investigating multimodal techniques to detect when a speaker is about to finish talking.
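To give a sense of the simplest possible starting point, here is a minimal sketch of a naive, audio-only baseline that declares the end of a turn after a run of silent frames. It is not the multimodal method the project is after, only the kind of baseline you would be expected to beat. The function name, the energy threshold, and the frame count are all assumptions made for illustration.

```python
def end_of_turn_index(frame_energies, threshold=0.01, min_silent_frames=5):
    """Return the index of the first frame of the silence that ends the turn.

    frame_energies: per-frame audio energy values (hypothetical input format).
    threshold and min_silent_frames are illustrative values, not tuned ones.
    Returns None if no long-enough silence is found.
    """
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        # Extend the current run of silent frames, or reset it on speech.
        silent_run = silent_run + 1 if energy < threshold else 0
        if silent_run >= min_silent_frames:
            return i - min_silent_frames + 1
    return None
```

Note that this baseline only reacts after the speaker has stopped, whereas the project aims to anticipate the end of the turn before it happens.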

Project 2C – An LLM-powered social bot that knows what to remember, and what to forget

Retrieval-Augmented Generation (RAG) is a general framework for retrieving facts from a knowledge base and feeding them into a large language model in order to generate output grounded in a specific set of facts. The goal of this project is to investigate the use of RAG for continuously updating knowledge bases, by building a system that analyses user input for factual statements and uses them to update the knowledge base, and that also handles explicit updating (“update my address to X”) and forgetting (e.g., “forget my address”, or “forget anything related to X”) commands. You will need to possess a machine that can run some small local LLM in order to do this project.
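To make the remember/update/forget behaviour concrete, here is a minimal sketch of a toy knowledge base driven by regular expressions. Everything in it is an assumption made for illustration: the class name FactStore, the command patterns, and the key-value storage. A real system would presumably use an LLM to extract facts and a proper retrieval index instead.

```python
import re


class FactStore:
    """Toy key-value knowledge base (hypothetical; stands in for a real index)."""

    def __init__(self):
        self.facts = {}

    def handle(self, utterance):
        """Parse one user utterance and apply it to the knowledge base."""
        text = utterance.strip().rstrip(".")
        # "forget anything related to X": drop every fact that mentions X.
        m = re.fullmatch(r"forget anything related to (.+)", text, re.IGNORECASE)
        if m:
            topic = m.group(1).lower()
            doomed = [k for k, v in self.facts.items()
                      if topic in k or topic in v.lower()]
            for key in doomed:
                del self.facts[key]
            return f"forgot {len(doomed)} fact(s)"
        # "forget my X": drop a single fact by key.
        m = re.fullmatch(r"forget my (.+)", text, re.IGNORECASE)
        if m:
            removed = self.facts.pop(m.group(1).lower(), None)
            return "forgotten" if removed is not None else "nothing to forget"
        # "update my X to Y": overwrite (or create) a fact.
        m = re.fullmatch(r"update my (.+?) to (.+)", text, re.IGNORECASE)
        if m:
            self.facts[m.group(1).lower()] = m.group(2)
            return "updated"
        # "my X is Y": a plain factual statement.
        m = re.fullmatch(r"my (.+?) is (.+)", text, re.IGNORECASE)
        if m:
            self.facts[m.group(1).lower()] = m.group(2)
            return "remembered"
        return "no fact found"

    def context_for(self, question):
        """Return the stored facts whose key appears in the question."""
        q = question.lower()
        return [f"The user's {k} is {v}." for k, v in self.facts.items() if k in q]
```

The strings returned by context_for are what would be prepended to the LLM prompt in the retrieval step; once a fact is forgotten, it simply never reaches the model again.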

Project 2D – A platform for LLM-to-LLM communication for artificial group decision making

In this project, you will build a platform to allow multiple (local) Large Language Models to communicate with each other. The platform will allow the user to upload LLM personas (in an undefined form you will need to investigate) as well as a topic of debate. It will then allow the personas to debate the topic until a conclusion is reached. You will have to investigate ways to efficiently operate multiple personas in parallel (and the hardware challenges this will pose) as well as ways to carry on conversations that go beyond the context size limit of standard large language models. While we have access to servers for experiments, it is probably better that you possess a machine that can run some small local LLM in order to do this project.
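As a rough illustration of the debate loop, here is a minimal sketch in which personas take turns and each one only sees the most recent messages. The function name run_debate, the persona format (a plain system prompt per name), the sliding window, and the "I agree" stop condition are all assumptions made for illustration; the generate argument is a placeholder for whatever call your local LLM exposes.

```python
from collections import deque
from typing import Callable, Dict, List


def run_debate(
    personas: Dict[str, str],
    topic: str,
    generate: Callable[[str], str],
    max_turns: int = 6,
    window: int = 4,
) -> List[str]:
    """Round-robin debate between personas.

    personas maps a name to a persona description (hypothetical format).
    generate is any function taking a prompt and returning a reply.
    Only the last `window` messages are put in the prompt, which is a crude
    stand-in for real context-limit handling such as summarisation.
    """
    history = deque(maxlen=window)
    transcript = []
    names = list(personas)
    for turn in range(max_turns):
        name = names[turn % len(names)]
        prompt = (
            f"{personas[name]}\nTopic: {topic}\n"
            + "\n".join(history)
            + f"\n{name}:"
        )
        reply = generate(prompt)
        line = f"{name}: {reply}"
        history.append(line)
        transcript.append(line)
        # Hypothetical stop condition: a persona explicitly agrees.
        if "I AGREE" in reply.upper():
            break
    return transcript
```

Deciding what counts as "a conclusion is reached" and what to keep when the history no longer fits in the context are exactly the parts this sketch glosses over, and where the interesting work of the project lies.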

Project 2E – Studying the impact of embodiment on LLM trustworthiness

Large Language Models have come around and overtaken a large part of the public discourse about artificial intelligence due to their uncanny ability to generate seemingly human-like text. In this project, we want to investigate how embodiment and perceived intelligence affect the acceptability of an embodied AI deployed in a higher education context.


2. One-shot projects

Project 6 – Podcastifier: using an LLM and neural TTS to turn a document into a Socratic dialogue

That’s really all there is to it — can you make use of LLMs and neural TTS to process a document (e.g., a research paper, or course notes) and turn it into a Socratic-style dialogue that will help the listener understand the document?

Project 7 – Analysing the prevalence of default hyperparameters in the machine learning literature

Hyperparameters play a crucial role in steering the performance of machine learning models. However, selecting optimal hyperparameters is often challenging. Researchers may resort to default values provided by libraries, potentially impacting the replicability and true potential of published results. This project aims to develop an NLP-based framework to systematically analyse a corpus of machine learning research papers, measuring the reliance on default hyperparameters and investigating their potential impact on research findings.
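As a very rough idea of what the first step could look like, here is a minimal sketch that pulls hyperparameter mentions out of a paper's text with a regular expression and flags the ones equal to a known default. The table of defaults, the pattern, and the function name are assumptions made for illustration; a real study would need defaults per library and version, and something far more robust than a regex.

```python
import math
import re

# Illustrative defaults only; a real study would take these from library docs.
ASSUMED_DEFAULTS = {"learning rate": 0.001, "dropout": 0.5}

# Matches e.g. "learning rate of 1e-3" or "dropout = 0.3" (hypothetical pattern).
PATTERN = re.compile(
    r"(learning rate|dropout)\s*(?:of|=|is|was set to)?\s*"
    r"([0-9]*\.?[0-9]+(?:e-?[0-9]+)?)",
    re.IGNORECASE,
)


def find_default_usage(text):
    """Return (hyperparameter, value, is_default) for each mention found."""
    hits = []
    for name, raw in PATTERN.findall(text):
        key = name.lower()
        value = float(raw)
        is_default = math.isclose(value, ASSUMED_DEFAULTS[key])
        hits.append((key, value, is_default))
    return hits
```

For example, calling find_default_usage on "We use Adam with a learning rate of 1e-3 and dropout = 0.3." flags the learning rate as matching the assumed default and the dropout as not matching.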

Project 8 – Where did that model come from? Dissecting the DNA of machine learning models from conference papers

Machine learning research often progresses through iterations, modifications, and combinations of existing models. However, tracking the lineage and evolution of model architectures can be challenging. This project aims to develop an NLP-based framework to analyse machine learning papers, automating the identification of a model’s origins and classifying whether it represents a novel architecture, a modification, or a combination of existing techniques.

Project 9 – FashionBot

The goal of this project is to design and build a bot, trained on social media data, that can make use of computer vision to judge the outfit of a person standing in front of them and use a local large language model to produce a judgment of their fashion sense. All credit to my PhD student Giovanni Schiazza for coming up with the idea.

Project 10 – A tool for scientometric analysis of research groups

Researchers frequently need to collaborate with each other in order to accomplish significant projects. This can be a daunting process due to people’s changing research interests, goals, and responsibilities, and the fact that we are often dispersed across multiple places (different universities, different countries, etc). The goal of this project is to make use of data science and scientometric techniques in order to produce a tool which allows researchers and members of the public to understand the research dynamics of a lab, a school, or an arbitrary list of researchers. The tool will do so by analysing collaboration patterns and publication trends, and by providing a way to match a new researcher to potential colleagues based on their stated interests and publication lists.
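To illustrate two of the building blocks, here is a minimal sketch that counts co-authorship pairs from a list of papers and ranks potential colleagues by keyword overlap. The input formats, the function names, and the use of Jaccard similarity are assumptions made for illustration; the actual tool would work from real bibliographic data and likely richer text representations.

```python
from collections import Counter
from itertools import combinations


def collaboration_counts(papers):
    """Count how often each pair of authors appears on the same paper.

    papers: list of author-name lists (hypothetical input format).
    """
    pairs = Counter()
    for authors in papers:
        # Sort so that (a, b) and (b, a) are counted as the same pair.
        for a, b in combinations(sorted(set(authors)), 2):
            pairs[(a, b)] += 1
    return pairs


def suggest_colleagues(interests, profiles, top_k=3):
    """Rank researchers by Jaccard overlap between keyword sets.

    profiles: dict mapping a researcher's name to their keywords.
    Researchers with no overlap at all are left out.
    """
    wanted = {i.lower() for i in interests}
    scored = []
    for name, keywords in profiles.items():
        have = {k.lower() for k in keywords}
        union = wanted | have
        score = len(wanted & have) / len(union) if union else 0.0
        scored.append((score, name))
    # Highest score first; ties broken alphabetically by name.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [name for score, name in scored[:top_k] if score > 0]
```

The pair counts are effectively the edge weights of a co-authorship graph, which is the natural starting point for the collaboration-pattern analysis.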

Project 11 – A transcription tool for qualitative researchers

Regulations on privacy force qualitative researchers as well as therapists to rely on transcribers to turn their recorded interviews into usable text ready to be coded, because the automated solutions are not GDPR-compliant. In this project you will build an application that uses an offline voice-to-text model such as DeepSpeech to help those researchers quickly transcribe their interviews. The system will allow for (1) the import of a model, (2) its fine-tuning to the researcher’s voice by having them read a specified text aloud, and (3) a review mode to fix the model’s mistakes.

Project 12 – An AI tool to analyse interviews with LLMs

Thematic analysis is a foundational method in qualitative research, empowering researchers to identify patterns and derive insights from textual data. However, the process is often time-consuming and labour-intensive. Large Language Models offer the potential to streamline and augment thematic analysis by automating certain tasks and offering novel perspectives. This project aims to explore the integration of LLMs into the thematic analysis workflow of qualitative interview data. You will need to build an application that supports thematic analysis and uses LLMs to provide advice during the process, acting as an external annotator.

Project 13 – A browser-embedded LLM to summarise web pages

The goal of this project is to embed a large language model (LLM) into a web browser. The LLM will analyse web page content on the fly, generating accurate and informative summaries at different levels of understanding (laypeople, technical, etc.) for on-demand comprehension.


3. Open lines of investigation

Open Project 1 – SemEval challenge

For this type of project, you will need to investigate https://semeval.github.io and pick a task to solve. You will need to write and submit to me a concept paper to show that you have thought this through.

Open Project 2 – Data Science for Social Good challenge

For this type of project, you will need to come up with a data science application that can help make the world a better place in a scientifically interesting way. I recommend browsing https://www.solveforgood.org as a good starting point, but occasionally I may be posting datasets from other research groups that need to be analysed. You will need to write and submit to me a concept paper to show that you have thought this through.

Open Project 3 – Code-mixed text classification in the languages of Malaysia

Code-mixing is the mixing of vocabulary and syntax of multiple languages in the same sentence. It is a common practice in multi-cultural countries, and there is relatively little research on the building of effective language and classification models in a code-mixed context. I am interested in the investigation of code-mixed text classification in a Malaysian context, where widely different languages (such as Malay, Cantonese, Mandarin, English) can sometimes be mixed within the same sentence. One core difficulty of this project will be the procurement of realistic data, and defining clear research questions. You will need to write and submit to me a concept paper to show that you have thought this through.