With 400,000 new cases per year and a high recurrence rate, bladder cancer is one of the most costly cancers for society. VitaDX is a startup that develops a software solution for early and non-invasive diagnosis of bladder cancer (one urine sample is all it takes). This diagnostic test is implemented using machine learning algorithms that analyze urinary cytology whole slide images (digitized microscope slides). These algorithms are currently being developed as part of a clinical trial conducted by VitaDX with ten hospitals in France.

Machine learning algorithms deployed in real-world situations often perform worse
than they did during R&D. This can be due to a combination of factors: the training
dataset is not representative of all possible cases, the model has been overfitted to
the evaluation dataset, the model is under-specified by the available data, or the
real-world data has a slightly different distribution from the training data. In
medical applications, a drop in performance can have dire consequences and should be
mitigated as much as possible. The goal of this internship is to tackle part of this
problem: how can we detect whether a new example is an anomaly, and whether a
prediction should be made on it?
Anomaly (or outlier) detection is the identification of examples that do not fit the
distribution of the majority of the data. In our case, this would allow us to identify
samples that were prepared using a different protocol, for which the output of the
diagnosis algorithm would not be meaningful. There is a large body of literature on
the subject, ranging from unsupervised methods to fully supervised ones.
Your tasks will be to:
- Compile a complete bibliography of the subject
- Choose and implement state-of-the-art methods on toy data
- Test the chosen methods on VitaDX's real-world problem
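
To give an idea of what the "toy data" step could look like, here is a minimal sketch of an unsupervised detector. It scores each point by its mean distance to its k nearest neighbours in a reference set. This is only a simple baseline chosen for illustration, not the method used by VitaDX. The function names, the 2-D Gaussian data and the 99% quantile threshold are all assumptions. Real slides would first have to be turned into feature vectors.

```python
import math
import random


def knn_score(point, reference, k=5):
    """Mean Euclidean distance from `point` to its k nearest reference points."""
    dists = sorted(math.dist(point, ref) for ref in reference)
    return sum(dists[:k]) / k


def fit_threshold(reference, k=5, quantile=0.99):
    """Choose a score threshold from leave-one-out scores of the reference set.

    The quantile is an arbitrary illustrative choice: about 1% of the
    reference points would themselves be flagged as anomalies.
    """
    scores = []
    for i, p in enumerate(reference):
        others = reference[:i] + reference[i + 1:]
        scores.append(knn_score(p, others, k))
    scores.sort()
    idx = min(len(scores) - 1, int(quantile * len(scores)))
    return scores[idx]


def is_anomaly(point, reference, threshold, k=5):
    """Flag a point whose k-NN score exceeds the fitted threshold."""
    return knn_score(point, reference, k) > threshold


random.seed(0)
# Toy "in-distribution" data: a 2-D Gaussian blob centred on the origin.
reference = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(300)]
threshold = fit_threshold(reference)

typical = (0.1, -0.2)  # close to the bulk of the reference data
shifted = (8.0, 8.0)   # far from it, e.g. a sample from another protocol

print("typical flagged:", is_anomaly(typical, reference, threshold))
print("shifted flagged:", is_anomaly(shifted, reference, threshold))
```

In a setting like the one described above, a flagged sample would be one on which no prediction is made. Choosing the threshold then becomes a trade-off between rejecting valid samples and accepting out-of-distribution ones.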

Ideal candidate description
- Good Python coding skills.
- Familiar with standard Python tools and libraries (conda, numpy, scikit-learn, …).
- Good knowledge of classical Machine Learning and Deep Learning algorithms.
- Highly motivated and autonomous.
- Interested in the medical field.
- Knowledge of Deep Learning frameworks such as TensorFlow or PyTorch is a plus.

You will have access to all the slides generated by the protocol team, covering several
biological protocols and scanners.
You will be able to run your computations either on your own machine with a GPU
or on our local computing cluster, which includes several compute nodes equipped with
recent Nvidia GPUs.

Education level: Bac+4/5 or higher (four or five years of higher education, or more)
