Thanks Heather,

On 14/05/2024 15:50, Carleton, Heather (CDC/NCEZID/DFWED/EDLB) via Bioinfo List wrote:
If you are interested in enterics, we have some of this concordance data on https://wwwn.cdc.gov/narmsnow/
There is also the AMRFinder validation paper from NCBI that does phenotype to genotype correlations https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6811410/

Great, will definitely look into those!

For false negatives, do you mean phenotypically resistant but AMR gene or mutation not identified?

Yes. The issue I see with the machine-learning papers is that even when they split the data into a training set (the data the model is allowed to see and learns to recognise; usually 80% of the available data) and a validation set (the remaining 20%, which the model has never seen and must predict, as a check on its validity), the nature of the data we're dealing with means that a high validation rate is (a) almost inevitable and (b) not helpful for assessing generalisability.

What I mean is this: if I train a deep neural net on 80% of the ResFinder database, while keeping 20% (randomly selected) away from it, then, due to orthology, I suspect it will classify most of the unseen 20% correctly, simply because it has seen a sufficiently similar sequence in the training set.

It's cool that we can train a model to learn this, of course, but we could already do that the classical way: just BLAST the unknown sequence against the known 80%. Effectively, we've trained a model to measure sequence similarity.
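The "BLAST against the known 80%" baseline can be made concrete with a toy sketch. Everything here is made up for illustration (the family counts, the simulated sequences, and the positional `identity` stand-in for a real alignment); it is not ResFinder data:

```python
import random

random.seed(0)
BASES = "ACGT"

def mutate(seq, n_muts):
    """Copy seq with n_muts random point substitutions."""
    s = list(seq)
    for i in random.sample(range(len(s)), n_muts):
        s[i] = random.choice(BASES)
    return "".join(s)

def identity(a, b):
    """Crude positional identity for equal-length toy sequences;
    a real pipeline would use BLAST or a proper alignment."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Toy "database": 10 gene families, 20 near-identical variants each
# (a stand-in for orthologues in a resistance-gene database).
families = ["".join(random.choices(BASES, k=200)) for _ in range(10)]
db = [(mutate(f, 5), label) for label, f in enumerate(families) for _ in range(20)]

random.shuffle(db)
cut = int(0.8 * len(db))               # the usual random 80/20 split
train, held_out = db[:cut], db[cut:]

# "Model": predict the label of the most similar training sequence.
correct = 0
for seq, label in held_out:
    _, pred = max((identity(seq, t), lab) for t, lab in train)
    correct += (pred == label)

print(f"nearest-neighbour accuracy on the held-out 20%: {correct / len(held_out):.2f}")
```

Because near-duplicates land on both sides of a random split, this trivial similarity lookup already scores close to perfectly, which is the ceiling any DNN's validation accuracy should be compared against.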

I'm being a bit unfair here: what is remarkable is that even if you thin out the data so that no two sequences share more than 90% mutual similarity (as in the ARGnet paper), the models still attain a high validation rate. Presumably DNNs somehow account for rearrangements that we don't capture with similarity metrics. They are especially strong at modelling non-linear relationships and detecting structure in high-dimensional spaces, typically the problem area for GWAS. Maybe individual resistance genes aren't quite the right "problem level" for DNNs.
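A similarity-thinned evaluation along those lines can be sketched by clustering at 90% identity first and then splitting whole clusters rather than individual sequences. This is a crude, CD-HIT-flavoured greedy pass over hypothetical toy data, not ARGnet's actual pipeline:

```python
import random

random.seed(1)
BASES = "ACGT"

def mutate(seq, n_muts):
    """Copy seq with n_muts random point substitutions."""
    s = list(seq)
    for i in random.sample(range(len(s)), n_muts):
        s[i] = random.choice(BASES)
    return "".join(s)

def identity(a, b):
    """Crude positional identity for equal-length toy sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Toy database: 10 families of 20 near-identical variants.
families = ["".join(random.choices(BASES, k=200)) for _ in range(10)]
db = [mutate(f, 5) for f in families for _ in range(20)]

def greedy_cluster(seqs, threshold=0.90):
    """CD-HIT-style greedy pass: each sequence joins the first cluster
    whose representative it matches at >= threshold identity,
    otherwise it founds a new cluster."""
    reps, clusters = [], []
    for s in seqs:
        for i, r in enumerate(reps):
            if identity(s, r) >= threshold:
                clusters[i].append(s)
                break
        else:
            reps.append(s)
            clusters.append([s])
    return clusters

clusters = greedy_cluster(db)
random.shuffle(clusters)

# Split whole clusters, so no held-out sequence has a >=90% hit in training.
cut = int(0.8 * len(clusters))
train = [s for c in clusters[:cut] for s in c]
held_out = [s for c in clusters[cut:] for s in c]

print(f"{len(clusters)} clusters -> {len(train)} train / {len(held_out)} held out")
```

Under such a split, a plain similarity search against the training set no longer works, so whatever validation accuracy remains is the part that sequence similarity alone can't explain.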

Generalisability is a separate issue: no matter how well a model predicts the validation set, what we are really interested in is predicting _novel_ resistance. But how do we know the model is correct? This is biology: we first need to see it happen in real life.

That's where the false negatives come in! Isolates that we know are resistant, but that have no matches in the known ARG databases. Those will be interesting cases to challenge deep learning models with.

Cheers
Marco