Thanks Heather,
On 14/05/2024 15:50, Carleton, Heather
(CDC/NCEZID/DFWED/EDLB) via Bioinfo List wrote:
> If you are interested in enterics, we have some of this concordance data on https://wwwn.cdc.gov/narmsnow/
> There is also the AMRFinder validation paper from NCBI that does phenotype-to-genotype correlations: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6811410/
Great, will definitely look into those!
> For false negatives, do you mean phenotypically resistant but AMR gene or mutation not identified?
Yes. The issue I see with the machine learning papers is that even
if they split the data into a training set (the data that the model
is allowed to see and learns to recognise; usually 80% of the
available data) and a validation set (the remaining 20%, which the
model has never seen and is asked to predict in order to assess its
validity), the nature of the data we're dealing with is such that
obtaining a high validation rate is (a) almost inevitable and (b)
not helpful for assessing generalisability.
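Just to make the setup concrete, this is all I mean by the split - a plain random 80/20 holdout (the `train_val_split` helper and the toy `gene_i` records are mine, purely for illustration):

```python
import random

def train_val_split(records, val_frac=0.2, seed=0):
    """Randomly hold out a fraction of records as a validation set."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    return shuffled[n_val:], shuffled[:n_val]  # (train, validation)

seqs = [f"gene_{i}" for i in range(100)]
train, val = train_val_split(seqs)
print(len(train), len(val))  # 80 20
```

The point is that the split is random over *sequences*, so near-identical orthologues routinely end up on both sides of it.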
What I mean is this: if I train a deep neural net on 80% of the
ResFinder database, while keeping 20% (randomly selected) away from
it, then - due to orthology - I suspect it will classify most of the
unseen 20% correctly, simply because it had seen a sufficiently
similar sequence in the training set.
It's cool that we can train a model to learn this, of course, but we
could already do that the classical way: just BLAST the unknown
sequence against the known 80%. Effectively, we've trained a model
to measure sequence similarity.
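That baseline is easy to make concrete. Here's a minimal nearest-neighbour sketch using k-mer Jaccard similarity as a rough stand-in for a BLAST hit (the sequences and labels below are toy examples, not real ARG entries):

```python
def kmers(seq, k=8):
    """All overlapping k-mers of a sequence, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two k-mer sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def classify_by_similarity(query, reference):
    """reference: dict of sequence -> label (the 'known 80%').
    Return the label of the most similar reference sequence."""
    q = kmers(query)
    best = max(reference, key=lambda s: jaccard(q, kmers(s)))
    return reference[best]

reference = {
    "ATGCGTACGTTAGCCGATCGATTGCA": "gene_A",
    "CCTTGGAACCTTGGAACCTTGGAACC": "gene_B",
}
query = "ATGCGTACGTTAGCCGATCGATTGCC"  # gene_A with one substitution
print(classify_by_similarity(query, reference))  # gene_A
```

A DNN that ends up scoring no better than this kind of lookup hasn't learned anything BLAST couldn't already tell us.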
I'm being a bit unfair here: what is remarkable is that
even if you thin out the data so that no sequences share more
than 90% mutual similarity (as in the ARGnet paper), the models
still attain a high validation rate. Presumably DNNs somehow
account for rearrangements that we don't capture with
similarity metrics. They are especially strong at predicting
non-linear relationships and detecting structure in
high-dimensional spaces - typically the problem area for GWAS.
Maybe individual resistance genes aren't quite the right "problem
level" for DNNs.
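For completeness, the thinning step can be sketched as greedy clustering. Caveat: I'm using k-mer Jaccard as a crude proxy for percent identity here; a real pipeline (and presumably the ARGnet authors) would use alignment-based identity, e.g. CD-HIT:

```python
def kmer_set(seq, k=8):
    """All overlapping k-mers of a sequence, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similarity(a, b):
    """Rough proxy for sequence identity: k-mer Jaccard similarity."""
    ka, kb = kmer_set(a), kmer_set(b)
    return len(ka & kb) / len(ka | kb) if ka | kb else 0.0

def thin(seqs, threshold=0.9):
    """Greedily keep a sequence only if it is below the similarity
    threshold against everything already kept (CD-HIT-style)."""
    kept = []
    for s in seqs:
        if all(similarity(s, t) < threshold for t in kept):
            kept.append(s)
    return kept

seqs = [
    "ATGCGTACGTTAGCCGATCGAT",
    "ATGCGTACGTTAGCCGATCGAT",  # exact duplicate - dropped
    "CCTTGGAACCTTGGAACCTTGG",
]
print(len(thin(seqs)))  # 2
```

Even after thinning like this, the training and validation sets can still share deeper structure (domains, rearranged fragments) that pairwise similarity doesn't see - which may be exactly what the DNNs exploit.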
Generalisability is a separate issue: no matter how well a model
predicts the validation set, what we are interested in is predicting
_novel_ resistance. But how do we know the model is correct? This is
biology: we first need to see it happen in real life.
That's where the false negatives come in! Isolates that we know are
resistant, but that have no matches in the known ARG databases.
Those will be interesting cases to challenge deep learning models
with.
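Pulling out that challenge set is just a filter over linked phenotype/genotype records - something like the following (the record layout and field names `phenotype` and `arg_hits` are made up for the sketch):

```python
def false_negative_candidates(isolates):
    """isolates: list of dicts with 'id', 'phenotype' ('R' or 'S'),
    and 'arg_hits' (list of ARG-database matches). Return the
    isolates that are phenotypically resistant yet have no hit -
    the interesting cases for challenging a model."""
    return [iso for iso in isolates
            if iso["phenotype"] == "R" and not iso["arg_hits"]]

isolates = [
    {"id": "iso1", "phenotype": "R", "arg_hits": ["blaTEM-1"]},
    {"id": "iso2", "phenotype": "R", "arg_hits": []},  # candidate
    {"id": "iso3", "phenotype": "S", "arg_hits": []},
]
print([iso["id"] for iso in false_negative_candidates(isolates)])  # ['iso2']
```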
Cheers
Marco