
Add Journal Article -> main subject -> MeSH Terms #32

Open
Daniel-Mietchen opened this issue Jun 12, 2017 · 11 comments

@Daniel-Mietchen

I'm looking for ways to get the following workflow going:

  • Check Wikidata items for cases where there is a PMID (P698) but no "main subject" (P921)
  • When such an item is found (let's call it QA here):
    • get the MeSH terms associated with that PMID and for each of them
      • look up the Wikidata item corresponding to the MeSH term (via P486); let's call it QB
      • set P921 of QA to QB
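
The steps above could be sketched roughly as follows. This is only an illustration under assumptions: the SPARQL query is one possible way to find candidate items on the Wikidata Query Service, `plan_statements` and its two input dictionaries are hypothetical names, and fetching the MeSH terms for each PMID is left out entirely.

```python
# Step 1 as SPARQL: items with a PMID (P698) but no "main subject" (P921).
# Assumed to be run against the Wikidata Query Service; the LIMIT is arbitrary.
CANDIDATE_QUERY = """
SELECT ?item ?pmid WHERE {
  ?item wdt:P698 ?pmid .
  FILTER NOT EXISTS { ?item wdt:P921 ?subject . }
}
LIMIT 100
"""


def plan_statements(item_to_mesh_ids, mesh_id_to_qid):
    """Turn MeSH IDs into planned (QA, "P921", QB) statements.

    item_to_mesh_ids: {QA: [MeSH IDs found for that item's PMID]}
    mesh_id_to_qid:   {MeSH ID: QB}, i.e. the result of the P486 lookup
    MeSH IDs without a matching Wikidata item are returned separately
    so they can be reviewed instead of being silently dropped.
    """
    planned, unmatched = [], []
    for qa, mesh_ids in item_to_mesh_ids.items():
        for mesh_id in mesh_ids:
            qb = mesh_id_to_qid.get(mesh_id)
            if qb is None:
                unmatched.append((qa, mesh_id))
            else:
                planned.append((qa, "P921", qb))
    return planned, unmatched


# "QA", "QB" and "D000000" are placeholders, not real identifiers.
planned, unmatched = plan_statements(
    {"QA": ["D003067", "D000000"]},
    {"D003067": "QB"},
)
```

Nothing here writes to Wikidata; the output is just a list of statements that a bot could then add.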

Would this be something that could be fit into GeneWiki workflows?

@stuppie
Collaborator

stuppie commented Jun 12, 2017

Hi @Daniel-Mietchen. Yes, this fits into our workflow and could be added to WikidataIntegrator. It would be easy to implement provided the Wikidata item corresponding to the MeSH term already exists. Ideally, at some point we would import MeSH into Wikidata (or at least some parts of it). I imagine this would require a lot of work matching up items to prevent duplication though...

Looking at the API response from EuropePMC (and from PubMed), the MeSH terms don't actually include the MeSH ID (of course, why would they?), but only the name. So either way, we would have to parse the MeSH dump file to get the MeSH ID from the name, unless you know of a better API?

We could also look into adding the items from the "chemicalList" field.

@stuppie changed the title from 'MeSH-to-"main subject" pipeline' to 'Add Journal Article -> main subject -> MeSH Terms' on Jun 12, 2017

@Daniel-Mietchen
Author

Yes, the ~7K MeSH terms that are in Wikidata at the moment (which don't seem to be differentiated into Descriptors and Concepts) are only about 1/4 of the total number of MeSH Descriptors, but they are perhaps a good starting point for testing such workflows. For the rest, we would probably need to think about using Mix'n Match.

In terms of getting the ID, I just left the following message with the NLM Helpdesk (ticket #28045-219557):

I am interested in getting the MeSH Descriptor IDs associated with a PubMed ID via an API call.

I know how to get the MeSH Descriptor name from a PMID via API, e.g. "Coenzymes" for https://www.ncbi.nlm.nih.gov/pubmed/16046484 via
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=16046484&tool=my_tool&email=my_email@example.com .

I am also aware that I can use regexes in SPARQL to try to convert that string into the Descriptor ID, e.g. https://id.nlm.nih.gov/mesh/query?query=PREFIX+rdf%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23%3E%0D%0APREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0D%0APREFIX+xsd%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2001%2FXMLSchema%23%3E%0D%0APREFIX+owl%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2002%2F07%2Fowl%23%3E%0D%0APREFIX+meshv%3A+%3Chttp%3A%2F%2Fid.nlm.nih.gov%2Fmesh%2Fvocab%23%3E%0D%0APREFIX+mesh%3A+%3Chttp%3A%2F%2Fid.nlm.nih.gov%2Fmesh%2F%3E%0D%0APREFIX+mesh2015%3A+%3Chttp%3A%2F%2Fid.nlm.nih.gov%2Fmesh%2F2015%2F%3E%0D%0APREFIX+mesh2016%3A+%3Chttp%3A%2F%2Fid.nlm.nih.gov%2Fmesh%2F2016%2F%3E%0D%0APREFIX+mesh2017%3A+%3Chttp%3A%2F%2Fid.nlm.nih.gov%2Fmesh%2F2017%2F%3E%0D%0A%0D%0A+SELECT+%3Fd+%3FdName+%3Fc+%3FcName%0D%0A+FROM+%3Chttp%3A%2F%2Fid.nlm.nih.gov%2Fmesh%3E%0D%0A+%0D%0A+WHERE+%7B%0D%0A+++%0D%0A+%3Fd+a+meshv%3ADescriptor+.%0D%0A+%3Fd+meshv%3Aactive+1+.%0D%0A+%3Fd+meshv%3Aconcept+%3Fc+.%0D%0A+%3Fd+rdfs%3Alabel+%3FdName+.%0D%0A+%3Fc+rdfs%3Alabel+%3FcName%0D%0A+FILTER%28REGEX%28%3FdName%2C%22Coenzymes%22%2C%22i%22%29+%7C%7C+REGEX%28%3FcName%2C%22Coenzymes%22%2C%22i%22%29%29%0D%0A+%0D%0A+%7D%0D%0A+ORDER+BY+%3Fd%0D%0A%0D%0A&format=HTML&inference=true&year=current&limit=50&offset=0#lodestart-sparql-results for "Coenzymes", which gives https://id.nlm.nih.gov/mesh/D003067.html .

I do not see a direct way to convert from a PMID to the associated Descriptor IDs, though, and would appreciate pointers.

@andrewsu
Member

A few notes from me.

  • It looks like the XML file will go directly from a PMID to a MeSH ID (eg https://www.ncbi.nlm.nih.gov/pubmed/16046484?report=xml&format=text).

  • The XML also contains an attribute for "MajorTopicYN". For P921 "main subject" you might want to consider restricting to records with a value of "Y".

  • I think a bot to systematically load MeSH (or parts of it) will be part of the renewal proposal. Of course, some manual review will definitely be necessary...
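
To illustrate the first two notes, here is a minimal sketch of reading MeSH IDs and the "MajorTopicYN" flag out of such XML with Python's standard library. It assumes the XML has already been fetched and is well-formed; the fragment is taken from the record for PMID 26611529 discussed further down in this thread (shortened to two qualifiers), and `mesh_headings` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Fragment in the shape of a PubMed record's MeSH section.
SAMPLE = """<MeshHeadingList>
  <MeshHeading>
    <DescriptorName UI="D043524" MajorTopicYN="Y">Peptide-N4-(N-acetyl-beta-glucosaminyl) Asparagine Amidase</DescriptorName>
    <QualifierName UI="Q000737" MajorTopicYN="N">chemistry</QualifierName>
    <QualifierName UI="Q000235" MajorTopicYN="N">genetics</QualifierName>
  </MeshHeading>
</MeshHeadingList>"""


def mesh_headings(xml_text):
    """Return one dict per MeshHeading with descriptor and qualifier details."""
    root = ET.fromstring(xml_text)
    headings = []
    for heading in root.iter("MeshHeading"):
        descriptor = heading.find("DescriptorName")
        if descriptor is None:
            continue  # skip malformed headings without a descriptor
        headings.append({
            "id": descriptor.get("UI"),
            "name": descriptor.text,
            "major": descriptor.get("MajorTopicYN") == "Y",
            "qualifiers": [
                {
                    "id": q.get("UI"),
                    "name": q.text,
                    "major": q.get("MajorTopicYN") == "Y",
                }
                for q in heading.findall("QualifierName")
            ],
        })
    return headings


headings = mesh_headings(SAMPLE)
```

Because `root.iter` walks the whole tree, the same function should also work on a full record as long as the element names match.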

@goodb

goodb commented Jun 13, 2017

You may find this service useful here. SPARQL over MeSH graph. https://id.nlm.nih.gov/mesh/

@Daniel-Mietchen
Author

@andrewsu Thanks for the pointer to the XML.

I'm not sure yet what the "MajorTopicYN" switch does; in the example, it seems to be used only for qualifiers, while we would likely want to go mainly for descriptors.

@andrewsu
Member

@Daniel-Mietchen definitely used for descriptors too, eg https://www.ncbi.nlm.nih.gov/pubmed/26611529?report=xml&format=text

@Daniel-Mietchen
Author

Daniel-Mietchen commented Jun 14, 2017

Yes, for other PMIDs, it is used for descriptors as well, but my point from above was that if the goal is to turn MeSH IDs into P921 statements, we probably cannot rely solely on those descriptors for which "MajorTopicYN" is "Y", since that would give nothing for the PMID=16046484 example.

If we were to include those descriptors for which at least one qualifier's "MajorTopicYN" is set to "Y", then we would get

For PMID=26611529, we would get

I have made those sample edits to
https://www.wikidata.org/w/index.php?title=Q30247623&oldid=500642437#P921
and
https://www.wikidata.org/w/index.php?title=Q30274598&oldid=500644169#P921 .

Is there a way to know whether there are PMIDs for which no "MajorTopicYN" value would be "Y"?

@stuppie
Collaborator

stuppie commented Jun 14, 2017

I think we might as well include the qualifiers as well, no? Maybe with the "use" qualifier?

@Daniel-Mietchen
Author

Daniel-Mietchen commented Jun 14, 2017

Yes, in principle, we'd like to make use of the information provided by the qualifiers as well, but I'm not sure P366 is the way to go.

To go back to https://www.ncbi.nlm.nih.gov/pubmed/26611529?report=xml&format=text , we have, for instance

<MeshHeading>
    <DescriptorName UI="D043524" MajorTopicYN="Y">Peptide-N4-(N-acetyl-beta-glucosaminyl) Asparagine Amidase</DescriptorName>
    <QualifierName UI="Q000737" MajorTopicYN="N">chemistry</QualifierName>
    <QualifierName UI="Q000235" MajorTopicYN="N">genetics</QualifierName>
    <QualifierName UI="Q000378" MajorTopicYN="N">metabolism</QualifierName>
</MeshHeading>

I understand this to mean that the paper is about

  • the chemistry of Peptide-N4-(N-acetyl-beta-glucosaminyl) Asparagine Amidase
  • the genetics of Peptide-N4-(N-acetyl-beta-glucosaminyl) Asparagine Amidase
  • the metabolism of Peptide-N4-(N-acetyl-beta-glucosaminyl) Asparagine Amidase

Unless we have dedicated items for such things, I could imagine translating this into statements like

  • P921 (main subject): chemistry (Q2329), with a qualifier
    • of (P642): Peptide-N4-(N-acetyl-beta-glucosaminyl) Asparagine Amidase (Q7166515)

I have made such a test edit to a sandbox item: https://www.wikidata.org/w/index.php?title=Q13406268&oldid=500822215#P921

Other issues:

  • the number of statements derived from MeSH terms might be a concern — 26611529 has 14 Descriptors, which is close to the upper end of what I think is reasonable in terms of P921, and adding in the qualifiers means going well beyond 20, which I think is adding more noise than signal
  • the hierarchy of the Descriptors, as seen through their Wikidata graph. For instance, "human" is on a subtree of "mammal", so perhaps we don't need to add "mammal" here unless it's MajorTopicYN="Y"?
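
The second point could be handled by pruning descriptors that are already implied by a more specific one. The following is a minimal sketch under assumptions: `parents` is a hypothetical, toy child-to-parents map (in practice it would have to be derived from the Wikidata graph), and `drop_ancestors` is an illustrative name, not an existing function.

```python
def drop_ancestors(descriptors, parents, major=frozenset()):
    """Drop descriptors that are ancestors of another descriptor in the list.

    descriptors: list of descriptor identifiers for one article
    parents:     {child: [direct parents]} describing the hierarchy
    major:       descriptors to keep regardless (e.g. MajorTopicYN="Y")
    """
    def ancestors(node):
        seen = set()
        stack = list(parents.get(node, ()))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(parents.get(current, ()))
        seen.discard(node)  # guard against cycles in the input data
        return seen

    implied = set()
    for descriptor in descriptors:
        implied |= ancestors(descriptor)
    return [d for d in descriptors if d not in implied or d in major]


# Toy hierarchy for illustration only.
toy_parents = {"human": ["primate"], "primate": ["mammal"]}
pruned = drop_ancestors(["human", "mammal"], toy_parents)
kept_major = drop_ancestors(["human", "mammal"], toy_parents, major={"mammal"})
```

With this approach "mammal" is dropped when "human" is present, unless it is explicitly marked as a major topic.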

@andrewsu
Member

Yeah, have to admit I'm not a fan of that sandbox model and the use of "of". Don't have an alternate suggestion, but just sharing my gut reaction.

I'd argue that you'd only want to use P921 "main subject" if MajorTopicYN="Y". That would also cut down on the number of MeSH term statements and eliminate less informative/specific ones eg "mammal"...
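
The two selection rules discussed in this thread (descriptor-level "Y" only, versus also accepting descriptors with at least one "Y" qualifier) can be compared with a small sketch. The dictionary layout and the name `select_descriptors` are assumptions made for illustration, and the second heading is made-up data with a placeholder ID, not a real MeSH record.

```python
def select_descriptors(headings, include_major_qualifiers=False):
    """Return descriptor IDs that qualify as P921 "main subject" candidates.

    By default only descriptors with MajorTopicYN="Y" are kept. With
    include_major_qualifiers=True, a descriptor is also kept when at
    least one of its qualifiers has MajorTopicYN="Y".
    """
    selected = []
    for heading in headings:
        is_major = heading["major"]
        if include_major_qualifiers and not is_major:
            is_major = any(q["major"] for q in heading.get("qualifiers", []))
        if is_major:
            selected.append(heading["id"])
    return selected


example_headings = [
    # From the XML quoted above: descriptor is "Y", qualifiers are "N".
    {"id": "D043524", "major": True,
     "qualifiers": [{"id": "Q000737", "major": False}]},
    # Hypothetical heading: descriptor is "N", one qualifier is "Y".
    {"id": "D_PLACEHOLDER", "major": False,
     "qualifiers": [{"id": "Q_PLACEHOLDER", "major": True}]},
]

strict = select_descriptors(example_headings)
lenient = select_descriptors(example_headings, include_major_qualifiers=True)
```

The strict rule yields fewer statements; the lenient one avoids ending up with nothing for records where only qualifiers carry the flag.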

@Daniel-Mietchen
Author

Meanwhile, all MeSH IDs are in Mix'n Match, and about 14k have been matched.


4 participants