Researchers often need to search the literature extensively. Sometimes they also want to collect information about the retrieved papers and classify them into different categories. Doing this manually is inconvenient and inefficient, but we can automate the process with APIs. Here, I introduce this method with examples.

Basically, an API is an interface that gives developers programmatic access to information. We can use an API to run the same search, with the same keywords and parameters, that we would run on the webpage, and the API returns the results in XML format. To do this, we need two Python packages: Requests to access the APIs and Beautiful Soup to process the XML.

# install requests, bs4 (Beautiful Soup) and lxml (the parser Beautiful Soup uses for XML)
conda install requests beautifulsoup4 lxml

# if you use pip
pip install requests beautifulsoup4 lxml
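
Once the packages are installed, the general workflow is always the same: send a GET request with a dictionary of parameters, then parse the XML in the response. Below is a minimal sketch of that pattern; the endpoint URL and the Record tag are made up purely for illustration.

import requests
from bs4 import BeautifulSoup

# hypothetical endpoint and parameter, just to show the pattern
response = requests.get('https://example.org/api', params = {'term': 'deep learning'})
# the response body is XML text; hand it to Beautiful Soup
soup = BeautifulSoup(response.text, 'xml')
# pull out elements by tag name (the <Record> tag here is made up)
for record in soup.find_all('Record'):
  print(record.text)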

PubMed API

First, let’s explore the PubMed API. Its detailed documentation can be found here. This time, I want to talk about the usage of two PubMed APIs: esearch.fcgi and efetch.fcgi. esearch.fcgi responds to a text query with a list of literature IDs, while efetch.fcgi responds to a list of IDs in a given database with the corresponding data records.
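
In other words, the two requests we are going to build look roughly like the URLs below (the query and the IDs are made up for illustration; the parameters are the same ones we will pass from Python).

# esearch: turn a text query into a list of PubMed IDs
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=deep+learning+cancer&retmax=20

# efetch: turn a list of IDs into full records in XML
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=12345678,23456789&retmode=xml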

Step 1, use esearch.fcgi to get literature IDs.

import requests
from bs4 import BeautifulSoup

# we need to tell the API our query terms, the database (PubMed this time) and the
# maximum number of records we want back
def get_id(query, db = 'pubmed', max_record = 10000):
  # URL of the API
  search_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
  # parameters for this API
  parameters = {'term': query, 'db': db, 'retmax': max_record}
  # get the data with requests.get
  response = requests.get(search_url, params = parameters)
  # since results are in XML format, use beautiful soup to process it
  soup = BeautifulSoup(response.text, 'xml')
  # extract IDs
  id_list = list(soup.IdList.stripped_strings)

  return id_list
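
A quick sanity check: calling get_id should return a list of PubMed IDs as strings, so we can look at how many came back and at the first few entries (the query here is just an example).

id_list = get_id('deep learning cancer', max_record = 20)
# number of IDs actually returned (at most 20 here)
print(len(id_list))
# the first few PubMed IDs
print(id_list[:3])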

Step 2, use efetch.fcgi to get the literature data for the IDs from step 1.

# to use this API, we need to provide the list of IDs, the database (PubMed again), the
# output format (XML) and the maximum number of records
def fetch_info(id_list, db = 'pubmed', retmode = 'xml', max_record = 10000):
  # similar to above
  fetch_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'
  parameters = {'db': db, 'id': ','.join(id_list), 'retmode': retmode, 'retmax': max_record}
  response = requests.get(fetch_url, params = parameters)
  soup = BeautifulSoup(response.text, 'xml')
  # each piece of information is wrapped in its own tag; inspect the structure with print(soup.prettify())

  # get articles information
  article_info = soup.find_all('PubmedArticle')
  # for each article, extract its title and publication year; you can also extract
  # other information using different tags
  for i in article_info:
    title = i.ArticleTitle.text

    try:
      pub_year = i.PubDate.Year.text
    except AttributeError:
      pub_year = 'Unknown'

    # print out the results
    print('Title: ' + title)
    print('Year: ' + pub_year)

Finally, let’s put the two functions together and, say, search for papers about deep learning and cancer.

query = 'deep learning cancer'
id_list = get_id(query)
fetch_info(id_list)
# this prints the results to the console; we can also save them into a file for
# follow-up analysis (see the sketch below)
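
One way to do that is to modify fetch_info so that it collects (title, year) pairs in a list and returns them instead of printing; the list can then be written to a CSV file. A minimal sketch under that assumption:

import csv

# assumes fetch_info was changed to return a list of (title, pub_year) tuples
records = fetch_info(id_list)

with open('pubmed_results.csv', 'w', newline = '', encoding = 'utf-8') as f:
  writer = csv.writer(f)
  writer.writerow(['Title', 'Year'])
  writer.writerows(records)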

arXiv API

The principle is exactly the same for the arXiv API. In fact, it is even simpler than the PubMed workflow, because a single API call does everything.

def fetch_info_arxiv(query, max_record = 10000):
  url = 'http://export.arxiv.org/api/query'
  parameters = {'search_query': query, 'max_results': max_record}
  response = requests.get(url, params = parameters)
  soup = BeautifulSoup(response.text, 'xml')
  article_info = soup.find_all('entry')
  # extract title, authors, and publication year for each article
  for i in article_info:
    title = i.title.text

    authors = []
    try:
      author_list = i.find_all('author')
      for t in author_list:
        author = t.text.strip()
        authors.append(author)
    except AttributeError:
      pass
    
    # the <published> date has the form 'YYYY-MM-DDThh:mm:ssZ', so the year is the part before the first '-'
    pub_date = i.published.text
    pub_year = pub_date.split('-')[0]

    # print out the results
    print('Title: ' + title)
    print('Authors: ' + ', '.join(authors))
    print('Year: ' + pub_year)

# search for deep learning
fetch_info_arxiv('deep learning')
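
As far as I know, the arXiv API also accepts field prefixes in search_query (such as ti: for title, au: for author, all: for all fields, cat: for category) and Boolean operators, so the query can be made more precise than a plain keyword string, for example:

# restrict the search to titles containing the phrase "deep learning"
fetch_info_arxiv('ti:"deep learning"')

# combine fields with Boolean operators, e.g. deep learning papers in cs.LG
fetch_info_arxiv('all:"deep learning" AND cat:cs.LG')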

In conclusion, APIs allow us to fetch a lot of information with a few lines of code, so they are very useful and easy to learn. Besides, most APIs share the same working principles. Unfortunately, not every website provides an API; bioRxiv, for example, does not. But at least we have moved forward.