Collecting information on multiple papers from PubMed and arXiv
Researchers often need to search a large body of literature. Sometimes they want to collect information about those papers and classify them into different categories. Doing this manually is obviously inconvenient and inefficient, but we can automate the process with APIs. Here, I introduce this method with examples.
Basically, an API is an interface that gives developers programmatic access to information. We can use an API to run the same search, with the same keywords and parameters, as we would on the webpage, and it returns the results in XML format. To do this, we need two Python packages: Requests to access the APIs and Beautiful Soup to process the XML.
# install requests, bs4 (Beautiful Soup) and lxml (the parser Beautiful Soup uses for XML)
conda install requests beautifulsoup4 lxml
# if you use pip
pip install requests beautifulsoup4 lxml
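Before touching any real API, here is a minimal sketch, with a made-up XML string, of how Beautiful Soup's 'xml' parser turns an XML document into objects whose tags we can access directly. The same pattern is used throughout this post.
from bs4 import BeautifulSoup

# a made-up XML snippet, only to illustrate tag access
xml = '<records><title>First paper</title><title>Second paper</title></records>'
soup = BeautifulSoup(xml, 'xml')
print([t.text for t in soup.find_all('title')])  # ['First paper', 'Second paper']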
PubMed API
At first, let's explore the PubMed API. Its detailed documentation can be found in the NCBI E-utilities documentation. This time, I want to talk about the usage of two of its endpoints: esearch.fcgi and efetch.fcgi. esearch.fcgi responds to a text query with a list of article IDs, while efetch.fcgi responds to a list of IDs in a given database with the corresponding data records.
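To get a feel for what esearch.fcgi does: the request is just an HTTP GET with the query encoded in the URL, and the response is an XML document whose IdList element holds the matching PMIDs. The sketch below uses placeholder values and only shows the part we will actually parse.
# request (parameter values are placeholders)
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=...&retmax=...
# response (abridged)
<eSearchResult>
  <Count>...</Count>
  <IdList>
    <Id>...</Id>
    <Id>...</Id>
  </IdList>
</eSearchResult>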
Step 1: use esearch.fcgi to get article IDs.
import requests
from bs4 import BeautifulSoup
# we need to tell the API our query terms, the database (PubMed this time) and the
# maximum number of records we want back
def get_id(query, db='pubmed', max_record=10000):
    # URL of the esearch endpoint
    search_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
    # parameters for this API
    parameters = {'term': query, 'db': db, 'retmax': max_record}
    # send the request with requests.get
    response = requests.get(search_url, params=parameters)
    # the results come back as XML, so use Beautiful Soup to parse them
    soup = BeautifulSoup(response.text, 'xml')
    # extract the IDs from the <IdList> element
    id_list = list(soup.IdList.stripped_strings)
    return id_list
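For example, a quick call with a small max_record returns a plain list of PMID strings:
# example call; the result is a list of PMID strings
id_list = get_id('deep learning cancer', max_record=5)
print(id_list)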
Step 2: use efetch.fcgi to get the article records for the IDs retrieved above.
# to use this API, we need to provide a list of IDs, the database (PubMed again), the
# output format (XML) and the maximum number of records
def fetch_info(id_list, db='pubmed', retmode='xml', max_record=10000):
    # similar to above
    fetch_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'
    # efetch expects the IDs as a single comma-separated string
    parameters = {'db': db, 'id': ','.join(id_list), 'retmode': retmode, 'retmax': max_record}
    response = requests.get(fetch_url, params=parameters)
    soup = BeautifulSoup(response.text, 'xml')
    # the record fields are labelled with various tags; simply check them with print(soup.text)
    # get the article records
    article_info = soup.find_all('PubmedArticle')
    # for each article, extract its title and publication year; you can also extract
    # other information using different tags
    for article in article_info:
        title = article.ArticleTitle.text
        try:
            pub_year = article.PubDate.Year.text
        except AttributeError:
            pub_year = 'Unknown'
        # print out the results
        print('Title: ' + title)
        print('Year: ' + pub_year)
Finally, let's put them together and, say, search for papers about deep learning and cancer.
query = 'deep learning cancer'
id_list = get_id(query)
fetch_info(id_list)
# this prints the results to the console; we can also save them into a file for
# follow-up analysis
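For that follow-up analysis, one option is to collect the records instead of printing them and write them out as a CSV file. The helper below is a minimal sketch; it assumes fetch_info has been adapted to return a list of (title, year) tuples rather than printing them.
import csv

# minimal sketch: write (title, year) tuples to a CSV file for later analysis
def save_records(records, path='pubmed_results.csv'):
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['title', 'year'])
        writer.writerows(records)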
arXiv API
The same principle applies to the arXiv API. It is even simpler than the PubMed one, because everything goes through a single endpoint.
def fetch_info_arxiv(query, max_record=10000):
    url = 'http://export.arxiv.org/api/query'
    parameters = {'search_query': query, 'max_results': max_record}
    response = requests.get(url, params=parameters)
    soup = BeautifulSoup(response.text, 'xml')
    # each result is an <entry> element in the returned Atom feed
    article_info = soup.find_all('entry')
    # extract the title, authors and publication year of each article
    for entry in article_info:
        title = entry.title.text
        # collect the author names; each <author> element holds one name
        authors = [author.text.strip() for author in entry.find_all('author')]
        # <published> looks like 'YYYY-MM-DDThh:mm:ssZ', so the year is the part before the first '-'
        pub_year = entry.published.text.split('-')[0]
        # print out the results
        print('Title: ' + title)
        print('Authors: ' + ', '.join(authors))
        print('Year: ' + pub_year)

# search for deep learning
fetch_info_arxiv('deep learning')
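If you need many results, the arXiv API also accepts a start parameter, so you can page through the results in smaller chunks instead of asking for everything at once. The helper below is a rough sketch of that idea; the page size, the total, and the pause between requests are my own choices, not anything prescribed above.
import time

# rough sketch: page through arXiv results with the start parameter
def fetch_arxiv_titles_paged(query, total=300, page_size=100):
    url = 'http://export.arxiv.org/api/query'
    titles = []
    for start in range(0, total, page_size):
        parameters = {'search_query': query, 'start': start, 'max_results': page_size}
        response = requests.get(url, params=parameters)
        soup = BeautifulSoup(response.text, 'xml')
        titles += [entry.title.text for entry in soup.find_all('entry')]
        time.sleep(3)  # pause between requests to be polite to the server
    return titles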
In conclusion, APIs let us fetch a large amount of information with just a few lines of code, so they are very useful and easy to learn, and most of them share the same working principles. Unfortunately, not every website offers an API; bioRxiv, for example, does not. But, at least, we have moved forward.