org.xenbase.scraper
Class BasicScraper
java.lang.Object
org.xenbase.scraper.BasicScraper
- Direct Known Subclasses:
- Scraper_CurrBio_DevCell_Cell, Scraper_DevDyn, Scraper_Development, Scraper_JCellBio, Scraper_MechDev_DevBio, Scraper_PNAS
public abstract class BasicScraper
- extends java.lang.Object
Method Summary |
byte[] |
getData(java.lang.String url)
Takes a URL as a String and returns the contents found at that URL as a
byte array. |
abstract java.lang.String |
getRedirURL(java.lang.String url)
Because the URLs come from PubMed and each journal publisher's website is
different, we must follow a series of HTTP 301 redirects and then search
the resulting page to find the URL of the full article. |
abstract ScrapedData |
scrape(java.lang.String url)
Takes the full-article URL (produced by getRedirURL) and returns the images
and captions of that article. |
Methods inherited from class java.lang.Object |
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
BasicScraper
public BasicScraper()
scrape
public abstract ScrapedData scrape(java.lang.String url)
throws java.lang.Exception,
java.lang.Error
- Takes the full-article URL (produced by getRedirURL) and returns the images
and captions of that article. This is the core of the scraper: because each
journal's pages are laid out differently, each subclass does its own string
parsing.
- Parameters:
url
- Direct URL to the full article (usually produced by getRedirURL(String url))
- Returns:
- ScrapedData - The Object containing all the images and captions
- Throws:
java.lang.Exception
java.lang.Error
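The string parsing that scrape performs can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the FigureEntry and FigureParser classes are hypothetical stand-ins (the layout of ScrapedData is not documented on this page), and the markup pattern, an img tag followed by a paragraph with class "caption", is invented rather than taken from any particular journal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for one image/caption pair.
// The real ScrapedData class may store its contents differently.
class FigureEntry {
    final String imageUrl;
    final String caption;

    FigureEntry(String imageUrl, String caption) {
        this.imageUrl = imageUrl;
        this.caption = caption;
    }
}

// Hypothetical helper showing journal-specific string parsing.
class FigureParser {
    // Assumed markup: <img src="..."> followed by <p class="caption">...</p>.
    // A real journal page would need its own pattern.
    private static final Pattern FIGURE = Pattern.compile(
            "<img\\s+src=\"([^\"]+)\"[^>]*>\\s*<p class=\"caption\">(.*?)</p>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns every image URL and caption found in the page source, in order.
    static List<FigureEntry> parseFigures(String html) {
        List<FigureEntry> figures = new ArrayList<>();
        Matcher m = FIGURE.matcher(html);
        while (m.find()) {
            figures.add(new FigureEntry(m.group(1), m.group(2).trim()));
        }
        return figures;
    }
}
```

A concrete subclass would presumably download the page with getData, decode the bytes to a String, and apply a pattern of this kind tuned to its publisher.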
getRedirURL
public abstract java.lang.String getRedirURL(java.lang.String url)
throws java.lang.Exception,
java.lang.Error
- Because the URLs come from PubMed and each journal publisher's website is
different, we must follow a series of HTTP 301 redirects and then search
the resulting page to find the URL of the full article. Each publisher's
subclass therefore needs its own implementation of this function.
- Parameters:
url
- URL to the full article, as supplied by PubMed
- Returns:
- String - The actual URL of the full journal article
- Throws:
java.lang.Exception
java.lang.Error
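The second half of this process, searching the landing page for the full-article URL, might look like the sketch below. It assumes the redirects have already been followed and the final page downloaded. The FullTextLinkFinder class, the "Full Text" link label, and the example.org URLs in the usage note are all hypothetical; each publisher's page would need its own search string.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: locates the full-text link on a publisher landing page.
class FullTextLinkFinder {
    // Assumption: the page links to the article with an anchor whose
    // visible text contains "Full Text". Real publishers differ.
    private static final Pattern FULL_TEXT_LINK = Pattern.compile(
            "<a\\s[^>]*href=\"([^\"]+)\"[^>]*>[^<]*Full Text[^<]*</a>",
            Pattern.CASE_INSENSITIVE);

    // landingPageUrl is the URL reached after the redirects; it is used to
    // resolve relative links. html is the source of that page.
    static Optional<String> findFullTextUrl(String landingPageUrl, String html) {
        Matcher m = FULL_TEXT_LINK.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        // Resolve the href against the landing page so relative links work.
        URI resolved = URI.create(landingPageUrl).resolve(m.group(1));
        return Optional.of(resolved.toString());
    }
}
```

For instance, given a landing page at http://journal.example.org/cgi/abstract/1 containing a link with href "/content/132/5/1021.full" labelled "Full Text (HTML)", the method would return http://journal.example.org/content/132/5/1021.full.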
getData
public byte[] getData(java.lang.String url)
throws java.lang.Exception,
java.lang.Error
- Takes a URL as a String and returns the contents found at that URL as a
byte array. This is how pages and images are actually downloaded. This function
uses the Apache HttpClient class, as it was one of the few HTTP classes
that provided sufficient functionality for browser spoofing, which was required
to access the journal websites correctly.
- Parameters:
url
- URL of the page or image to download
- Returns:
- byte[] - The contents downloaded from the URL
- Throws:
java.lang.Exception
java.lang.Error
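The download-with-spoofing idea can be sketched without Apache HttpClient. The example below deliberately swaps in the JDK's own java.net.URLConnection, so it is not the implementation this class uses. The ByteFetcher class, the User-Agent string, and the timeout values are hypothetical choices for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper; uses the JDK's URLConnection instead of Apache HttpClient.
class ByteFetcher {
    // Assumed browser-like User-Agent; the value is illustrative only.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; ExampleScraper/1.0)";

    // Reads the stream to its end and returns everything as one byte array.
    static byte[] readAll(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Downloads a page or image, presenting a browser-like User-Agent header.
    static byte[] fetch(String url) {
        try {
            URLConnection conn = URI.create(url).toURL().openConnection();
            conn.setRequestProperty("User-Agent", USER_AGENT);
            conn.setConnectTimeout(10_000); // milliseconds, arbitrary choice
            conn.setReadTimeout(30_000);    // milliseconds, arbitrary choice
            try (InputStream in = conn.getInputStream()) {
                return readAll(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Returning raw bytes rather than a String lets the same method serve both HTML pages and binary image files; the caller decodes the bytes only when it knows the content is text.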