opendap.crawler
Class ResponseCachePostgres

java.lang.Object
  extended by opendap.crawler.ResponseCachePostgres

public class ResponseCachePostgres
extends java.lang.Object

Provide a cache for XML/HTTP response objects. The cache can hold both the DDX XML/text version of a response and its Last Modified Time (LMT). The class can therefore serve as the basis for a simple HTTP 1.1 cache in which a conditional GET is used to eliminate repeat transfers of a Response. The cache can also be used directly to process a collection of Responses retrieved earlier.

The cache uses Postgres to store the responses (i.e., the documents) and a ConcurrentHashMap, which it serializes to disk files for persistence, to store the LMTs of the URLs visited.

Before Postgres can be used as the cache, the database must be set up. There must be a database called 'crawl_cache', and it should contain a table whose name is passed to the constructor of this class in the 'tableName' parameter. That table should have columns named key, url and doc. Create it with "CREATE TABLE ddx_responses (key SERIAL PRIMARY KEY, url varchar(256), doc text);" where 'ddx_responses' is the value of 'tableName'.

This class was modified from a version that could optionally use a hash map to store the responses.
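The 'visited' half of the design described above (a ConcurrentHashMap serialized to disk) can be pictured with a small, self-contained sketch. This is not the code of this class: the class name VisitedSketch, the save/load method names, and the single-file layout are assumptions made for illustration only.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the 'visited' cache: URL -> LMT in seconds. */
final class VisitedSketch {
    private ConcurrentHashMap<String, Long> visited = new ConcurrentHashMap<>();

    void setLastVisited(String url, long seconds) {
        visited.put(url, seconds);
    }

    /** Returns 0 when the URL has never been recorded. */
    long getLastVisited(String url) {
        return visited.getOrDefault(url, 0L);
    }

    /** Serialize the whole map to a file so that a later run can reload it. */
    void save(Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(visited);
        }
    }

    /** Replace the in-memory map with the one stored in the file. */
    @SuppressWarnings("unchecked")
    void load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            visited = (ConcurrentHashMap<String, Long>) in.readObject();
        }
    }
}
```

In this sketch a second instance that calls load() on the file written by save() sees the same URL-to-time pairs, which is the persistence behavior the description attributes to the 'visited' cache.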

Author:
jimg

Nested Class Summary
 class ResponseCachePostgres.ResponseCacheKeysEnumeration
           
 class ResponseCachePostgres.ResponseVisitedKeysEnumeration
           
 
Constructor Summary
ResponseCachePostgres(boolean readOnly, java.lang.String cacheName, java.lang.String tableName)
          Build an instance of ResponseCachePostgres.
ResponseCachePostgres(java.lang.String cacheName, java.lang.String tableName)
          Build an instance of ResponseCachePostgres.
 
Method Summary
protected  void finalize()
          This won't be called when an out of memory exception is thrown.
 java.lang.String getCachedResponse(java.lang.String URL)
          Retrieve a Response document from the cache.
 long getLastVisited(java.lang.String URL)
          When was this Response URL last visited?
 java.util.Enumeration<java.lang.String> getLastVisitedKeys()
          Get all of the keys in the URL cache named when this class was instantiated.
 java.util.Enumeration<java.lang.String> getResponseKeys()
          Get all of the keys in the Response document (postgres) cache.
 boolean isVisited(java.lang.String URL)
          Has this URL been visited?
 void saveState()
          Force the cache to save its state now.
 void setCachedResponse(java.lang.String URL, java.lang.String doc)
          Add a Response document to the cache using its URL as a key.
 void setLastVisited(java.lang.String URL, long d)
          Add or update the entry in the Response URL cache.
 
Methods inherited from class java.lang.Object
clone, equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ResponseCachePostgres

public ResponseCachePostgres(java.lang.String cacheName,
                             java.lang.String tableName)
                      throws java.lang.Exception
Build an instance of ResponseCachePostgres. By default this cache supports both reads and writes. It will zero out any previously cached responses.

Parameters:
cacheName - The basename to use for the 'visited' cache.
tableName - The name for the Postgres table in the 'crawl_cache' database where responses should be stored.
Throws:
java.lang.Exception
See Also:
ResponseCachePostgres(boolean readOnly, String cacheName, String tableName)

ResponseCachePostgres

public ResponseCachePostgres(boolean readOnly,
                             java.lang.String cacheName,
                             java.lang.String tableName)
                      throws java.lang.Exception
Build an instance of ResponseCachePostgres.

Parameters:
readOnly - Is this cache object being created so that a client program can read previously cached data? If this is true, the client should never try to write to the cache.
cacheName - The basename to use for the 'visited' cache.
tableName - The name for the Postgres table in the 'crawl_cache' database where responses should be stored.
Throws:
java.lang.Exception
Method Detail

finalize

protected void finalize()
                 throws java.lang.Exception
This won't be called when an out of memory exception is thrown.

Overrides:
finalize in class java.lang.Object
Throws:
java.lang.Exception

saveState

public void saveState()
               throws java.lang.Exception
Force the cache to save its state now.

Throws:
java.lang.Exception

setCachedResponse

public void setCachedResponse(java.lang.String URL,
                              java.lang.String doc)
                       throws java.lang.Exception
Add a Response document to the cache using its URL as a key.

Parameters:
URL - The URL used as the key for the document
doc - The Response document to store
Throws:
java.lang.Exception

getCachedResponse

public java.lang.String getCachedResponse(java.lang.String URL)
                                   throws java.lang.Exception
Retrieve a Response document from the cache.

Parameters:
URL - Get the document paired with this URL
Returns:
The document
Throws:
java.lang.Exception

getResponseKeys

public java.util.Enumeration<java.lang.String> getResponseKeys()
                                                        throws java.lang.Exception
Get all of the keys in the Response document (postgres) cache. It's likely that you want to use the keys from the 'visited' cache instead.

Returns:
An Enumeration that can be used to access all of the keys in the cache. Use getCachedResponse(key) to get the Response documents.
Throws:
java.lang.Exception
See Also:
getLastVisitedKeys()

getLastVisited

public long getLastVisited(java.lang.String URL)
When was this Response URL last visited?

Parameters:
URL - The URL
Returns:
The time when this URL was last visited or 0 if it's never been looked at. Time is given in seconds since 1 Jan 1970.
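Because the value is given in seconds, it usually needs converting before it is handed to HTTP code. The sketch below shows two conversions a caller might want; the class and method names (LmtFormatSketch, toHttpDate, toMillis) are hypothetical and not part of this API.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

final class LmtFormatSketch {
    // HTTP-date (IMF-fixdate) layout, always expressed in GMT.
    private static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
                    .withZone(ZoneOffset.UTC);

    /** Convert seconds since 1 Jan 1970 into an HTTP-date string. */
    static String toHttpDate(long secondsSinceEpoch) {
        return HTTP_DATE.format(Instant.ofEpochSecond(secondsSinceEpoch));
    }

    /** java.net.URLConnection.setIfModifiedSince(long) expects milliseconds. */
    static long toMillis(long secondsSinceEpoch) {
        return Math.multiplyExact(secondsSinceEpoch, 1000L);
    }
}
```

For example, toHttpDate(0) yields "Thu, 01 Jan 1970 00:00:00 GMT". A caller would normally treat a return value of 0 from getLastVisited() as "never visited" and issue an unconditional GET instead.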

isVisited

public boolean isVisited(java.lang.String URL)
Has this URL been visited?

Parameters:
URL - The URL
Returns:
true if the URL has been visited.

setLastVisited

public void setLastVisited(java.lang.String URL,
                           long d)
                    throws java.lang.Exception
Add or update the entry in the Response URL cache. This is used to store the Last Modified Time (LMT) for a given URL. The time used is initially the current time. If the Response URL has been visited previously (see getLastVisited()), then that LMT can be used with a conditional HTTP GET request. Of course, this cache can be used in other ways too.

Parameters:
URL - The URL
d - The Last Modified Time to be paired with the URL
Throws:
java.lang.Exception
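The conditional GET flow mentioned above can be sketched with in-memory maps standing in for the two caches. Everything here is an assumption for illustration: ConditionalGetSketch, Reply and Fetcher are hypothetical names, the transport is abstracted away, and the sketch stores the time reported by the server rather than the current time.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class ConditionalGetSketch {
    /** Hypothetical reply: HTTP status, body (null on 304), and LMT in seconds. */
    static final class Reply {
        final int status;
        final String body;
        final long lastModifiedSeconds;

        Reply(int status, String body, long lastModifiedSeconds) {
            this.status = status;
            this.body = body;
            this.lastModifiedSeconds = lastModifiedSeconds;
        }
    }

    /** Hypothetical transport; a real one would send an If-Modified-Since header. */
    interface Fetcher {
        Reply get(String url, long ifModifiedSinceSeconds);
    }

    private final Map<String, Long> lmt = new ConcurrentHashMap<>();    // 'visited' cache
    private final Map<String, String> docs = new ConcurrentHashMap<>(); // response cache

    /** Return the document for url, transferring it only when it has changed. */
    String fetch(String url, Fetcher fetcher) {
        long since = lmt.getOrDefault(url, 0L); // 0 means never visited
        Reply reply = fetcher.get(url, since);
        if (reply.status == 304) {
            return docs.get(url); // Not Modified: reuse the cached document
        }
        docs.put(url, reply.body);
        lmt.put(url, reply.lastModifiedSeconds);
        return reply.body;
    }
}
```

With the real class, the two maps would correspond to getLastVisited()/setLastVisited() and getCachedResponse()/setCachedResponse() respectively.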

getLastVisitedKeys

public java.util.Enumeration<java.lang.String> getLastVisitedKeys()
Get all of the keys in the URL cache named when this class was instantiated. This returns all of the keys in the 'visited' cache, not the Postgres database cache; several 'visited' caches might use a single table in the crawl_cache Postgres database.

Returns:
An Enumeration that can be used to access all of the keys in the cache. Use getLastVisited(key) to get the LMT for each response.
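Walking an Enumeration of keys and looking up the time for each one might look like the following sketch. It uses a plain ConcurrentHashMap in place of the cache, and the names VisitedKeysSketch and olderThan are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

final class VisitedKeysSketch {
    /** Collect, in sorted order, the URLs whose recorded time is before cutoff (seconds). */
    static List<String> olderThan(ConcurrentHashMap<String, Long> visited, long cutoff) {
        List<String> stale = new ArrayList<>();
        Enumeration<String> keys = visited.keys();
        while (keys.hasMoreElements()) {
            String url = keys.nextElement();
            // A key may vanish between enumeration and lookup; treat that as time 0.
            if (visited.getOrDefault(url, 0L) < cutoff) {
                stale.add(url);
            }
        }
        Collections.sort(stale);
        return stale;
    }
}
```

The same loop shape applies to the Enumeration returned by getLastVisitedKeys(), with getLastVisited(key) taking the place of the map lookup.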