pyscrapper.assembly.urlloaders

API

class pyscrapper.assembly.urlloaders.UrlLoader(pool, headers=None)

Bases: pyscrapper.assembly.observers.Observable

  • An interface that provides methods to load a URL and to shut down the current UrlLoader.
  • Each UrlLoader is a subclass of Observable, which lets it hold observers and notify them when a URL has been loaded.

load_url(url, **kwargs)

Loads the URL using the selected pool (ThreadPool or ProcessPool).

shutdown(wait=True)

Shuts down the UrlLoader.

Parameters:
  • wait – If True (the default), waits until the existing queue of URLs has been loaded
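The relationship described above (a loader that is also an Observable) can be sketched roughly as follows. This is a stand-alone illustration, not pyscrapper's actual code: the names Observer.notify, Observable.notify_observers, SketchUrlLoader, Collector and EchoLoader are assumptions made for the example, and the toy loader "loads" a URL synchronously by returning a stub.

```python
from abc import ABC, abstractmethod


class Observer(ABC):
    """Hypothetical observer: receives whatever the loader pushes."""

    @abstractmethod
    def notify(self, response):
        ...


class Observable:
    """Hypothetical observable: holds observers and notifies each one."""

    def __init__(self):
        self._observers = []

    def add_observer(self, observer):
        self._observers.append(observer)

    def notify_observers(self, response):
        for observer in self._observers:
            observer.notify(response)


class SketchUrlLoader(Observable, ABC):
    """Interface shape only: load a URL, and shut the loader down."""

    @abstractmethod
    def load_url(self, url, **kwargs):
        ...

    @abstractmethod
    def shutdown(self, wait=True):
        ...


class Collector(Observer):
    """Observer that simply records every response it is given."""

    def __init__(self):
        self.seen = []

    def notify(self, response):
        self.seen.append(response)


class EchoLoader(SketchUrlLoader):
    """Toy loader: no network access, pushes a stub response."""

    def load_url(self, url, **kwargs):
        self.notify_observers({"url": url, "body": "stub"})

    def shutdown(self, wait=True):
        pass  # nothing to release in this toy version


collector = Collector()
loader = EchoLoader()
loader.add_observer(collector)
loader.load_url("http://example.com")
loader.shutdown()
```

The point of the split is that code consuming responses only needs to implement the observer side and never has to know which concrete loader produced them.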

class pyscrapper.assembly.urlloaders.BrowserLessUrlLoader(pool=None, max_workers=None, headers=None, **kwargs)

Bases: pyscrapper.assembly.urlloaders.UrlLoader

A concrete implementation of the UrlLoader interface that uses a ThreadPoolExecutor to execute URL requests concurrently in a browserless context (it cannot handle content that is lazy-loaded by JavaScript). When a response is received, it is pushed to the observers (Observer instances) that the loader holds.

add_observer(observer: pyscrapper.assembly.observers.Observer)

Adds the observer to the list of observers.

load_url(url, **kwargs)

Loads the given URL as an HTTP request and pushes the response to the observers.

shutdown(wait=True)

Shuts down the UrlLoader.

Parameters:
  • wait – If True (the default), waits until the existing queue of URLs has been loaded
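A minimal sketch of the pattern this class describes: requests are submitted to a ThreadPoolExecutor, each response is pushed to the observers, and shutdown(wait=True) blocks until queued URLs are done. It is not the library's implementation. SketchThreadedLoader, its fetch argument and fake_fetch are hypothetical names, observers are plain callables for brevity, and the fetch function is injected so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


class SketchThreadedLoader:
    """Hypothetical stand-in for a thread-pool based URL loader."""

    def __init__(self, fetch, max_workers=None):
        self._fetch = fetch  # callable(url, **kwargs) -> response
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._observers = []

    def add_observer(self, observer):
        # In this sketch an observer is any callable taking the response.
        self._observers.append(observer)

    def load_url(self, url, **kwargs):
        # Queue the request, then push the result once it completes.
        future = self._pool.submit(self._fetch, url, **kwargs)
        future.add_done_callback(self._push)
        return future

    def _push(self, future):
        # result() re-raises any exception raised by the fetch function.
        response = future.result()
        for observer in self._observers:
            observer(response)

    def shutdown(self, wait=True):
        # With wait=True this returns only after queued URLs are processed.
        self._pool.shutdown(wait=wait)


def fake_fetch(url, **kwargs):
    """Stand-in for a real HTTP request."""
    return "response for " + url


results = []
loader = SketchThreadedLoader(fake_fetch, max_workers=2)
loader.add_observer(results.append)
for u in ("http://a.example", "http://b.example"):
    loader.load_url(u)
loader.shutdown(wait=True)
```

Because the requests run concurrently, observers may receive responses in a different order from the one in which the URLs were submitted.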

class pyscrapper.assembly.urlloaders.PhantomUrlLoader(pool=None, driver_path='/home/docs/checkouts/readthedocs.org/user_builds/pyscrapper/checkouts/latest/pyscrapper/resources/phantomjs', max_workers=None, headers=None, **kwargs)

Bases: pyscrapper.assembly.urlloaders.UrlLoader

A concrete implementation of the UrlLoader interface that uses a ThreadPoolExecutor to execute URL requests concurrently in a browser-based context (it can handle content that is lazy-loaded by JavaScript). It uses the PhantomJS headless web browser to load the URLs. When a response is received, it is pushed to the observers (Observer instances) that the loader holds.

add_observer(observer: pyscrapper.assembly.observers.Observer)

Adds the observer to the list of observers.

load_url(url, **kwargs)

Parameters:
  • url – URL to be loaded by the URL loader
  • pre_exec – A function or method that is called with the Selenium web driver object as its argument before the driver loads the given URL
  • post_exec – A function or method that is called with the Selenium web driver object as its argument after the driver has loaded the given URL

Note

The pre_exec and post_exec hooks let developers perform extra operations by accessing the web driver directly. They were provided because some elements take a long time to appear in the web browser. But, the web browser

shutdown(wait=True)

Shuts down the UrlLoader.

Parameters:
  • wait – If True (the default), waits until the existing queue of URLs has been loaded
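The call order of the pre_exec and post_exec hooks of load_url can be illustrated with a small sketch. It assumes only what the parameter descriptions state: each hook receives the driver object, one before and one after the URL is loaded. FakeDriver, load_with_hooks, before and after are hypothetical names. FakeDriver stands in for a Selenium web driver and mimics just its get(url) method and page_source attribute, so the example runs without PhantomJS.

```python
class FakeDriver:
    """Stand-in for a Selenium web driver (only get and page_source)."""

    def __init__(self):
        self.calls = []
        self.page_source = ""

    def get(self, url):
        self.calls.append(("get", url))
        self.page_source = "<html>" + url + "</html>"


def load_with_hooks(driver, url, pre_exec=None, post_exec=None):
    """Hypothetical helper showing where each hook runs."""
    if pre_exec is not None:
        pre_exec(driver)   # runs before the URL is loaded
    driver.get(url)
    if post_exec is not None:
        post_exec(driver)  # runs after the URL is loaded
    return driver.page_source


def before(driver):
    # A real hook might set the window size or add cookies here.
    driver.calls.append(("pre_exec",))


def after(driver):
    # A real hook might wait for a slow element or scroll the page here.
    driver.calls.append(("post_exec",))


driver = FakeDriver()
html = load_with_hooks(driver, "http://example.com",
                       pre_exec=before, post_exec=after)
```

With the real loader the hooks would presumably be passed as keyword arguments to load_url, since its signature accepts **kwargs.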