pyscrapper.assembly.urlloaders
API
-
class pyscrapper.assembly.urlloaders.UrlLoader(pool, headers=None)
Bases: pyscrapper.assembly.observers.Observable
An interface that provides methods to load a URL and to shut down the loader.
Each UrlLoader is a subclass of Observable, which lets the loader hold observers and notify them when a URL has been loaded.
-
load_url(url, **kwargs)
Loads the URL using whichever pool is selected (ThreadPool or ProcessPool).
-
shutdown(wait=True)
Shuts down the UrlLoader.
Parameters: - wait – When True (the default), waits until the existing queue of URLs has been loaded.
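The observer relationship described for UrlLoader can be illustrated with a minimal, self-contained sketch. The classes below are hypothetical stand-ins, not pyscrapper's actual Observable or Observer. Only add_observer appears in this reference; the names notify, notify_observers, and the injected fetch function are assumptions made for illustration.

```python
class SketchObserver:
    """Hypothetical observer; the real Observer interface may differ."""

    def __init__(self):
        self.received = []

    def notify(self, url, response):
        # Record every (url, response) pair pushed by the observable.
        self.received.append((url, response))


class SketchObservable:
    """Hypothetical stand-in for pyscrapper.assembly.observers.Observable."""

    def __init__(self):
        self._observers = []

    def add_observer(self, observer):
        self._observers.append(observer)

    def notify_observers(self, url, response):
        # Push the loaded response to every registered observer.
        for observer in self._observers:
            observer.notify(url, response)


class SketchLoader(SketchObservable):
    """Loader that 'loads' a URL by calling an injected fetch function."""

    def __init__(self, fetch):
        super().__init__()
        self._fetch = fetch

    def load_url(self, url, **kwargs):
        # Load the URL, then notify the observers with the result.
        self.notify_observers(url, self._fetch(url, **kwargs))


observer = SketchObserver()
loader = SketchLoader(fetch=lambda url, **kwargs: "body of " + url)
loader.add_observer(observer)
loader.load_url("http://example.com")
```

Injecting the fetch function keeps the sketch free of network access; a real loader would perform the request itself.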
-
class pyscrapper.assembly.urlloaders.BrowserLessUrlLoader(pool=None, max_workers=None, headers=None, **kwargs)
Bases: pyscrapper.assembly.urlloaders.UrlLoader
A concrete implementation of the UrlLoader interface that uses a ThreadPoolExecutor to execute URL requests concurrently in a browserless context (it cannot handle content that is lazy-loaded by JavaScript). When a URL response is received, it is pushed to the observers (Observer instances) that the loader holds.
-
add_observer(observer: pyscrapper.assembly.observers.Observer)
Adds the observer to the observers list.
-
load_url(url, **kwargs)
Loads the given URL as an HTTP request and pushes the response to the observers.
-
shutdown(wait=True)
Shuts down the UrlLoader.
Parameters: - wait – When True (the default), waits until the existing queue of URLs has been loaded.
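The combination of a thread pool, observer notification, and shutdown(wait=True) described for BrowserLessUrlLoader can be sketched with the standard library alone. This is not pyscrapper's implementation: SketchThreadedLoader, its injected fetch function, and the use of plain callables as observers are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class SketchThreadedLoader:
    """Hypothetical loader: fetches run on a thread pool, results go to callbacks."""

    def __init__(self, fetch, max_workers=None):
        self._fetch = fetch
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._observers = []

    def add_observer(self, callback):
        self._observers.append(callback)

    def load_url(self, url, **kwargs):
        # Submit the fetch to the pool; publish the result once it completes.
        future = self._pool.submit(self._fetch, url, **kwargs)
        future.add_done_callback(lambda f: self._publish(url, f.result()))
        return future

    def _publish(self, url, response):
        for callback in self._observers:
            callback(url, response)

    def shutdown(self, wait=True):
        # With wait=True, block until all queued fetches have finished.
        self._pool.shutdown(wait=wait)


results = {}


def record(url, response):
    results[url] = response


loader = SketchThreadedLoader(fetch=lambda url: "body of " + url, max_workers=2)
loader.add_observer(record)
for u in ("http://example.com/a", "http://example.com/b"):
    loader.load_url(u)
loader.shutdown(wait=True)
```

Because shutdown(wait=True) blocks until the pool's workers finish, every queued URL has been fetched and published by the time it returns.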
-
-
class pyscrapper.assembly.urlloaders.PhantomUrlLoader(pool=None, driver_path='/home/docs/checkouts/readthedocs.org/user_builds/pyscrapper/checkouts/latest/pyscrapper/resources/phantomjs', max_workers=None, headers=None, **kwargs)
Bases: pyscrapper.assembly.urlloaders.UrlLoader
A concrete implementation of the UrlLoader interface that uses a ThreadPoolExecutor to execute URL requests concurrently in a browser-based context (it can handle content that is lazy-loaded by JavaScript). It uses the PhantomJS headless web browser to load the URLs. When a URL response is received, it is pushed to the observers (Observer instances) that the loader holds.
-
add_observer(observer: pyscrapper.assembly.observers.Observer)
Adds the observer to the observers list.
-
load_url(url, **kwargs)
Parameters: - url – URL to be loaded by the URL loader.
- pre_exec – A function or method that is called with the Selenium web driver object as its argument, before the driver loads the given URL.
- post_exec – A function or method that is called with the Selenium web driver object as its argument, after the driver loads the given URL.
Note
The pre_exec and post_exec hooks let developers perform extra operations by accessing the web driver directly. They were provided on the intuition that some elements take a long time to appear in the web browser. But, the web browser
-
shutdown(wait=True)
Shuts down the UrlLoader.
Parameters: - wait – When True (the default), waits until the existing queue of URLs has been loaded.
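The call order of the pre_exec and post_exec parameters of PhantomUrlLoader.load_url can be sketched as follows. FakeDriver and load_with_hooks are hypothetical names introduced here so that the example runs without Selenium or PhantomJS; the ordering is assumed from the parameter descriptions, not taken from pyscrapper's source.

```python
class FakeDriver:
    """Hypothetical stand-in for a Selenium web driver; records calls only."""

    def __init__(self):
        self.calls = []

    def get(self, url):
        # A real Selenium driver would navigate to the URL here.
        self.calls.append("get " + url)


def load_with_hooks(driver, url, pre_exec=None, post_exec=None):
    # Assumed order: pre_exec(driver), then driver.get(url), then post_exec(driver).
    if pre_exec is not None:
        pre_exec(driver)
    driver.get(url)
    if post_exec is not None:
        post_exec(driver)
    return driver


driver = load_with_hooks(
    FakeDriver(),
    "http://example.com",
    pre_exec=lambda d: d.calls.append("pre"),
    post_exec=lambda d: d.calls.append("post"),
)
```

With a real Selenium driver, a post_exec hook would be a natural place for an explicit wait (for example Selenium's WebDriverWait) on elements that appear late.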
-