pyscrapper.scrapper

Scrapper

The core scraping module of Pyscrapper. Once the HTML markup is loaded from a URL, this module comes into play: it parses the HTML content and builds the final JSON object as described by the configuration.

API

class pyscrapper.scrapper.PyScrapper(html, config, is_list=False, name='')

Each block of the given configuration is handled by an instance of the PyScrapper class.

Parameters:
  • html (str) – the HTML markup that needs to be scraped
  • config (dict) – the configuration, which tells the parser which parts of the HTML should be extracted and how the parsed data should be structured
get_scrapped_config()

Returns the parsed content.
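To picture what "config in, structured data out" means here, the toy extractor below walks HTML with the standard library and maps each config key to the texts found for a tag name. This is an illustrative sketch only; pyscrapper's real configuration keys and selector support may differ.

```python
# Toy config-driven extractor: illustrates the idea behind PyScrapper,
# NOT pyscrapper's actual implementation or config format.
from html.parser import HTMLParser

class ToyExtractor(HTMLParser):
    """Collects the text of every occurrence of a given tag."""
    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self._inside = False
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == self.tag:
            self._inside = False

    def handle_data(self, data):
        if self._inside and data.strip():
            self.matches.append(data.strip())

def toy_scrape(html, config):
    """Map each config key to the texts found for its tag name."""
    result = {}
    for key, tag in config.items():
        parser = ToyExtractor(tag)
        parser.feed(html)
        result[key] = parser.matches
    return result

html = "<html><body><h1>Title</h1><p>one</p><p>two</p></body></html>"
print(toy_scrape(html, {"heading": "h1", "paragraphs": "p"}))
# → {'heading': ['Title'], 'paragraphs': ['one', 'two']}
```

The shape of the result mirrors the shape of the config, which is the key property PyScrapper's configuration-driven parsing provides.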

class pyscrapper.scrapper.RequestHandler

This class holds the basic configuration by which the synchronous scrape_content function loads URLs.

MAX_WORKERS:
  • This property limits the RequestHandler to at most MAX_WORKERS concurrent requests when URL loading is performed in a multi-threaded or multi-process environment.
  • Defaults to the number of CPUs on the current system.

Note

E.g. RequestHandler.MAX_WORKERS = 2 # Allows only 2 URL loaders to run in parallel when the application executes in a parallel environment.
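The throttling idea behind MAX_WORKERS can be sketched with the standard library: a class attribute, defaulting to the CPU count, caps the size of a thread pool. The Loader class and load function below are stand-ins for illustration, not pyscrapper's internals.

```python
# Sketch of MAX_WORKERS-style throttling using only the standard library.
import os
from concurrent.futures import ThreadPoolExecutor

class Loader:
    # Cap on concurrent loads; defaults to the CPU count,
    # as RequestHandler.MAX_WORKERS does.
    MAX_WORKERS = os.cpu_count()

def load(url):
    # Stand-in for a real URL loader; here it just echoes its input.
    return f"loaded {url}"

urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]

Loader.MAX_WORKERS = 2  # allow only two loaders to run in parallel
with ThreadPoolExecutor(max_workers=Loader.MAX_WORKERS) as pool:
    results = list(pool.map(load, urls))

print(results)
# → ['loaded http://example.com/a', 'loaded http://example.com/b', 'loaded http://example.com/c']
```

Note that lowering the cap changes only how many loads run at once; the order of results from `map` is preserved regardless.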

pyscrapper.scrapper.scrape_content(url, config, to_string=False, raise_exception=True, window_size=(1366, 784), **kwargs)

Performs the operation synchronously. Takes a URL and a configuration as parameters, loads the given URL in a web browser, then parses the HTML according to the given configuration.

Parameters:
  • url (string) – URL of the webpage to be scraped
  • config (dict) – configuration dictionary describing which parts of the HTML should be scraped and how the result should be modelled
  • to_string (bool) – if True, returns the scraped and modelled JSON as a string
Returns:

the parsed data
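The effect of the to_string flag can be pictured with the standard json module: the parsed result is a plain Python structure, and to_string=True would correspond to its JSON serialization. The parsed dict below is made up for illustration; this sketch shows only the distinction between the two return shapes, not pyscrapper's code.

```python
import json

# Suppose scrape_content produced this parsed structure (hypothetical data).
parsed = {"title": "Example Domain", "links": ["http://www.iana.org"]}

# to_string=False → the structure itself; to_string=True → its JSON text.
as_object = parsed
as_text = json.dumps(parsed)

print(type(as_object).__name__)  # → dict
print(as_text)                   # → {"title": "Example Domain", "links": ["http://www.iana.org"]}
```

Keeping to_string=False is convenient when the result feeds further Python processing; to_string=True suits logging or writing the result straight to a file or HTTP response.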