sparkler #1

Description

A web crawler is a bot that fetches resources from the web to build applications such as search engines and knowledge bases. Sparkler (a contraction of Spark-Crawler) is a web crawler that draws on recent advances in distributed computing and information retrieval by combining several Apache projects: Spark, Kafka, Lucene/Solr, Tika, and Felix. It is an extensible, highly scalable, high-performance crawler that evolved from Apache Nutch and runs on an Apache Spark cluster.


Usage

Sparkler depends on Java and Solr, and optionally on Spark. To deploy it together with its required dependencies:

juju deploy openjdk java
juju deploy cs:~spiculecharms/apache-solr solr
juju deploy cs:~spiculecharms/sparkler
juju add-relation solr sparkler
juju add-relation java sparkler
juju add-relation solr java

Scale out Usage

Scaling out is not currently supported.

Known Limitations and Issues

The documentation is incomplete.

Contact Information

Contact the developers here:

Upstream Project Name

Configuration

generate-topn
(string) Generates the top N URLs for fetching.
Default: 1000

crawldb-uri
(string) Override the auto-detected crawldb URI.

kafka-enable
(boolean) Enable Kafka dump.

plugins-bundle-directory
(string) Plugins bundle directory. Configured through Maven.
Default: ${project.parent.basedir}${file.separator}${project.bundles.directory}

spark-master
(string) Override the auto-detected Spark URI.

kafka-listeners
(string) Override the Kafka listeners.

fetcher-server-delay
(string) Delay (in milliseconds) between two fetch requests for the same host.
Default: 1000

generate-top-groups
(string) Generates the Top Groups
Default: 256

kafka-topic
(string) The Kafka topic.
Default: sparkler_%s
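
As an illustration of how the options above might be set, the sketch below writes a Juju configuration file. It assumes the application is deployed under the name sparkler, as in the Usage section. The file name sparkler-config.yaml and the values chosen are hypothetical examples, not recommended settings.

```shell
# Write a Juju config file keyed by the application name ("sparkler").
# The values below are illustrative only; string-typed options are quoted.
cat > sparkler-config.yaml <<'EOF'
sparkler:
  generate-topn: "500"
  fetcher-server-delay: "2000"
  kafka-enable: true
EOF
```

With Juju 2.x, a file like this can be passed at deploy time with `juju deploy cs:~spiculecharms/sparkler --config sparkler-config.yaml`, and a single option can be changed on a running application with `juju config sparkler fetcher-server-delay=2000`.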