importio #3

  • By ian-clark
  • Latest version (#3)
  • precise
  • Stable
  • Edge

Description

This charm runs the import.io application as a command-line crawler. Given a relation and a configuration, it crawls a site
according to that configuration and sends the results to the related service.

For more information on the command-line crawler, see our support page: http://support.import.io/knowledgebase/articles/325728


Overview

This charm sets up a machine to run the import.io application as a command-line crawler. Use it to crawl your target sites and push the data directly into your target application.

The target application must be able to accept JSON documents posted over HTTP. Currently the only supported application is Elasticsearch, which works out of the box.
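To make "JSON documents posted over HTTP" concrete, here is a minimal sketch of what such a request could look like against Elasticsearch. It assumes the conventional Elasticsearch layout of http://HOST:9200/INDEX/TYPE, with the index taken from crawlName and the type from dataName; the host, names and document are hypothetical, and the charm's actual request format is not documented here.

```shell
#!/bin/sh
# es_doc_url HOST INDEX TYPE
# Builds the URL a JSON document would be POSTed to, assuming the
# Elasticsearch convention http://HOST:9200/INDEX/TYPE.
es_doc_url() {
    printf 'http://%s:9200/%s/%s\n' "$1" "$2" "$3"
}

# Hypothetical host, index (crawlName) and type (dataName), for illustration only.
url=$(es_doc_url localhost mycrawl mydata)
echo "$url"

# Uncomment to send a made-up document to a running Elasticsearch:
# curl -X POST -H 'Content-Type: application/json' \
#      -d '{"title": "Example page", "price": 9.99}' "$url"
```

Any other data store that accepts a request of this shape could, in principle, stand in for Elasticsearch.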

For more details of the import.io command-line crawler functionality, please read:

Command-Line Crawler Instructions

For more details of the import.io command-line crawler settings, please read:

Crawler Settings

Usage

Deploy the charm:

juju deploy importio

Currently you also need Elasticsearch running and related to importio:

juju deploy elasticsearch

juju add-relation importio elasticsearch

Known Limitations and Issues

Currently the only target we stream JSON documents into is Elasticsearch. In theory, other data stores would work as well.

Configuration

The configuration does not ship with defaults for most settings. The easiest way to set them is:

juju set importio --config /path/to/config.yaml

with a YAML file like this:

connectorGuid: 
startUrls:
maxDepth:
crawlTemplate:
dataTemplate:
connections:
pause:
apiKey:
userGuid:
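
As a rough illustration, the file might be created and applied as sketched here. Every value is a hypothetical placeholder (not a real GUID, key, or template), the flat key layout simply follows the skeleton above, and the file path is an arbitrary choice. The juju call is skipped if juju is not installed.

```shell
#!/bin/sh
# Write an example config file. Every value is a made-up placeholder.
config_file="${TMPDIR:-/tmp}/importio-config.yaml"

cat > "$config_file" <<'EOF'
connectorGuid: "00000000-0000-0000-0000-000000000000"
startUrls: "http://www.example.com/"
maxDepth: 3
crawlTemplate: "www.example.com/products"
dataTemplate: "YOUR_DATA_TEMPLATE"
connections: 2
pause: 1
apiKey: "YOUR_API_KEY"
userGuid: "00000000-0000-0000-0000-000000000000"
EOF

# Apply it with the same command as above, if juju is available.
if command -v juju >/dev/null 2>&1; then
    juju set importio --config "$config_file"
fi
```

Replace the placeholders with the values from your own import.io account and crawler before applying the file.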

Contact Information

If you have any problems with this charm, or ideas for improvements, please contact us at support@import.io or http://support.import.io/

Future Plans

Configuration

apiKey
(string) Your API key. You can create one after logging in at http://import.io/data/account/
connectorGuid
(string) The connector GUID of a crawler you have already set up, available from the 'my data' page: http://import.io/data/mine/
crawlTemplate
(string) The URL pattern that limits which parts of a site are crawled. For example, if you were only interested in crawling the beauty section at Boots, you would set this to www.boots.com/beauty. The fewer unnecessary places the crawler has to visit looking for data, the more efficiently it returns that data.
connections
(int) The number of pages the crawler will attempt to visit at the same time. The higher you set this number, the faster you will get data. WARNING: We do not recommend using any value higher than 5 if you are not crawling your own domain, as you may be blocked by the owner of the site.
Default: 2
dataName
(string) A name you choose for the data; it is used as the Elasticsearch type name.
pause
(int) How long the crawler will wait (in seconds) before moving from one page to the next. The smaller you set this number, the faster data will be returned. WARNING: We do not recommend setting it to zero, as you may be blocked by the owner of the site.
Default: 1
maxDepth
(int) The maximum number of clicks from the start URL that the crawler will travel to find data. The crawler's own default is 10 (the maximum allowed), which lets you get all the data; this charm uses the lower default shown below. The fewer clicks the crawler needs to travel, the quicker your data will be returned, so if possible it is a good idea to set this to a lower number.
Default: 3
crawlName
(string) A name you choose for the crawl; it is used as the Elasticsearch index name.
dataTemplate
(string) The URL pattern of your example pages. The crawler will try to extract data from any page that matches this pattern. For more details on the template syntax, see http://support.import.io/knowledgebase/articles/247574-advanced-crawler-options
userGuid
(string) Your import.io user ID. You can find it after logging in at http://import.io/data/account/
startUrls
(string) By default, the crawler will start from the pages you gave as examples. However, it is sometimes more efficient to start from somewhere more central to the site (like the homepage).
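
The warnings for connections, pause and maxDepth can be checked before applying a configuration. The following is only a sketch: check_crawl_settings is a hypothetical helper, not part of the charm, and the thresholds are taken directly from the option descriptions above.

```shell
#!/bin/sh
# check_crawl_settings CONNECTIONS PAUSE MAXDEPTH
# Prints one warning per setting that goes against the guidance above
# and returns non-zero if any warning was printed.
check_crawl_settings() {
    connections=$1
    pause=$2
    max_depth=$3
    status=0

    if [ "$connections" -gt 5 ]; then
        echo "warning: connections=$connections is above 5; the site owner may block you"
        status=1
    fi
    if [ "$pause" -le 0 ]; then
        echo "warning: pause=$pause gives no delay between pages; the site owner may block you"
        status=1
    fi
    if [ "$max_depth" -gt 10 ]; then
        echo "warning: maxDepth=$max_depth is above the maximum of 10"
        status=1
    fi
    return $status
}

# The values 2, 1 and 3 pass all three checks.
check_crawl_settings 2 1 3 && echo "settings look reasonable"
```

If you are crawling your own domain, the connections warning can be ignored, since that limit only applies to sites you do not own.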