Monday, 19 December 2016

Importance of Data Mining Services in Business

Importance of Data Mining Services in Business

Data mining is used in re-establishment of hidden information of the data of the algorithms. It helps to extract the useful information starting from the data, which can be useful to make practical interpretations for the decision making.
It can be technically defined as automated extraction of hidden information of great databases for the predictive analysis. In other words, it is the retrieval of useful information from large masses of data, which is also presented in an analyzed form for specific decision-making. Although data mining is a relatively new term, the technology is not. It is thus also known as Knowledge discovery in databases since it grip searching for implied information in large databases.
It is primarily used today by companies with a strong customer focus - retail, financial, communication and marketing organizations. It is having lot of importance because of its huge applicability. It is being used increasingly in business applications for understanding and then predicting valuable data, like consumer buying actions and buying tendency, profiles of customers, industry analysis, etc. It is used in several applications like market research, consumer behavior, direct marketing, bioinformatics, genetics, text analysis, e-commerce, customer relationship management and financial services.

However, the use of some advanced technologies makes it a decision making tool as well. It is used in market research, industry research and for competitor analysis. It has applications in major industries like direct marketing, e-commerce, customer relationship management, scientific tests, genetics, financial services and utilities.

Data mining consists of major elements:

    Extract and load operation data onto the data store system.
    Store and manage the data in a multidimensional database system.
    Provide data access to business analysts and information technology professionals.
    Analyze the data by application software.
    Present the data in a useful format, such as a graph or table.

The use of data mining in business makes the data more related in application. There are several kinds of data mining: text mining, web mining, relational databases, graphic data mining, audio mining and video mining, which are all used in business intelligence applications. Data mining software is used to analyze consumer data and trends in banking as well as many other industries.

Outsourcing Web Research offer complete Data Mining Services and Solutions to quickly collective data and information from multiple Internet sources for your Business needs in a cost efficient manner.

Sourec : http://ezinearticles.com/?Importance-of-Data-Mining-Services-in-Business&id=2601221

Tuesday, 13 December 2016

Data Extraction Services For Better Outputs in Your Business

Data Extraction Services For Better Outputs in Your Business

Data Extraction can be defined as the process of retrieving data from an unstructured source in order to process it further or store it. It is very useful for large organizations who deal with large amount of data on a daily basis that need to be processed into meaningful information and stored for later use. The data extraction is a systematic way to extract and structure data from scattered and semi-structured electronic documents, as found on the web and in various data warehouses.

In today's highly competitive business world, vital business information such as customer statistics, competitor's operational figures and inter-company sales figures play an important role in making strategic decisions. By signing on this service provider, you will be get access to critivcal data from various sources like websites, databases, images and documents.

It can help you take strategic business decisions that can shape your business' goals. Whether you need customer information, nuggets into your competitor's operations and figure out your organization's performance, it is highly critical to have data at your fingertips as and when you want it. Your company may be crippled with tons of data and it may prove a headache to control and convert the data into useful information. Data extraction services enable you get data quickly and in the right format.

Few areas where Data Extraction can help you are:

    Capturing financial data
    Generating better sales leads
    Conducting market research, survey and analysis
    Conducting product research and analysis
    Track, extract and harvest product pricing data
    Searching for specific job postings
    Duplicating an online database
    Acquiring real estate data
    Processing auction information
    Searching online newspapers for latest pricing information
    Extracting and summarize news stories from online news sources

Outsourcing companies provide custom made data extraction services to the client's requirements. The different types of data extraction services;

    Web extraction
    Database extraction

Outsourcing is the beneficial option for large organizations seeking to manage large information. Outsourcing this services helps businesses in managing their data effectively, which in turn enables business to experience an increase in profits. By outsourcing, you can certainly increase your competitive edge and save costs too!

This article is courtesy of Web Scraping Expert - an executive at Outsourcing Web Research offer high quality and time bound comprehensive range of data extraction services at affordable rates. For more info please visit us at: http://www.webscrapingexpert.com/ or directly send your requirements at: info@webscrapingexpert.com

Source:http://ezinearticles.com/?Data-Extraction-Services-For-Better-Outputs-in-Your-Business&id=2760257

Wednesday, 7 December 2016

Increasing Accessibility by Scraping Information From PDF

Increasing Accessibility by Scraping Information From PDF

You may have heard about data scraping which is a method that is being used by computer programs in extracting data from an output that comes from another program. To put it simply, this is a process which involves the automatic sorting of information that can be found on different resources including the internet which is inside an html file, PDF or any other documents. In addition to that, there is the collection of pertinent information. These pieces of information will be contained into the databases or spreadsheets so that the users can retrieve them later.

Most of the websites today have text that can be accessed and written easily in the source code. However, there are now other businesses nowadays that choose to make use of Adobe PDF files or Portable Document Format. This is a type of file that can be viewed by simply using the free software known as the Adobe Acrobat. Almost any operating system supports the said software. There are many advantages when you choose to utilize PDF files. Among them is that the document that you have looks exactly the same even if you put it in another computer so that you can view it. Therefore, this makes it ideal for business documents or even specification sheets. Of course there are disadvantages as well. One of which is that the text that is contained in the file is converted into an image. In this case, it is often that you may have problems with this when it comes to the copying and pasting.

This is why there are some that start scraping information from PDF. This is often called PDF scraping in which this is the process that is just like data scraping only that you will be getting information that is contained in your PDF files. In order for you to begin scraping information from PDF, you must choose and exploit a tool that is specifically designed for this process. However, you will find that it is not easy to locate the right tool that will enable you to perform PDF scraping effectively. This is because most of the tools today have problems in obtaining exactly the same data that you want without personalizing them.

Nevertheless, if you search well enough, you will be able to encounter the program that you are looking for. There is no need for you to have programming language knowledge in order for you to use them. You can easily specify your own preferences and the software will do the rest of the work for you. There are also companies out there that you can contact and they will perform the task since they have the right tools that they can use. If you choose to do things manually, you will find that this is indeed tedious and complicated whereas if you compare this to having professionals do the job for you, they will be able to finish it in no time at all. Scraping information from PDF is a process where you collect the information that can be found on the internet and this does not infringe copyright laws.

Source:http://ezinearticles.com/?Increasing-Accessibility-by-Scraping-Information-From-PDF&id=4593863

Saturday, 3 December 2016

Collecting Data With Web Scrapers

Collecting Data With Web Scrapers

There is a large amount of data available only through websites. However, as many people have found out, trying to copy data into a usable database or spreadsheet directly out of a website can be a tiring process. Data entry from internet sources can quickly become cost prohibitive as the required hours add up. Clearly, an automated method for collating information from HTML-based sites can offer huge management cost savings.

Web scrapers are programs that are able to aggregate information from the internet. They are capable of navigating the web, assessing the contents of a site, and then pulling data points and placing them into a structured, working database or spreadsheet. Many companies and services will use programs to web scrape, such as comparing prices, performing online research, or tracking changes to online content.

Let's take a look at how web scrapers can aid data collection and management for a variety of purposes.

Improving On Manual Entry Methods

Using a computer's copy and paste function or simply typing text from a site is extremely inefficient and costly. Web scrapers are able to navigate through a series of websites, make decisions on what is important data, and then copy the info into a structured database, spreadsheet, or other program. Software packages include the ability to record macros by having a user perform a routine once and then have the computer remember and automate those actions. Every user can effectively act as their own programmer to expand the capabilities to process websites. These applications can also interface with databases in order to automatically manage information as it is pulled from a website.

Aggregating Information

There are a number of instances where material stored in websites can be manipulated and stored. For example, a clothing company that is looking to bring their line of apparel to retailers can go online for the contact information of retailers in their area and then present that information to sales personnel to generate leads. Many businesses can perform market research on prices and product availability by analyzing online catalogues.

Data Management

Managing figures and numbers is best done through spreadsheets and databases; however, information on a website formatted with HTML is not readily accessible for such purposes. While websites are excellent for displaying facts and figures, they fall short when they need to be analyzed, sorted, or otherwise manipulated. Ultimately, web scrapers are able to take the output that is intended for display to a person and change it to numbers that can be used by a computer. Furthermore, by automating this process with software applications and macros, entry costs are severely reduced.

This type of data management is also effective at merging different information sources. If a company were to purchase research or statistical information, it could be scraped in order to format the information into a database. This is also highly effective at taking a legacy system's contents and incorporating them into today's systems.

Overall, a web scraper is a cost effective user tool for data manipulation and management.

source: http://ezinearticles.com/?Collecting-Data-With-Web-Scrapers&id=4223877

Wednesday, 30 November 2016

How Web Data Extraction Services Will Save Your Time and Money by Automatic Data Collection

How Web Data Extraction Services Will Save Your Time and Money by Automatic Data Collection

Data scrape is the process of extracting data from web by using software program from proven website only. Extracted data any one can use for any purposes as per the desires in various industries as the web having every important data of the world. We provide best of the web data extracting software. We have the expertise and one of kind knowledge in web data extraction, image scrapping, screen scrapping, email extract services, data mining, web grabbing.

Who can use Data Scraping Services?

Data scraping and extraction services can be used by any organization, company, or any firm who would like to have a data from particular industry, data of targeted customer, particular company, or anything which is available on net like data of email id, website name, search term or anything which is available on web. Most of time a marketing company like to use data scraping and data extraction services to do marketing for a particular product in certain industry and to reach the targeted customer for example if X company like to contact a restaurant of California city, so our software can extract the data of restaurant of California city and a marketing company can use this data to market their restaurant kind of product. MLM and Network marketing company also use data extraction and data scrapping services to to find a new customer by extracting data of certain prospective customer and can contact customer by telephone, sending a postcard, email marketing, and this way they build their huge network and build large group for their own product and company.

We helped many companies to find particular data as per their need for example.

Web Data Extraction

Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human end-users and not for ease of automated use. Because of this, tool kits that scrape web content were created. A web scraper is an API to extract data from a web site. We help you to create a kind of API which helps you to scrape data as per your need. We provide quality and affordable web Data Extraction application

Data Collection

Normally, data transfer between programs is accomplished using info structures suited for automated processing by computers, not people. Such interchange formats and protocols are typically rigidly structured, well-documented, easily parsed, and keep ambiguity to a minimum. Very often, these transmissions are not human-readable at all. That's why the key element that distinguishes data scraping from regular parsing is that the output being scraped was intended for display to an end-user.

Email Extractor

A tool which helps you to extract the email ids from any reliable sources automatically that is called a email extractor. It basically services the function of collecting business contacts from various web pages, HTML files, text files or any other format without duplicates email ids.

Screen scrapping

Screen scraping referred to the practice of reading text information from a computer display terminal's screen and collecting visual data from a source, instead of parsing data as in web scraping.

Data Mining Services

Data Mining Services is the process of extracting patterns from information. Datamining is becoming an increasingly important tool to transform the data into information. Any format including MS excels, CSV, HTML and many such formats according to your requirements.

Web spider

A Web spider is a computer program that browses the World Wide Web in a methodical, automated manner or in an orderly fashion. Many sites, in particular search engines, use spidering as a means of providing up-to-date data.

Web Grabber

Web grabber is just a other name of the data scraping or data extraction.

Web Bot

Web Bot is software program that is claimed to be able to predict future events by tracking keywords entered on the Internet. Web bot software is the best program to pull out articles, blog, relevant website content and many such website related data We have worked with many clients for data extracting, data scrapping and data mining they are really happy with our services we provide very quality services and make your work data work very easy and automatic.

Source: http://ezinearticles.com/?How-Web-Data-Extraction-Services-Will-Save-Your-Time-and-Money-by-Automatic-Data-Collection&id=5159023

Wednesday, 23 November 2016

How Xpath Plays Vital Role In Web Scraping Part 2

How Xpath Plays Vital Role In Web Scraping Part 2

Here is a piece of content on  Xpaths which is the follow up of How Xpath Plays Vital Role In Web Scraping

Let’s dive into a real-world example of scraping amazon website for getting information about deals of the day. Deals of the day in amazon can be found at this URL. So navigate to the amazon (deals of the day) in Firefox and find the XPath selectors. Right click on the deal you like and select “Inspect Element with Firebug”:

If you observe the image below keenly, there you can find the source of the image(deal) and the name of the deal in src, alt attribute’s respectively.

So now let’s write a generic XPath which gathers the name and image source of the product(deal).

  //img[@role=”img”]/@src  ## for image source
  //img[@role=”img”]/@alt   ## for product name

In this post, I’ll show you some tips we found valuable when using XPath in the trenches.

If you have an interest in Python and web scraping, you may have already played with the nice requests library to get the content of pages from the Web. Maybe you have toyed around using Scrapy selector or lxml to make the content extraction easier. Well, now I’m going to show you some tips I found valuable when using XPath in the trenches and we are going to use both lxml and Scrapy selector for HTML parsing.

Avoid using expressions which contains(.//text(), ‘search text’) in your XPath conditions. Use contains(., ‘search text’) instead.

Here is why: the expression .//text() yields a collection of text elements — a node-set(collection of nodes).and when a node-set is converted to a string, which happens when it is passed as argument to a string function like contains() or starts-with(), results in the text for the first element only.

from scrapy import Selector
html_code = “””<a href=”#”>Click here to go to the <strong>Next Page</strong></a>”””
sel = Selector(text=html_code)
xp = lambda x: sel.xpath(x).extract()           # Let’s type this only once
print xp(‘//a//text()’)                                       # Take a peek at the node-set
[u’Click here to go to the ‘, u’Next Page’]   # output of above command
print xp(‘string(//a//text())’)                           # convert it to a string
  [u’Click here to go to the ‘]                           # output of the above command

Let’s do the above one by using lxml then you can implement XPath by both lxml or Scrapy selector as XPath expression is same for both methods.

lxml code:

from lxml import html
html_code = “””<a href=”#”>Click here to go to the <strong>Next Page</strong></a>””” # Parse the text into a tree
parsed_body = html.fromstring(html_code)  # Perform xpaths on the tree
print parsed_body(‘//a//text()’)                      # take a peek at the node-set
[u’Click here to go to the ‘, u’Next Page’]   # output
print parsed_body(‘string(//a//text())’)              # convert it to a string
[u’Click here to go to the ‘]                    # output

A node converted to a string, however, puts together the text of itself plus of all its descendants:

>>> xp(‘//a[1]’)  # selects the first a node
[u'<a href=”#”>Click here to go to the <strong>Next Page</strong></a>’]

>>> xp(‘string(//a[1])’)  # converts it to string
[u’Click here to go to the Next Page’]

Beware of the difference between //node[1] and (//node)[1]//node[1] selects all the nodes occurring first under their respective parents and (//node)[1] selects all the nodes in the document, and then gets only the first of them.

from scrapy import Selector

html_code = “””<ul class=”list”>
<li>1</li>
<li>2</li>
<li>3</li>
</ul>

<ul class=”list”>
<li>4</li>
<li>5</li>
<li>6</li>
</ul>”””

sel = Selector(text=html_code)
xp = lambda x: sel.xpath(x).extract()

xp(“//li[1]”) # get all first LI elements under whatever it is its parent

[u'<li>1</li>’, u'<li>4</li>’]

xp(“(//li)[1]”) # get the first LI element in the whole document

[u'<li>1</li>’]

xp(“//ul/li[1]”)  # get all first LI elements under an UL parent

[u'<li>1</li>’, u'<li>4</li>’]

xp(“(//ul/li)[1]”) # get the first LI element under an UL parent in the document

[u'<li>1</li>’]

Also,

//a[starts-with(@href, ‘#’)][1] gets a collection of the local anchors that occur first under their respective parents and (//a[starts-with(@href, ‘#’)])[1] gets the first local anchor in the document.

When selecting by class, be as specific as necessary.

If you want to select elements by a CSS class, the XPath way to do the same job is the rather verbose:

*[contains(concat(‘ ‘, normalize-space(@class), ‘ ‘), ‘ someclass ‘)]

Let’s cook up some examples:

>>> sel = Selector(text='<p class=”content-author”>Someone</p><p class=”content text-wrap”>Some content</p>’)

>>> xp = lambda x: sel.xpath(x).extract()

BAD: because there are multiple classes in the attribute

>>> xp(“//*[@class=’content’]”)

[]

BAD: gets more content than we need

 >>> xp(“//*[contains(@class,’content’)]”)

     [u'<p class=”content-author”>Someone</p>’,
     u'<p class=”content text-wrap”>Some content</p>’]

GOOD:

>>> xp(“//*[contains(concat(‘ ‘, normalize-space(@class), ‘ ‘), ‘ content ‘)]”)
[u'<p class=”content text-wrap”>Some content</p>’]

And many times, you can just use a CSS selector instead, and even combine the two of them if needed:

ALSO GOOD:

>>> sel.css(“.content”).extract()
[u'<p class=”content text-wrap”>Some content</p>’]

>>> sel.css(‘.content’).xpath(‘@class’).extract()
[u’content text-wrap’]

Learn to use all the different axes.

It is handy to know how to use the axes, you can follow through these examples.

In particular, you should note that following and following-sibling are not the same thing, this is a common source of confusion. The same goes for preceding and preceding-sibling, and also ancestor and parent.

Useful trick to get text content

Here is another XPath trick that you may use to get the interesting text contents: 

//*[not(self::script or self::style)]/text()[normalize-space(.)]

This excludes the content from the script and style tags and also skip whitespace-only text nodes.

Tools & Libraries Used:

Firefox
Firefox inspect element with firebug
Scrapy : 1.1.1
Python : 2.7.12
Requests : 2.11.0

 Have questions? Comment below. Please share if you found this helpful.

Source: http://blog.datahut.co/how-xpath-plays-vital-role-in-web-scraping-part-2/

Saturday, 5 November 2016

Tapping The Mining Services Goldmine

Tapping The Mining Services Goldmine

In Australia, resources booms tend to come and go. In a recent speech, Reserve Bank Deputy Governor Ric Battellino identified five major booms over the last two hundred years - from the gold rush of the 1850s, to our current minerals and energy boom.

Many have argued that the current boom is different from anything we've experienced before, with the modernisation of the Chinese and Indian economies likely to keep demand high for decades. That's led some analysts to talk of a resources supercycle. And yet a supercycle is still a cycle.

By definition, cycles are uneven, with commodity prices ebbing and flowing in response to demand, economic conditions and market sentiment. And the share prices of resources companies tend to move with them.

Which raises the question: what's the best way for investors to tap into the potential of the mining boom, without the heart-stopping volatility that mining stocks sometimes deliver?
Invest in the store that sells the spade

Legend has it that the people who really profited from Australia's gold rush weren't the miners who flocked to the fields, but the store-owners who sold them their spades and pans. You can put the same principle to work today by investing in mining services and engineering companies.

Here are five reasons to consider giving mining services companies a place in your portfolio:

1. Growing demand

In November, the Australian Bureau of Agricultural and Resource Economics reported that mining and energy companies plan to invest a record $132.9bn in new projects, a 58% increase from the previous year. That includes 72 projects at an advanced stage of development, such as the $43bn Gorgon LNG project and the $20bn Olympic dam expansion. The mining services sector is poised to benefit from all of them.

The sector also stands to benefit from Australia's worsening skills shortage, with more companies looking to contractors to provide essential services in remote locations.

2. Less volatility

Resource stocks tend to fluctuate with commodity prices, which are subject to international economic forces and market sentiment beyond the control of any individual company. As a result, they are among the most volatile companies on the Australian sharemarket. But mining services stocks, while still exposed to the commodities cycle, tend to be more stable.

3. More predictable cash flow

One reason for the comparative volatility of commodity companies is that their cash flow can be very variable. In the development phase, they need to make significant capital expenditure, often leading to negative cash flows. And while they enjoy healthy revenues in the production phase, that revenue may diminish as a resource is exhausted, unless they make further investments in exploration and development.
In contrast, mining services companies require comparatively little capital investment, with more predictable cash flows over the long-term.

4. Higher dividends

Predictable cash flows and lower capital expenditures often allow services companies to pay out more of their earnings as dividends, making them more appealing for income-oriented investors.

5. No need to pick winners

Many miners are highly leveraged to demand for a single commodity, whether it's gold, coal, copper or iron ore. Some are reliant on a single mine or field. Whereas services companies generally have a more diversified customer base.

Source: http://ezinearticles.com/?Tapping-The-Mining-Services-Goldmine&id=5924837

Thursday, 20 October 2016

Web Scraping with Python: A Beginner’s Guide

Web Scraping with Python: A Beginner’s Guide

In the Big Data world, Web Scraping or Data extraction services are the primary requisites for Big Data Analytics. Pulling up data from the web has become almost inevitable for companies to stay in business. Next question that comes up is how to go about web scraping as a beginner.

Data can be extracted or scraped from a web source using a number of methods. Popular websites like Google, Facebook, or Twitter offer APIs to view and extract the available data in a structured manner.  This prevents the use of other methods that may not be preferred by the API provider. However, the demand to scrape a website arises when the information is not readily offered by the website. Python, an open source programming language is often used for Web Scraping due to its simple and rich ecosystem. It contains a library called “BeautifulSoup” which carries on this task. Let’s take a deeper look into web scraping using python.

Setting up a Python Environment:

To carry out web scraping using Python, you will first have to install the Python Environment, which enables to run code written in the python language. The libraries perform data scraping;

Beautiful Soup is a convenient-to-use python library. It is one of the finest tools for extracting information from a webpage. Professionals can scrape information from web pages in the form of tables, lists, or paragraphs. Urllib2 is another library that can be used in combination with the BeautifulSoup library for fetching the web pages. Filters can be added to extract specific information from web pages. Urllib2 is a Python module that can fetch URLs.

For MAC OSX :

To install Python libraries on MAC OSX, users need to open a terminal win and type in the following commands, single command at a time:

sudoeasy_install pip

pip install BeautifulSoup4

pip install lxml

For Windows 7 & 8 users:

Windows 7 & 8 users need to ensure that the python environment gets installed first. Once, the environment is installed, open the command prompt and find the way to root C:/ directory and type in the following commands:

easy_install BeautifulSoup4

easy_installlxml

Once the libraries are installed, it is time to write data scraping code.

Running Python:

Data scraping must be done for a distinct objective such as to scrape current stock of a retail store. First, a web browser is required to navigate the website that contains this data. After identifying the table, right click anywhere on it and then select inspect element from the dropdown menu list. This will cause a window to pop-up on the bottom or side of your screen displaying the website’s html code. The rankings appear in a table. You might need to scan through the HTML data until you find the line of code that highlights the table on the webpage.

Python offers some other alternatives for HTML scraping apart from BeautifulSoup. They include:

    Scrapy
    Scrapemark
    Mechanize

 Web scraping converts unstructured data from HTML code into structured form such as tabular data in an Excel worksheet. Web scraping can be done in many ways ranging from the use of Google Docs to programming languages. For people who do not have any programming knowledge or technical competencies, it is possible to acquire web data by using web scraping services that provide ready to use data from websites of your preference.

HTML Tags:

To perform web scraping, users must have a sound knowledge of HTML tags. It might help a lot to know that HTML links are defined using anchor tag i.e. <a> tag, “<a href=“http://…”>The link needs to be here </a>”. An HTML list comprises <ul> (unordered) and <ol> (ordered) list. The item of list starts with <li>.

HTML tables are defined with<Table>, row as <tr> and columns are divided into data as <td>;

    <!DOCTYPE html> : A HTML document starts with a document type declaration
    The main part of the HTML document in unformatted, plain text is defined by <body> and </body> tags
    The headings in HTML are defined using the heading tags from <h1> to <h5>
    Paragraphs are defined with the <p> tag in HTML
    An entire HTML document is contained between <html> and </html>

Using BeautifulSoup in Scraping:

While scraping a webpage using BeautifulSoup, the main concern is to identify the final objective. For instance, if you would like to extract a list from webpage, a step wise approach is required:

    First and foremost step is to import the required libraries:

 #import the library used to query a website

import urllib2

#specify the url wiki = “https://”

#Query the website and return the html to the variable ‘page’

page = urllib2.urlopen(wiki)

#import the Beautiful soup functions to parse the data returned from the website

from bs4 import BeautifulSoup

#Parse the html in the ‘page’ variable, and store it in Beautiful Soup format

soup = BeautifulSoup(page)

    Use function “prettify” to visualize nested structure of HTML page
    Working with Soup tags:

Soup<tag> is used for returning content between opening and closing tag including tag.

    In[30]:soup.title

 Out[30]:<title>List of Presidents in India till 2010 – Wikipedia, the free encyclopedia</title>

    soup.<tag>.string: Return string within given tag
    In [38]:soup.title.string
    Out[38]:u ‘List of Presidents in India and Brazil till 2010 in India – Wikipedia, the free encyclopedia’
    Find all the links within page’s <a> tags: Tag a link using tag “<a>”. So, go with option soup.a and it should return the links available in the web page. Let’s do it.
    In [40]:soup.a

Out[40]:<a id=”top”></a>

    Find the right table:

As a table to pull up information about Presidents in India and Brazil till 2010 is being searched for, identifying the right table first is important. Here’s a command to scrape information enclosed in all table tags.

all_tables= soup.find_all(‘table’)

Identify the right table by using attribute “class” of table needs to filter the right table. Thereafter, inspect the class name by right clicking on the required table of web page as follows:

    Inspect element
    Copy the class name or find the class name of right table from the last command’s output.

 right_table=soup.find(‘table’, class_=’wikitable sortable plainrowheaders’)

right_table

That’s how we can identify the right table.

    Extract the information to DataFrame: There is a need to iterate through each row (tr) and then assign each element of tr (td) to a variable and add it to a list. Let’s analyse the Table’s HTML structure of the table. (extract information for table heading <th>)

To access value of each element, there is a need to use “find(text=True)” option with each element.  Finally, there is data in dataframe.

There are various other ways to scrape data using “BeautifulSoup” that reduce manual efforts to collect data from web pages. Code written in BeautifulSoup is considered to be more robust than the regular expressions. The web scraping method we discussed use “BeautifulSoup” and “urllib2” libraries in Python. That was a brief beginner’s guide to start using Python for web scraping.

Source: https://www.promptcloud.com/blog/web-scraping-python-guide

Saturday, 1 October 2016

Scraping Yelp Business Data With Python Scraping Script

Scraping Yelp Business Data With Python Scraping Script

Yelp is a great source of business contact information with details like address, postal code, contact information; website addresses etc. that other site like Google Maps just does not. Yelp also provides reviews about the particular business. The yelp business database can be useful for telemarketing, email marketing and lead generation.

Are you looking for yelp business details database? Are you looking for scraping data from yelp website/business directory? Are you looking for yelp screen scraping software? Are you looking for scraping the business contact information from the online Yelp? Then you are at the right place.

Here I am going to discuss how to scrape yelp data for lead generation and email marketing. I have made a simple and straight forward yelp data scraping script in python that can scrape data from yelp website. You can use this yelp scraper script absolutely free.

I have used urllib, BeautifulSoup packages. Urllib package to make http request and parsed the HTML using BeautifulSoup, used Threads to make the scraping faster.
Yelp Scraping Python Script

import urllib
from bs4 import BeautifulSoup
import re
from threading import Thread

#List of yelp urls to scrape
url=['http://www.yelp.com/biz/liman-fisch-restaurant-hamburg','http://www.yelp.com/biz/casa-franco-caramba-hamburg','http://www.yelp.com/biz/o-ren-ishii-hamburg','http://www.yelp.com/biz/gastwerk-hotel-hamburg-hamburg-2','http://www.yelp.com/biz/superbude-hamburg-2','http://www.yelp.com/biz/hotel-hafen-hamburg-hamburg','http://www.yelp.com/biz/hamburg-marriott-hotel-hamburg','http://www.yelp.com/biz/yoho-hamburg']

i=0
#function that will do actual scraping job
def scrape(ur):

          html = urllib.urlopen(ur).read()
          soup = BeautifulSoup(html)

      title = soup.find('h1',itemprop="name")
          saddress = soup.find('span',itemprop="streetAddress")
          postalcode = soup.find('span',itemprop="postalCode")
          print title.text
          print saddress.text
          print postalcode.text
          print "-------------------"

threadlist = []

#making threads
while i<len(url):
          t = Thread(target=scrape,args=(url[i],))
          t.start()
          threadlist.append(t)
          i=i+1

for b in threadlist:
          b.join()

Recently I had worked for one German company and did yelp scraping project for them and delivered data as per their requirement. If you looking for scraping data from business directories like yelp then send me your requirement and I will get back to you with sample.

Source: http://webdata-scraping.com/scraping-yelp-business-data-python-scraping-script/

Tuesday, 20 September 2016

Powerful Web Scraping Software – Content Grabber Review

Powerful Web Scraping Software – Content Grabber Review

There are many web scraping software and cloud based web scraping services available in the market for extracting data from the websites. They vary widely in cost and features. In this article, I am going to introduce one such advanced web scraping tool “Content Grabber”, which is widely used and the best web scraping software in the market.

Content Grabber is used for web extraction, web scraping and web automation. It can extract content from complex websites and export it as structured data in a variety of formats like Excel Spreadsheets, XML, CSV and databases. Content Grabber can also extract data from highly dynamic websites. It can extract from AJAX-enabled websites, submit forms repeatedly to cover all possible input values, and manage website logins.

Content Grabber is designed to be reliable, scalable and customizable. It is specifically designed for users with a critical reliance on web scraping and web data extraction. It also enables you to make standalone web scraping agents which you can market and sell as your own royalty free web scraping software.

Applications of Content Grabber:

The following are the few applications of Content Grabber:

  •     Data aggregation – for example news aggregation.
  •     Competitive pricing and monitoring e.g. monitor dealers for price compliance.
  •     Financial and Market Research e.g. Make proactive buying and selling decisions by continuously receiving corporate operational data.
  •     Content Integration i.e. integration of data from various sources at one place.
  •     Business Directory Scraping – for example: yellow pages scraping, yelp scraping, superpages scraping etc.
  •     Extracting company data from yellow pages for scraping common data fields like Business Name, Address, Telephone, Fax, Email, Website and Category of Business.
  •     Extracting eBay auction data like: eBay Product Name, Store Information, Buy it Now prices, Product Price, List Price, Seller Price and many more.
  •     Extracting Amazon product data: Information such as Product title, cost, description, details, availability, shipping info, ASIN, rating, rank, etc can be extracted.

Content Grabber Features:

The following section highlights some of the key features of Content Grabber:

1. Point and Click Interface

The Content Grabber editor has an easy to use point and click interface that provides easy point and click configuration. One simply needs to click on web elements to configure website navigation and content capture.

2. Easy to Use

The Content Grabber point and click interface is so simple to use that it can easily be used by beginners and non-programmers. There is certain built in facilities that automatically detect and configure all commands. It will automatically create a list of links, lists of content, manage pagination, handle web pages, download or upload files and capture any action you perform on a web page. You can also manually configure the agent commands, so Content Grabber gives you both simplicity and control.

3. Reliable and Scalable

Content Grabber’s powerful features like testing and debugging, solid error handling and error recovery, allows agent to run in the most difficult scenarios. It easily handles and scrapes dynamic websites built with JavaScript and AJAX. Content Grabber’s Intelligent agents don’t break with most site structure changes. These features enable us to build reliable web scraping agents. There are various configurations and performance tuning options that makes Content Grabber scalable. You can build as many web scraping agents as you want with Content Grabber.

4. High Performance

Multi-threading is used to increase the performance in Content Grabber. Content Grabber uses optimized web browsers. It uses static browsers for static web pages and dynamic browsers for dynamic web pages. It has an ultra-fast HTML5 parser for ultra-fast web scraping. One can use many web browsers concurrently to boost performance.

5. Debugging, Logging and Error Handling

Content Grabber has robust support for debugging, error handling and logging. Using a debugger, you can test and debug the web scraping agents which helps you to build reliable and error free web scraping solutions because most of the issues are addressed at design time. Content Grabber allows agent logging with three detail levels: Log URLs, Log raw HTML, Log to database or file. Logs can be useful to identify problems that occurred during execution of a web scraping agent. Content Grabber supports automatic error handling and custom error handling through scripting. Error status reports can also be mailed to administrators.

6. Scripting

Content Grabber comes with a built in script editor with IntelliSense that one can use in case of some unusual requirements or to fine tune some process. Scripting can be used to control agent behaviour, content transformation, customize data export and delivery and to generate data inputs for agent.

7. Unlimited Web Scraping Agents

Content Grabber allows building an unlimited number of Self-Contained Web Scraping Agents. Self-Contained agents are a standalone executable that can be run independently, branded as your own and distributed royalty free. Content Grabber provides an easy to use and effective GUI to manage all the agents. One can view status and logs of all the agents or run and schedule the agents in one centralized location.

8. Automation

Require data on a schedule? Weekly? Everyday? Each hour? Content Grabber allows automating and publishing extracted data. Configure Content Grabber by telling what data you want once, and then schedule it to run automatically.

And much more

There are too many features that Content Grabber provides, but here are a few more that may be useful and interest you.

  •     Schedule agents
  •     Manage proxies
  •     Custom notification criteria and messages
  •     Email notifications
  •     Handle websites logins
  •     Capture Screenshots of web elements or entire web page or save as PDF.
  •     Capture hidden content on web page.
  •     Crawl entire website
  •     Input data from almost any data source.
  •     Auto scroll to load dynamic data
  •     Handle complex JAVASCRIPT and AJAX actions
  •     XPATH support
  •     Convert Images to Text
  •     CAPTCHA handling
  •     Extract data from non-HTML documents like PDF and Word Documents
  •     Multi-threading and multiple web browsers
  •     Run agent from command line.

The above features come with the Professional edition license. Content Grabber’s Premium edition license is available with the following extra features:

1. Visual Studio 2013 integration

One can integrate Content Grabber to Visual Studio and take advantages of extra powerful script editing, debugging, and unit testing.

2. Remove Content Grabber branding

One can remove Content Grabber branding from the Content Grabber agents and distribute the executable.

3. Custom Design Templates

One can customize the Content Grabber agent user interface design with custom HTML templates – e.g. add your own company branding.

4. Royalty free distribution

One can distribute the Content Grabber agent to anybody without paying royalty fees and can run agents from the command line anywhere.

5. Programming Interface

Programming interfaces like Desktop API, Web API and windows service for building and editing agents.

6. Custom Web Scraping Application Development:

Content Grabber provides API and Visual Studio Integration which developer can use to build custom web scraping applications. It provides full control of the user interface and export functionality. One can develop both Desktop as well as Web based custom web scraping applications using the Content Grabber programming interface. It is a great tool and provides opportunity for developers to build general web scraping applications and sell those to generate revenue.

Are you looking for web scraping services? Do you need any assistance related to Content Grabber? We can probably help you to achieve your scraping-based project goals. We would be more than happy to hear from you.

Source: http://webdata-scraping.com/powerful-web-scraping-software-content-grabber/

Thursday, 8 September 2016

How to Use Microsoft Excel as a Web Scraping Tool

How to Use Microsoft Excel as a Web Scraping Tool

Microsoft Excel is undoubtedly one of the most powerful tools to manage information in a structured form. The immense popularity of Excel is not without reasons. It is like the Swiss army knife of data with its great features and capabilities. Here is how Excel can be used as a basic web scraping tool to extract web data directly into a worksheet. We will be using Excel web queries to make this happen.

Web queries is a feature of Excel which is basically used to fetch data on a web page into the Excel worksheet easily. It can automatically find tables on the webpage and would let you pick the particular table you need data from. Web queries can also be handy in situations where an ODBC connection is impossible to maintain apart from just extracting data from web pages. Let’s see how web queries work and how you can scrape HTML tables off the web using them.
Getting started

We’ll start with a simple Web query to scrape data from the Yahoo! Finance page. This page is particularly easier to scrape and hence is a good fit for learning the method. The page is also pretty straightforward and doesn’t have important information in the form of links or images. Here is the URL we will be using for the tutorial:

http://finance.yahoo.com/q/hp?s=GOOG

To create a new Web query:

1. Select the cell in which you want the data to appear.
2. Click on Data-> From Web
3. The New Web query box will pop up as shown below.

4. Enter the web page URL you need to extract data from in the Address bar and hit the Go button.
5. Click on the yellow-black buttons next to the table you need to extract data from.

6. After selecting the required tables, click on the Import button and you’re done. Excel will now start downloading the content of the selected tables into your worksheet.

Once you have the data scraped into your Excel worksheet, you can do a host of things like creating charts, sorting, formatting etc. to better understand or present the data in a simpler way.
Customizing the query

Once you have created a web query, you have the option to customize it according to your requirements. To do this, access Web query properties by right clicking on a cell with the extracted data. The page you were querying appears again, click on the Options button to the right of the address bar. A new pop up box will be displayed where you can customize how the web query interacts with the target page. The options here lets you change some of the basic things related to web pages like the formatting and redirections.

Apart from this, you can also alter the data range options by right clicking on a random cell with the query results and selecting Data range properties. The data range properties dialog box will pop up where you can make the required changes. You might want to rename the data range to something you can easily recognize like ‘Stock Prices’.

Auto refresh

Auto-refresh is a feature of web queries worth mentioning, and one which makes our Excel web scraper truly powerful. You can make the extracted data to be auto-refreshing so that your Excel worksheet will update the data whenever the source website changes. You can set how often you need the data to be updated from the source web page in data range options menu. The auto refresh feature can be enabled by ticking the box beside ‘Refresh every’ and setting your preferred time interval for updating the data.
Web scraping at scale

Although extracting data using Excel can be a great way to scrape html tables from the web, it is nowhere close to a real web scraping solution. This can prove to be useful if you are collecting data for your college research paper or you are a hobbyist looking for a cheap way to get your hands on some data. If data for business is your need, you will definitely have to depend on a web scraping provider with expertise in dealing with web scraping at scale. Outsourcing the complicated process that web scraping will also give you more room to deal with other things that need extra attention such as marketing your business.

Source: https://www.promptcloud.com/blog/how-to-use-excel-to-scrape-websites

Tuesday, 30 August 2016

How to use Social Media Scraping to be your Competitors’ Nightmare

How to use Social Media Scraping to be your Competitors’ Nightmare

Big data and competitive intelligence have been in the limelight for quite some time now. The almost magical power of big data to help a company make just the right decisions have been talked about a lot. When it comes to big data, the kind of benefits that a business can get totally depends upon the sources they acquire it from. Social media is one of the best sources from where you can get data that helps your business in a multitude of ways. Now that every business is deep rooted on the internet, social media data becomes all the more relevant and crucial. Here is how you can use data scraped from social media sites to get an edge in the competition.

Keeping watch on your competitors

Social media is the best place to watch your competitors’ activity and take counter initiatives to keep up or take over them. If you want to know what your competitors are up to, a social media scraping setup for scraping the posts that mention your competitors’ brand/product names can do the trick. This can also be used to learn a thing or two from their activities on social media so that you can take respective measures to stay ahead of them. For example, you could know if your competitor is running a special promotional offer at the moment and come up with something better than theirs to keep up. This can do wonders if you are in a highly competitive industry like Ecommerce where the competition is intense. If you are not using some help from web scraping technology to keep a close watch on your competitors, you could easily get left over in this fast-paced business scene.

Solving customer issues at the earliest

Customers are vocal about their experience with different products and services on social media sites these days. If you have a customer whose issue was left unsolved, there is a good chance that he/she will take it to the social media to vent the frustration. Watching out for such instances and giving them prompt support should be something you should do if you want to retain these customers and stop them from ruining your brand’s image. By scraping social media sites for posts that mention your product/service, you can easily find out if there are such grievances from customers. This can make sure to an extent that you don’t let unhappy customers stay that way, which eventually hurts your business in the long run. Customers can make or break your company, so using social media scraping to serve the customers better can help you succeed eventually.

Sentiment analysis

Social media data can play a good job at helping you understand user sentiments. With the help of social media scraping, a business can get the big picture about general perception of their brand by their users. This can go a long way since this level of feedback can help you fix unnoticed issues with your company and service quickly. By rectifying them, you can make your brand more appealing to the customers. Sentiment analysis will provide you with the opportunity to transform your business into how customers want it to be. Social media scraping is the one and only way to have access to this user sentiment data which can help you optimize your business for the customers.

Web crawling for social media data

When social media data possess so much value to businesses, it makes sense to look for efficient ways to gather and use this data. Manually scrolling through millions of tweets doesn’t make sense, this is why you should use social media scraping to aggregate the relevant data for your business. Besides, web scraping technologies make it possible to handle huge amounts of data with ease. Since the size of data is huge when it comes to business related requirements, web scraping is the only scalable solution worth considering. To make things even simpler, there are reliable web scraping solutions that offer social media scraping services for brand monitoring.

Bottom line

Since social media has become an integral part of online businesses, the data available on these sites possess immense value to companies in every industry. Social media scraping can be used for brand monitoring and gaining competitive intelligence that can be used to optimize your business model for maximum effectiveness. This will in turn make your company stand out from the competition and the added advantage of insights gained from social media data will help you to take over your competitors.

Source: https://www.promptcloud.com/blog/social-media-scraping-for-competitive-intelligence

Monday, 22 August 2016

Three Common Methods For Web Data Extraction

Three Common Methods For Web Data Extraction

Probably the most common technique used traditionally to extract data from web pages this is to cook up some regular expressions that match the pieces you want (e.g., URL's and link titles). Our screen-scraper software actually started out as an application written in Perl for this very reason. In addition to regular expressions, you might also use some code written in something like Java or Active Server Pages to parse out larger chunks of text. Using raw regular expressions to pull out the data can be a little intimidating to the uninitiated, and can get a bit messy when a script contains a lot of them. At the same time, if you're already familiar with regular expressions, and your scraping project is relatively small, they can be a great solution.

Other techniques for getting the data out can get very sophisticated as algorithms that make use of artificial intelligence and such are applied to the page. Some programs will actually analyze the semantic content of an HTML page, then intelligently pull out the pieces that are of interest. Still other approaches deal with developing "ontologies", or hierarchical vocabularies intended to represent the content domain.

There are a number of companies (including our own) that offer commercial applications specifically intended to do screen-scraping. The applications vary quite a bit, but for medium to large-sized projects they're often a good solution. Each one will have its own learning curve, so you should plan on taking time to learn the ins and outs of a new application. Especially if you plan on doing a fair amount of screen-scraping it's probably a good idea to at least shop around for a screen-scraping application, as it will likely save you time and money in the long run.

So what's the best approach to data extraction? It really depends on what your needs are, and what resources you have at your disposal. Here are some of the pros and cons of the various approaches, as well as suggestions on when you might use each one:

Raw regular expressions and code

Advantages:

- If you're already familiar with regular expressions and at least one programming language, this can be a quick solution.

- Regular expressions allow for a fair amount of "fuzziness" in the matching such that minor changes to the content won't break them.

- You likely don't need to learn any new languages or tools (again, assuming you're already familiar with regular expressions and a programming language).

- Regular expressions are supported in almost all modern programming languages. Heck, even VBScript has a regular expression engine. It's also nice because the various regular expression implementations don't vary too significantly in their syntax.

Disadvantages:

- They can be complex for those that don't have a lot of experience with them. Learning regular expressions isn't like going from Perl to Java. It's more like going from Perl to XSLT, where you have to wrap your mind around a completely different way of viewing the problem.

- They're often confusing to analyze. Take a look through some of the regular expressions people have created to match something as simple as an email address and you'll see what I mean.

- If the content you're trying to match changes (e.g., they change the web page by adding a new "font" tag) you'll likely need to update your regular expressions to account for the change.

- The data discovery portion of the process (traversing various web pages to get to the page containing the data you want) will still need to be handled, and can get fairly complex if you need to deal with cookies and such.

When to use this approach: You'll most likely use straight regular expressions in screen-scraping when you have a small job you want to get done quickly. Especially if you already know regular expressions, there's no sense in getting into other tools if all you need to do is pull some news headlines off of a site.

Ontologies and artificial intelligence

Advantages:

- You create it once and it can more or less extract the data from any page within the content domain you're targeting.

- The data model is generally built in. For example, if you're extracting data about cars from web sites the extraction engine already knows what the make, model, and price are, so it can easily map them to existing data structures (e.g., insert the data into the correct locations in your database).

- There is relatively little long-term maintenance required. As web sites change you likely will need to do very little to your extraction engine in order to account for the changes.

Disadvantages:

- It's relatively complex to create and work with such an engine. The level of expertise required to even understand an extraction engine that uses artificial intelligence and ontologies is much higher than what is required to deal with regular expressions.

- These types of engines are expensive to build. There are commercial offerings that will give you the basis for doing this type of data extraction, but you still need to configure them to work with the specific content domain you're targeting.

- You still have to deal with the data discovery portion of the process, which may not fit as well with this approach (meaning you may have to create an entirely separate engine to handle data discovery). Data discovery is the process of crawling web sites such that you arrive at the pages where you want to extract data.

When to use this approach: Typically you'll only get into ontologies and artificial intelligence when you're planning on extracting information from a very large number of sources. It also makes sense to do this when the data you're trying to extract is in a very unstructured format (e.g., newspaper classified ads). In cases where the data is very structured (meaning there are clear labels identifying the various data fields), it may make more sense to go with regular expressions or a screen-scraping application.

Screen-scraping software

Advantages:

- Abstracts most of the complicated stuff away. You can do some pretty sophisticated things in most screen-scraping applications without knowing anything about regular expressions, HTTP, or cookies.

- Dramatically reduces the amount of time required to set up a site to be scraped. Once you learn a particular screen-scraping application the amount of time it requires to scrape sites vs. other methods is significantly lowered.

- Support from a commercial company. If you run into trouble while using a commercial screen-scraping application, chances are there are support forums and help lines where you can get assistance.

Disadvantages:

- The learning curve. Each screen-scraping application has its own way of going about things. This may imply learning a new scripting language in addition to familiarizing yourself with how the core application works.

- A potential cost. Most ready-to-go screen-scraping applications are commercial, so you'll likely be paying in dollars as well as time for this solution.

- A proprietary approach. Any time you use a proprietary application to solve a computing problem (and proprietary is obviously a matter of degree) you're locking yourself into using that approach. This may or may not be a big deal, but you should at least consider how well the application you're using will integrate with other software applications you currently have. For example, once the screen-scraping application has extracted the data how easy is it for you to get to that data from your own code?

When to use this approach: Screen-scraping applications vary widely in their ease-of-use, price, and suitability to tackle a broad range of scenarios. Chances are, though, that if you don't mind paying a bit, you can save yourself a significant amount of time by using one. If you're doing a quick scrape of a single page you can use just about any language with regular expressions. If you want to extract data from hundreds of web sites that are all formatted differently you're probably better off investing in a complex system that uses ontologies and/or artificial intelligence. For just about everything else, though, you may want to consider investing in an application specifically designed for screen-scraping.

As an aside, I thought I should also mention a recent project we've been involved with that has actually required a hybrid approach of two of the aforementioned methods. We're currently working on a project that deals with extracting newspaper classified ads. The data in classifieds is about as unstructured as you can get. For example, in a real estate ad the term "number of bedrooms" can be written about 25 different ways. The data extraction portion of the process is one that lends itself well to an ontologies-based approach, which is what we've done. However, we still had to handle the data discovery portion. We decided to use screen-scraper for that, and it's handling it just great. The basic process is that screen-scraper traverses the various pages of the site, pulling out raw chunks of data that constitute the classified ads. These ads then get passed to code we've written that uses ontologies in order to extract out the individual pieces we're after. Once the data has been extracted we then insert it into a database.

Source: http://ezinearticles.com/?Three-Common-Methods-For-Web-Data-Extraction&id=165416

Wednesday, 10 August 2016

Getting Data from the Web

Getting Data from the Web

You’ve tried everything else, and you haven’t managed to get your hands on the data you want. You’ve found the data on the web, but, alas — no download options are available and copy-paste has failed you. Fear not, there may still be a way to get the data out. For example you can:

Get data from web-based APIs, such as interfaces provided by online databases and many modern web applications (including Twitter, Facebook and many others). This is a fantastic way to access government or commercial data, as well as data from social media sites.

Extract data from PDFs. This is very difficult, as PDF is a language for printers and does not retain much information on the structure of the data that is displayed within a document. Extracting information from PDFs is beyond the scope of this book, but there are some tools and tutorials that may help you do it.

Screen scrape web sites. During screen scraping, you’re extracting structured content from a normal web page with the help of a scraping utility or by writing a small piece of code. While this method is very powerful and can be used in many places, it requires a bit of understanding about how the web works.

With all those great technical options, don’t forget the simple options: often it is worth to spend some time searching for a file with machine-readable data or to call the institution which is holding the data you want.

In this chapter we walk through a very basic example of scraping data from an HTML web page.
What is machine-readable data?

The goal for most of these methods is to get access to machine-readable data. Machine readable data is created for processing by a computer, instead of the presentation to a human user. The structure of such data relates to contained information, and not the way it is displayed eventually. Examples of easily machine-readable formats include CSV, XML, JSON and Excel files, while formats like Word documents, HTML pages and PDF files are more concerned with the visual layout of the information. PDF for example is a language which talks directly to your printer, it’s concerned with position of lines and dots on a page, rather than distinguishable characters.
Scraping web sites: what for?

Everyone has done this: you go to a web site, see an interesting table and try to copy it over to Excel so you can add some numbers up or store it for later. Yet this often does not really work, or the information you want is spread across a large number of web sites. Copying by hand can quickly become very tedious, so it makes sense to use a bit of code to do it.

The advantage of scraping is that you can do it with virtually any web site — from weather forecasts to government spending, even if that site does not have an API for raw data access.
What you can and cannot scrape

There are, of course, limits to what can be scraped. Some factors that make it harder to scrape a site include:

Badly formatted HTML code with little or no structural information e.g. older government websites.

Authentication systems that are supposed to prevent automatic access e.g. CAPTCHA codes and paywalls.

Session-based systems that use browser cookies to keep track of what the user has been doing.

A lack of complete item listings and possibilities for wildcard search.

Blocking of bulk access by the server administrators.

Another set of limitations are legal barriers: some countries recognize database rights, which may limit your right to re-use information that has been published online. Sometimes, you can choose to ignore the license and do it anyway — depending on your jurisdiction, you may have special rights as a journalist. Scraping freely available Government data should be fine, but you may wish to double check before you publish. Commercial organizations — and certain NGOs — react with less tolerance and may try to claim that you’re “sabotaging” their systems. Other information may infringe the privacy of individuals and thereby violate data privacy laws or professional ethics.
Tools that help you scrape

There are many programs that can be used to extract bulk information from a web site, including browser extensions and some web services. Depending on your browser, tools like Readability (which helps extract text from a page) or DownThemAll (which allows you to download many files at once) will help you automate some tedious tasks, while Chrome’s Scraper extension was explicitly built to extract tables from web sites. Developer extensions like FireBug (for Firefox, the same thing is already included in Chrome, Safari and IE) let you track exactly how a web site is structured and what communications happen between your browser and the server.

ScraperWiki is a web site that allows you to code scrapers in a number of different programming languages, including Python, Ruby and PHP. If you want to get started with scraping without the hassle of setting up a programming environment on your computer, this is the way to go. Other web services, such as Google Spreadsheets and Yahoo! Pipes also allow you to perform some extraction from other web sites.
How does a web scraper work?

Web scrapers are usually small pieces of code written in a programming language such as Python, Ruby or PHP. Choosing the right language is largely a question of which community you have access to: if there is someone in your newsroom or city already working with one of these languages, then it makes sense to adopt the same language.

While some of the click-and-point scraping tools mentioned before may be helpful to get started, the real complexity involved in scraping a web site is in addressing the right pages and the right elements within these pages to extract the desired information. These tasks aren’t about programming, but understanding the structure of the web site and database.

When displaying a web site, your browser will almost always make use of two technologies: HTTP is a way for it to communicate with the server and to request specific resource, such as documents, images or videos. HTML is the language in which web sites are composed.
The anatomy of a web page

Any HTML page is structured as a hierarchy of boxes (which are defined by HTML “tags”). A large box will contain many smaller ones — for example a table that has many smaller divisions: rows and cells. There are many types of tags that perform different functions — some produce boxes, others tables, images or links. Tags can also have additional properties (e.g. they can be unique identifiers) and can belong to groups called ‘classes’, which makes it possible to target and capture individual elements within a document. Selecting the appropriate elements this way and extracting their content is the key to writing a scraper.

Viewing the elements in a web page: everything can be broken up into boxes within boxes.

To scrape web pages, you’ll need to learn a bit about the different types of elements that can be in an HTML document. For example, the <table> element wraps a whole table, which has <tr> (table row) elements for its rows, which in turn contain <td> (table data) for each cell. The most common element type you will encounter is <div>, which can basically mean any block of content. The easiest way to get a feel for these elements is by using the developer toolbar in your browser: they will allow you to hover over any part of a web page and see what the underlying code is.

Tags work like book ends, marking the start and the end of a unit. For example <em> signifies the start of an italicized or emphasized piece of text and </em> signifies the end of that section. Easy.

An example: scraping nuclear incidents with Python

NEWS is the International Atomic Energy Agency’s (IAEA) portal on world-wide radiation incidents (and a strong contender for membership in the Weird Title Club!). The web page lists incidents in a simple, blog-like site that can be easily scraped.

To start, create a new Python scraper on ScraperWiki and you will be presented with a text area that is mostly empty, except for some scaffolding code. In another browser window, open the IAEA site and open the developer toolbar in your browser. In the “Elements” view, try to find the HTML element for one of the news item titles. Your browser’s developer toolbar helps you connect elements on the web page with the underlying HTML code.

Investigating this page will reveal that the titles are <h4> elements within a <table>. Each event is a <tr> row, which also contains a description and a date. If we want to extract the titles of all events, we should find a way to select each row in the table sequentially, while fetching all the text within the title elements.

In order to turn this process into code, we need to make ourselves aware of all the steps involved. To get a feeling for the kind of steps required, let’s play a simple game: In your ScraperWiki window, try to write up individual instructions for yourself, for each thing you are going to do while writing this scraper, like steps in a recipe (prefix each line with a hash sign to tell Python that this not real computer code). For example:

  # Look for all rows in the table
  # Unicorn must not overflow on left side.

Try to be as precise as you can and don’t assume that the program knows anything about the page you’re attempting to scrape.

Once you’ve written down some pseudo-code, let’s compare this to the essential code for our first scraper:

  import scraperwiki
  from lxml import html

In this first section, we’re importing existing functionality from libraries — snippets of pre-written code. scraperwiki will give us the ability to download web sites, while lxml is a tool for the structured analysis of HTML documents. Good news: if you are writing a Python scraper with ScraperWiki, these two lines will always be the same.

  url = "http://www-news.iaea.org/EventList.aspx"
  doc_text = scraperwiki.scrape(url)
  doc = html.fromstring(doc_text)

Next, the code makes a name (variable): url, and assigns the URL of the IAEA page as its value. This tells the scraper that this thing exists and we want to pay attention to it. Note that the URL itself is in quotes as it is not part of the program code but a string, a sequence of characters.

We then use the url variable as input to a function, scraperwiki.scrape. A function will provide some defined job — in this case it’ll download a web page. When it’s finished, it’ll assign its output to another variable, doc_text. doc_text will now hold the actual text of the website — not the visual form you see in your browser, but the source code, including all the tags. Since this form is not very easy to parse, we’ll use another function, html.fromstring, to generate a special representation where we can easily address elements, the so-called document object model (DOM).

  for row in doc.cssselect("#tblEvents tr"):
  link_in_header = row.cssselect("h4 a").pop()
  event_title = link_in_header.text
  print event_title

In this final step, we use the DOM to find each row in our table and extract the event’s title from its header. Two new concepts are used: the for loop and element selection (.cssselect). The for loop essentially does what its name implies; it will traverse a list of items, assigning each a temporary alias (row in this case) and then run any indented instructions for each item.

The other new concept, element selection, is making use of a special language to find elements in the document. CSS selectors are normally used to add layout information to HTML elements and can be used to precisely pick an element out of a page. In this case (Line. 6) we’re selecting #tblEvents tr which will match each <tr> within the table element with the ID tblEvents (the hash simply signifies ID). Note that this will return a list of <tr> elements.

As can be seen on the next line (Line. 7), where we’re applying another selector to find any <a> (which is a hyperlink) within a <h4> (a title). Here we only want to look at a single element (there’s just one title per row), so we have to pop it off the top of the list returned by our selector with the .pop() function.

Note that some elements in the DOM contain actual text, i.e. text that is not part of any markup language, which we can access using the [element].text syntax seen on line 8. Finally, in line 9, we’re printing that text to the ScraperWiki console. If you hit run in your scraper, the smaller window should now start listing the event’s names from the IAEA web site.

  figs/incoming/04-DD.png
  Figure 58. A scraper in action (ScraperWiki)

You can now see a basic scraper operating: it downloads the web page, transforms it into the DOM form and then allows you to pick and extract certain content. Given this skeleton, you can try and solve some of the remaining problems using the ScraperWiki and Python documentation:

Can you find the address for the link in each event’s title?

Can you select the small box that contains the date and place by using its CSS class name and extract the element’s text?

ScraperWiki offers a small database to each scraper so you can store the results; copy the relevant example from their docs and adapt it so it will save the event titles, links and dates.

The event list has many pages; can you scrape multiple pages to get historic events as well?

As you’re trying to solve these challenges, have a look around ScraperWiki: there are many useful examples in the existing scrapers — and quite often, the data is pretty exciting, too. This way, you don’t need to start off your scraper from scratch: just choose one that is similar, fork it and adapt to your problem.

Source: http://datajournalismhandbook.org/1.0/en/getting_data_3.html

Thursday, 4 August 2016

Three Common Methods For Web Data Extraction

Three Common Methods For Web Data Extraction

Probably the most common technique used traditionally to extract data from web pages this is to cook up some regular expressions that match the pieces you want (e.g., URL's and link titles). Our screen-scraper software actually started out as an application written in Perl for this very reason. In addition to regular expressions, you might also use some code written in something like Java or Active Server Pages to parse out larger chunks of text. Using raw regular expressions to pull out the data can be a little intimidating to the uninitiated, and can get a bit messy when a script contains a lot of them. At the same time, if you're already familiar with regular expressions, and your scraping project is relatively small, they can be a great solution.

Other techniques for getting the data out can get very sophisticated as algorithms that make use of artificial intelligence and such are applied to the page. Some programs will actually analyze the semantic content of an HTML page, then intelligently pull out the pieces that are of interest. Still other approaches deal with developing "ontologies", or hierarchical vocabularies intended to represent the content domain.

There are a number of companies (including our own) that offer commercial applications specifically intended to do screen-scraping. The applications vary quite a bit, but for medium to large-sized projects they're often a good solution. Each one will have its own learning curve, so you should plan on taking time to learn the ins and outs of a new application. Especially if you plan on doing a fair amount of screen-scraping it's probably a good idea to at least shop around for a screen-scraping application, as it will likely save you time and money in the long run.

So what's the best approach to data extraction? It really depends on what your needs are, and what resources you have at your disposal. Here are some of the pros and cons of the various approaches, as well as suggestions on when you might use each one:

Raw regular expressions and code

Advantages:

- If you're already familiar with regular expressions and at least one programming language, this can be a quick solution.

- Regular expressions allow for a fair amount of "fuzziness" in the matching such that minor changes to the content won't break them.

- You likely don't need to learn any new languages or tools (again, assuming you're already familiar with regular expressions and a programming language).

- Regular expressions are supported in almost all modern programming languages. Heck, even VBScript has a regular expression engine. It's also nice because the various regular expression implementations don't vary too significantly in their syntax.

Disadvantages:

- They can be complex for those that don't have a lot of experience with them. Learning regular expressions isn't like going from Perl to Java. It's more like going from Perl to XSLT, where you have to wrap your mind around a completely different way of viewing the problem.

- They're often confusing to analyze. Take a look through some of the regular expressions people have created to match something as simple as an email address and you'll see what I mean.

- If the content you're trying to match changes (e.g., they change the web page by adding a new "font" tag) you'll likely need to update your regular expressions to account for the change.

- The data discovery portion of the process (traversing various web pages to get to the page containing the data you want) will still need to be handled, and can get fairly complex if you need to deal with cookies and such.

When to use this approach: You'll most likely use straight regular expressions in screen-scraping when you have a small job you want to get done quickly. Especially if you already know regular expressions, there's no sense in getting into other tools if all you need to do is pull some news headlines off of a site.

Ontologies and artificial intelligence

Advantages:

- You create it once and it can more or less extract the data from any page within the content domain you're targeting.

- The data model is generally built in. For example, if you're extracting data about cars from web sites the extraction engine already knows what the make, model, and price are, so it can easily map them to existing data structures (e.g., insert the data into the correct locations in your database).

- There is relatively little long-term maintenance required. As web sites change you likely will need to do very little to your extraction engine in order to account for the changes.

Disadvantages:

- It's relatively complex to create and work with such an engine. The level of expertise required to even understand an extraction engine that uses artificial intelligence and ontologies is much higher than what is required to deal with regular expressions.

- These types of engines are expensive to build. There are commercial offerings that will give you the basis for doing this type of data extraction, but you still need to configure them to work with the specific content domain you're targeting.

- You still have to deal with the data discovery portion of the process, which may not fit as well with this approach (meaning you may have to create an entirely separate engine to handle data discovery). Data discovery is the process of crawling web sites such that you arrive at the pages where you want to extract data.

When to use this approach: Typically you'll only get into ontologies and artificial intelligence when you're planning on extracting information from a very large number of sources. It also makes sense to do this when the data you're trying to extract is in a very unstructured format (e.g., newspaper classified ads). In cases where the data is very structured (meaning there are clear labels identifying the various data fields), it may make more sense to go with regular expressions or a screen-scraping application.

Screen-scraping software

Advantages:

- Abstracts most of the complicated stuff away. You can do some pretty sophisticated things in most screen-scraping applications without knowing anything about regular expressions, HTTP, or cookies.

- Dramatically reduces the amount of time required to set up a site to be scraped. Once you learn a particular screen-scraping application the amount of time it requires to scrape sites vs. other methods is significantly lowered.

- Support from a commercial company. If you run into trouble while using a commercial screen-scraping application, chances are there are support forums and help lines where you can get assistance.

Disadvantages:

- The learning curve. Each screen-scraping application has its own way of going about things. This may imply learning a new scripting language in addition to familiarizing yourself with how the core application works.

- A potential cost. Most ready-to-go screen-scraping applications are commercial, so you'll likely be paying in dollars as well as time for this solution.

- A proprietary approach. Any time you use a proprietary application to solve a computing problem (and proprietary is obviously a matter of degree) you're locking yourself into using that approach. This may or may not be a big deal, but you should at least consider how well the application you're using will integrate with other software applications you currently have. For example, once the screen-scraping application has extracted the data how easy is it for you to get to that data from your own code?

When to use this approach: Screen-scraping applications vary widely in their ease-of-use, price, and suitability to tackle a broad range of scenarios. Chances are, though, that if you don't mind paying a bit, you can save yourself a significant amount of time by using one. If you're doing a quick scrape of a single page you can use just about any language with regular expressions. If you want to extract data from hundreds of web sites that are all formatted differently you're probably better off investing in a complex system that uses ontologies and/or artificial intelligence. For just about everything else, though, you may want to consider investing in an application specifically designed for screen-scraping.

As an aside, I thought I should also mention a recent project we've been involved with that has actually required a hybrid approach of two of the aforementioned methods. We're currently working on a project that deals with extracting newspaper classified ads. The data in classifieds is about as unstructured as you can get. For example, in a real estate ad the term "number of bedrooms" can be written about 25 different ways. The data extraction portion of the process is one that lends itself well to an ontologies-based approach, which is what we've done. However, we still had to handle the data discovery portion. We decided to use screen-scraper for that, and it's handling it just great. The basic process is that screen-scraper traverses the various pages of the site, pulling out raw chunks of data that constitute the classified ads. These ads then get passed to code we've written that uses ontologies in order to extract out the individual pieces we're after. Once the data has been extracted we then insert it into a database.

Source: http://ezinearticles.com/?Three-Common-Methods-For-Web-Data-Extraction&id=165416

Monday, 1 August 2016

Tips for scraping business directories

Tips for scraping business directories

Are you looking to scrape business directories to generate leads?

Here are a few tips for scraping business directories.

Web scraping is not rocket science. But there are good and bad and worst ways of doing it.

Generating sales qualified leads is always a headache. The old school ways are to buy a list from sites like Data.com. But they are quite expensive.

Scraping business directories can help generate sales qualified leads. The following tips can help you scrape data from business directories efficiently.

1) Choose a good framework to write the web scrapers. This can help save a lot of time and trouble. Python Scrapy is our favourite, but there are other non-pythonic frameworks too.

2) The business directories might be having anti-scraping mechanisms. You have to use IP rotating services to do the scrape. Using IP rotating services, crawl with multiple changing IP addresses which can cover your tracks.

3) Some sites really don’t want you to scrape and they will block the bot. In these cases, you may need to disguise your web scraper as a human being. Browser automation tools like selenium can help you do this.

4) Web sites will update their data quite often. The scraper bot should be able to update the data according to the changes. This is a hard task and you need professional services to do that.

One of the easiest ways to generate leads is to scrape from business directories and use enrich them. We made Leadintel for lead research and enrichment.

Source: http://blog.datahut.co/tips-for-scraping-business-directories/

Tuesday, 12 July 2016

Python 3 web-scraping examples with public data

Someone on the NICAR-L listserv asked for advice on the best Python libraries for web scraping. My advice below includes what I did for last spring’s Computational Journalism class, specifically, the Search-Script-Scrape project, which involved 101-web-scraping exercises in Python.

Best Python libraries for web scraping

For the remainder of this post, I assume you’re using Python 3.x, though the code examples will be virtually the same for 2.x. For my class last year, I had everyone install the Anaconda Python distribution, which comes with all the libraries needed to complete the Search-Script-Scrape exercises, including the ones mentioned specifically below:
The best package for general web requests, such as downloading a file or submitting a POST request to a form, is the simply-named requests library (“HTTP for Humans”).

Here’s an overly verbose example:

import requests
base_url = 'http://maps.googleapis.com/maps/api/geocode/json'
my_params = {'address': '100 Broadway, New York, NY, U.S.A',
             'language': 'ca'}
response = requests.get(base_url, params = my_params)
results = response.json()['results']
x_geo = results[0]['geometry']['location']
print(x_geo['lng'], x_geo['lat'])
# -74.01110299999999 40.7079445

For the parsing of HTML and XML, Beautiful Soup 4 seems to be the most frequently recommended. I never got around to using it because it was malfunctioning on my particular installation of Anaconda on OS X.
But I’ve found lxml to be perfectly fine. I believe both lxml and bs4 have similar capabilities – you can even specify lxml to be the parser for bs4. I think bs4 might have a friendlier syntax, but again, I don’t know, as I’ve gotten by with lxml just fine:

import requests
from lxml import html
page = requests.get("http://www.example.com").text
doc = html.fromstring(page)
link = doc.cssselect("a")[0]
print(link.text_content())
# More information...
print(link.attrib['href'])
# http://www.iana.org/domains/example

The standard urllib package also has a lot of useful utilities – I frequently use the methods from urllib.parse. Python 2 also has urllib but the methods are arranged differently.

Here’s an example of using the urljoin method to resolve the relative links on the California state data for high school test scores. The use of os.path.basename is simply for saving the each spreadsheet to your local hard drive:

from os.path import basename
from urllib.parse import urljoin
from lxml import html
import requests
base_url = 'http://www.cde.ca.gov/ds/sp/ai/'
page = requests.get(base_url).text
doc = html.fromstring(page)
hrefs = [a.attrib['href'] for a in doc.cssselect('a')]
xls_hrefs = [href for href in hrefs if 'xls' in href]
for href in xls_hrefs:
  print(href) # e.g. documents/sat02.xls
  url = urljoin(base_url, href)
  with open("/tmp/" + basename(url), 'wb') as f:
    print("Downloading", url)
    # Downloading http://www.cde.ca.gov/ds/sp/ai/documents/sat02.xls
    data = requests.get(url).content
    f.write(data)

And that’s about all you need for the majority of web-scraping work – at least the part that involves reading HTML and downloading files.
Examples of sites to scrape

The 101 scraping exercises didn’t go so great, as I didn’t give enough specifics about what the exact answers should be (e.g. round the numbers? Use complete sentences?) or even where the data files actually were – as it so happens, not everyone Googles things the same way I do. And I should’ve made them do it on a weekly basis, rather than waiting till the end of the quarter to try to cram them in before finals week.

The Github repo lists each exercise with the solution code, the relevant URL, and the number of lines in the solution code.

The exercises run the gamut of simple parsing of static HTML, to inspecting AJAX-heavy sites in which knowledge of the network panel is required to discover the JSON files to grab. In many of these exercises, the HTML-parsing is the trivial part – just a few lines to parse the HTML to dynamically find the URL for the zip or Excel file to download (via requests)…and then 40 to 50 lines of unzipping/reading/filtering to get the answer. That part is beyond what typically considered “web-scraping” and falls more into “data wrangling”.

I didn’t sort the exercises on the list by difficulty, and many of the solutions are not particulary great code. Sometimes I wrote the solution as if I were teaching it to a beginner. But other times I solved the problem using the style in the most randomly bizarre way relative to how I would normally solve it – hey, writing 100+ scrapers gets boring.

But here are a few representative exercises with some explanation:
1. Number of datasets currently listed on data.gov

I think data.gov actually has an API, but this script relies on finding the easiest tag to grab from the front page and extracting the text, i.e. the 186,569 from the text string, "186,569 datasets found". This is obviously not a very robust script, as it will break when data.gov is redesigned. But it serves as a quick and easy HTML-parsing example.
29. Number of days until Texas’s next scheduled execution

Texas’s death penalty site is probably one of the best places to practice web scraping, as the HTML is pretty straightforward on the main landing pages (there are several, for scheduled and past executions, and current inmate roster), which have enough interesting tabular data to collect. But you can make it more complex by traversing the links to collect inmate data, mugshots, and final words. This script just finds the first person on the scheduled list and does some math to print the number of days until the execution (I probably made the datetime handling more convoluted than it needs to be in the provided solution)
3. The number of people who visited a U.S. government website using Internet Explorer 6.0 in the last 90 days

The analytics.usa.gov site is a great place to practice AJAX-data scraping. It’s a very simple and robust site, but either you are aware of AJAX and know how to use the network panel (and in this case, locate ie.json, or you will have no clue how to scrape even a single number on this webpage. I think the difference between static HTML and AJAX sites is one of the tougher things to teach novices. But they pretty much have to learn the difference given how many of today’s websites use both static and dynamically-rendered pages.
6. From 2010 to 2013, the change in median cost of health, dental, and vision coverage for California city employees

There’s actually no HTML parsing if you assume the URLs for the data files can be hard coded. So besides the nominal use of the requests library, this ends up being a data-wrangling exercise: download two specific zip files, unzip them, read the CSV files, filter the dictionaries, then do some math.
90. The currently serving U.S. congressmember with the most Twitter followers

Another example with no HTML parsing, but probably the most complicated example. You have to download and parse Sunlight Foundation’s CSV of Congressmember data to get all the Twitter usernames. Then authenticate with Twitter’s API, then perform mulitple batch lookups to get the data for all 500+ of the Congressional Twitter usernames. Then join the sorted result with the actual Congressmember identity. I probably shouldn’t have assigned this one.
HTML is not necessary

I included no-HTML exercises because there are plenty of data programming exercises that don’t have to deal with the specific nitty-gritty of the Web, such as understanding HTTP and/or HTML. It’s not just that a lot of public data has moved to JSON (e.g. the FEC API) – but that much of the best public data is found in bulk CSV and database files. These files can be programmatically fetched with simple usage of the requests library.

It’s not that parsing HTML isn’t a whole boatload of fun – and being able to do so is a useful skill if you want to build websites. But I believe novices have more than enough to learn from in sorting/filtering dictionaries and lists without worrying about learning how a website works.

Besides analytics.usa.gov, the data.usajobs.gov API, which lists federal job openings, is a great one to explore, because its data structure is simple and the site is robust. Here’s a Python exercise with the USAJobs API; and here’s one in Bash.

There’s also the Google Maps geocoding API, which can be hit up for a bit before you run into rate limits, and you get the bonus of teaching geocoding concepts. The NYTimes API requires creating an account, but you not only get good APIs for some political data, but for content data (i.e. articles, bestselling books) that is interesting fodder for journalism-related analysis.

But if you want to scrape HTML, then the Texas death penalty pages are the way to go, because of the simplicity of the HTML and the numerous ways you can traverse the pages and collect interesting data points. Besides the previously mentioned Texas Python scraping exercise, here’s one for Florida’s list of executions. And here’s a Bash exercise that scrapes data from Texas, Florida, and California and does a simple demographic analysis.

If you want more interesting public datasets – most of which require only a minimal of HTML-parsing to fetch – check out the list I talked about in last week’s info session on Stanford’s Computational Journalism Lab.

Source URL :  http://blog.danwin.com/examples-of-web-scraping-in-python-3-x-for-data-journalists/