Saturday, 11 July 2015

SFTW: Scraping data with Google Refine

For the first Something For The Weekend of 2012 I want to tackle a common problem when you're trying to scrape a collection of webpages: they have some sort of structure in their URL like this, where part of the URL refers to the name or code of an entity:

http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237521

http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237629

http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237823

In this instance, you can see that the URL is identical apart from a 7 digit code at the end: the ID of the school the data refers to.

There are a number of ways you could scrape this data. You could use Google Docs and the =importXML formula, but Google Docs will only let you use this 50 times on any one spreadsheet (you could copy the results and select Edit > Paste Special > Values Only and then use the formula a further 50 times if it’s not too many – here’s one I prepared earlier).

And you could use Scraperwiki to write a powerful scraper – but you need to understand enough coding to do so quickly (here’s a demo I prepared earlier).

A middle option is to use Google Refine, and here’s how you do it.

Assembling the ingredients

With the basic URL structure identified, we already have half of our ingredients. What we need next is a list of the ID codes that we're going to use to complete each URL.

An advanced search for “list seed number scottish schools filetype:xls” brings up a link to this spreadsheet (XLS) which gives us just that.

The spreadsheet will need editing: remove any rows you don't need, which will reduce the time the scraper takes to work through them. For example, if you're only interested in one local authority or one type of school, sort the spreadsheet so that the rows you want sit together, then delete everything above and below them.

Now to combine the ID codes with the base URL.

Bringing your data into Google Refine

Open Google Refine and create a new project with the edited spreadsheet containing the school IDs.

At the top of the school ID column click on the drop-down menu and select Edit column > Add column based on this column…

In the New column name box at the top call this ‘URL’.

In the Expression box type the following piece of GREL (Google Refine Expression Language):

"http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID="+value

(Type in the quotation marks yourself – if you’re copying them from a webpage you may have problems)

The ‘value’ bit means the value of each cell in the column you just selected. The plus sign adds it to the end of the URL in quotes.

In the Preview window you should see the results – you can even copy one of the resulting URLs and paste it into a browser to check that it works. (On one occasion Google Refine added .0 to the end of the ID number, ruining the URL. You can solve this by changing 'value' to value.substring(0,7) – this extracts the first 7 characters of the ID number, omitting the '.0'.) UPDATE: in the comments Thad suggests "perhaps, upon import of your spreadsheet of IDs, you forgot to uncheck the importer option to Parse as numbers?"

Click OK if you’re happy, and you should have a new column with a URL for each school ID.

Grabbing the HTML for each page

Now click on the top of this new URL column and select Edit column > Add column by fetching URLs…

In the New column name box at the top call this ‘HTML’.

All you need in the Expression window is ‘value’, so leave that as it is.

Click OK.

Google Refine will now go to each of those URLs and fetch the HTML contents. As we have a couple of thousand rows here, this will take a long time – hours, depending on the speed of your computer and internet connection (it may not work at all if either isn't very fast). So leave it running and come back to it later.
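
If Refine struggles with a list this long, the same job can be scripted. Here is a minimal Python sketch of the idea – it assumes the requests library is installed and that schoolids.csv is a hypothetical one-column file of ID codes; the base URL is the one built above:

import csv
import time
import requests

BASE = "http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID="

with open("schoolids.csv") as f:                 # hypothetical one-column file of ID codes
    ids = [row[0] for row in csv.reader(f) if row]

pages = {}
for school_id in ids:
    url = BASE + school_id                       # same concatenation as the GREL expression
    pages[school_id] = requests.get(url).text    # store the raw HTML, keyed by school ID
    time.sleep(1)                                # be polite: pause between requests

Each page's HTML ends up keyed by school ID, ready for the same parsing step described below.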

Extracting data from the raw HTML with parseHTML

When it's finished you'll have another column where each cell is a bunch of HTML. You'll need to create a new column to extract what you need from that, and you'll also need a few more GREL expressions, explained below.

First you need to identify what data you want, and where it is in the HTML. To find it, open one of the webpages containing the data, view its source HTML (right-click and select View Source), and search for a key phrase or figure that you want to extract. Around that data you want to find an HTML tag like <table class="destinations"> or <div id="statistics">. Keep that window open while you tweak the expression we come onto below…

Back in Google Refine, at the top of the HTML column click on the drop-down menu and select Edit column > Add column based on this column…

In the New column name box at the top give it a name describing the data you’re going to pull out.

In the Expression box type the following piece of GREL (Google Refine Expression Language):

value.parseHtml().select("table.destinations")[0].select("tr").toString()

(Again, type the quotation marks yourself rather than copying them from here or you may have problems)

I’ll break down what this is doing:

value.parseHtml()

parse the HTML in each cell (value)

.select(“table.destinations”)

find a table with a class (.) of "destinations" (in the source HTML this reads <table class="destinations">). If it was <div id="statistics"> then you would write .select("div#statistics") – the hash sign representing an 'id' and the full stop representing a 'class'.

[0]

This zero in square brackets tells Refine to only grab the first table – a number 1 would indicate the second, and so on. This is because numbering (“indexing”) generally begins with zero in programming.

.select(“tr”)

Now, within that table, find anything within the tag <tr>

.toString()

And convert the results into a string of text.

The results of that expression in the Preview window should look something like this:

<tr> <th></th> <th>Abbotswell School</th> <th>Aberdeen City</th> <th>Scotland</th> </tr> <tr> <th>Percentage of pupils</th> <td>25.5%</td> <td>16.3%</td> <td>22.6%</td> </tr>

This is still HTML, but a much smaller and manageable chunk. You could, if you chose, now export it as a spreadsheet file and use various techniques to get rid of the tags (Find and Replace, for example) and split the data into separate columns (the =SPLIT formula, for example).

Or you could further tweak your GREL code in Refine to drill further into your data, like so:

value.parseHtml().select("table.destinations")[0].select("td")[0].toString()

Which would give you this:

<td>25.5%</td>

Or you can add the .substring function to strip out the HTML like so (assuming that the data you want is always 5 characters long):

value.parseHtml().select("table.destinations")[0].select("td")[0].toString().substring(4,9)
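
For comparison, the same extraction can be done outside Refine with a few lines of Python. This is only a rough sketch: it assumes the lxml library is installed and re-uses the pages dictionary from the earlier fetching sketch.

from lxml import html

for school_id, page_html in pages.items():
    doc = html.fromstring(page_html)
    cells = doc.xpath('//table[@class="destinations"]//td')   # same idea as the GREL .select() chain
    if cells:
        print(school_id, cells[0].text_content())              # first <td>, e.g. "25.5%", tags already stripped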

When you’re happy, click OK and you should have a new column for that data. You can repeat this for every piece of data you want to extract into a new column.

Then click Export in the upper right corner and save as a CSV or Excel file.

Source: http://onlinejournalismblog.com/2012/01/13/sftw-scraping-data-with-google-refine/

Saturday, 27 June 2015

Data Scraping - Hand Scraped Hardwood Flooring Gives Your Home That Exclusive Look

Today hand scraped hardwood flooring is becoming extremely popular in the more opulent homes as well as in some commercial properties. Although this type of flooring has only recently become fashionable it has been around for many centuries.

Certainly, before the invention of modern sanding techniques, all floors were hand scraped at the location where they were to be installed to ensure that the floor would be flat and even. Today, however, this method is used instead to provide texture, richness and a unique look and feel to the flooring.

Although manufacturers have produced machines which can provide a scraped look to their flooring it looks cheap compared to the real thing. Unfortunately the main problem with using a machine to scrape the flooring is that it provides a uniform look to the pattern of the wood. Because of this it lacks the natural feel that you would see with a floor which has been scraped by hand.

When done by hand, scraping creates a truly unique look for the floor. However, the actual look and feel of each floor will vary, as it depends on the skills of the person carrying out the work. If there is no control in place whilst the work is being carried out, this can result in a disastrous look to the finished product.

Many manufacturers who actually provide hand scraped hardwood flooring will either just dent, scoop or rough the floor up. But others will use sanding techniques in order to create a worn and uneven look to the flooring. The more professional teams will scrape the entire surface of the wood in order to create the unique hand made look for their customers.

Many companies will allow their customers to choose what type of scraping takes place on their wood. They can choose between light, medium and heavy. The companies who are really good at hand scraping will be able to give the hardwood floor a reclaimed look by including wormholes, splits and other naturally-occurring features within the wood.

If you do decide to choose hand scraped hardwood flooring you will need to factor the associated costs into your budget. Unfortunately this type of flooring does not come cheap and you can find yourself paying upwards of $15 per sq ft. But once it is installed it will give a room a unique, warm and rich feel, and it is certainly going to wow your friends and family when they see it for the first time.

Source: http://ezinearticles.com/?Hand-Scraped-Hardwood-Flooring-Gives-Your-Home-That-Exclusive-Look&id=572577

Monday, 22 June 2015

Making data on the web useful: scraping

Introduction

Many times data is not easily accessible – although it does exist. As much as we wish everything was available in CSV or the format of our choice, most data is published in different forms on the web. What if you want to use the data to combine it with other datasets and explore it independently?

Scraping to the rescue!

Scraping describes the method of extracting data hidden in documents – such as web pages and PDFs – and making it usable for further processing. It is among the most useful skills if you set out to investigate data – and most of the time it's not especially challenging. For the simplest forms of scraping you don't even need to know how to write code.

This example relies heavily on Google Chrome for the first part. Some things work well with other browsers; however, we will be using one specific browser extension only available on Chrome. If you can't install Chrome, don't worry – the principles remain similar.

Code-free Scraping in 5 minutes using Google Spreadsheets & Google Chrome

Knowing the structure of a website is the first step towards extracting and using the data. Let’s get our data into a spreadsheet – so we can use it further. An easy way to do this is provided by a special formula in Google Spreadsheets.

Save yourselves hours of time in copy-paste agony with the ImportHTML command in Google Spreadsheets. It really is magic!

Recipes

In order to complete the next challenge, take a look in the Handbook at one of the following recipes:

    Extracting data from HTML tables.

    Scraping using the Scraper Extension for Chrome

Both methods are useful for:

    Extracting individual lists or tables from single webpages

The latter can do slightly more complex tasks, such as extracting nested information. Take a look at the recipe for more details.

Neither will work for:

    Extracting data spread across multiple webpages

Challenge

Task: Find a website with a table and scrape the information from it. Share your result on datahub.io (make sure to tag your dataset with schoolofdata.org)

Tip

Once you’ve got your table into the spreadsheet, you may want to move it around, or put it in another sheet. Right click the top left cell and select “paste special” – “paste values only”.

Scraping more than one webpage: Scraperwiki

Note: Before proceeding into full scraping mode, it’s helpful to understand the flesh and bones of what makes up a webpage. Read the Introduction to HTML recipe in the handbook.

Until now we’ve only scraped data from a single webpage. What if there are more? Or you want to scrape complex databases? You’ll need to learn how to program – at least a bit.

It's beyond the scope of this course to teach you how to scrape; our aim here is to help you understand whether it is worth investing your time to learn, and to point you at some useful resources to help you on your way!

Structure of a scraper

Scrapers comprise three core parts (sketched in code after this list):

1.    A queue of pages to scrape
2.    An area for structured data to be stored, such as a database
3.    A downloader and parser that adds URLs to the queue and/or structured information to the database.
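
A minimal sketch of those three parts in Python – the start URL, the 50-page cap and the in-memory list are placeholder assumptions, not how ScraperWiki itself works:

from collections import deque

import requests
from lxml import html

queue = deque(["http://example.com/page1"])   # 1. a queue of pages to scrape (placeholder start URL)
results = []                                  # 2. storage for structured data (a real scraper would use a database)
seen = set()

while queue and len(seen) < 50:               # 3. the downloader and parser (capped for the sketch)
    url = queue.popleft()
    if url in seen:
        continue
    seen.add(url)
    doc = html.fromstring(requests.get(url).text)
    results.append({"url": url, "title": doc.findtext(".//title")})
    for link in doc.xpath("//a/@href"):       # newly discovered URLs go back onto the queue
        if link.startswith("http"):
            queue.append(link)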

Fortunately for you there is a good website for programming scrapers: ScraperWiki.com

ScraperWiki has two main functions: you can write scrapers – which are optionally run regularly, with the data available to everyone visiting – or you can request that they write scrapers for you. The latter costs some money; however, it helps to contact the ScraperWiki community (Google Group) – someone might get excited about your project and help you!

If you are interested in writing scrapers with Scraperwiki, check out this sample scraper – scraping some data about Parliament. Click View source to see the details. Also check out the Scraperwiki documentation: https://scraperwiki.com/docs/python/

When should I make the investment to learn how to scrape?

A few reasons (non-exhaustive list!):

1.    If you regularly have to extract data where there are numerous tables in one page.

2.    If your information is spread across numerous pages.

3.    If you want to run the scraper regularly (e.g. if information is released every week or month).

4.    If you want things like email alerts if information on a particular webpage changes.

…And you don’t want to pay someone else to do it for you!

Summary:

In this course we've covered web scraping and how to extract data from websites. The main function of scraping is to convert data that is semi-structured into structured data and make it easily usable for further processing. While this is a relatively simple task with a bit of programming, for single webpages it is also feasible without any programming at all. We've introduced =importHTML and the Scraper extension for your scraping needs.

Further Reading

1.    Scraping for Journalism: A Guide for Collecting Data: ProPublica Guides

2.    Scraping for Journalists (ebook): Paul Bradshaw

3.    Scrape the Web: Strategies for programming websites that don’t expect it : Talk from PyCon

4.    An Introduction to Compassionate Screen Scraping: Will Larson

Any questions? Got stuck? Ask School of Data!

Source: http://schoolofdata.org/handbook/courses/scraping/

Saturday, 13 June 2015

Scraping Services - Assuring Scraping Success with Proxy Data Scraping

Have you ever heard of "Data Scraping?" Data Scraping is the process of collecting useful data that has been placed in the public domain of the internet (private areas too if conditions are met) and storing it in databases or spreadsheets for later use in various applications. Data Scraping technology is not new and many a successful businessman has made his fortune by taking advantage of data scraping technology.

Sometimes website owners may not derive much pleasure from automated harvesting of their data. Webmasters have learned to disallow web scrapers access to their websites by using tools or methods that block certain IP addresses from retrieving website content. Data scrapers are left with the choice to either target a different website, or to move the harvesting script from computer to computer, using a different IP address each time and extracting as much data as possible until all of the scraper's computers are eventually blocked.

Thankfully there is a modern solution to this problem. Proxy Data Scraping technology solves the problem by using proxy IP addresses. Every time your data scraping program executes an extraction from a website, the website thinks it is coming from a different IP address. To the website owner, proxy data scraping simply looks like a short period of increased traffic from all around the world. They have very limited and tedious ways of blocking such a script but more importantly -- most of the time, they simply won't know they are being scraped.
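
To make the idea concrete, here is a minimal Python sketch of routing each request through a different proxy. The proxy addresses are hypothetical placeholders and the requests library is assumed – this illustrates the rotation principle only, not a ready-made scraping tool:

import itertools
import requests

# Hypothetical proxy addresses you are entitled to use – placeholders only.
proxies = ["http://proxy1.example.com:8080",
           "http://proxy2.example.com:8080",
           "http://proxy3.example.com:8080"]

proxy_cycle = itertools.cycle(proxies)

def fetch(url):
    proxy = next(proxy_cycle)                  # a different exit point for each request
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30).text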

You may now be asking yourself, "Where can I get Proxy Data Scraping Technology for my project?" The "do-it-yourself" solution is, rather unfortunately, not simple at all. Setting up a proxy data scraping network takes a lot of time and requires that you own a bunch of IP addresses and suitable servers to be used as proxies, not to mention the IT guru you need to get everything configured properly. You could consider renting proxy servers from select hosting providers, but that option tends to be quite pricey, though arguably better than the alternative: dangerous and unreliable (but free) public proxy servers.

There are literally thousands of free proxy servers located around the globe that are simple enough to use. The trick however is finding them. Many sites list hundreds of servers, but locating one that is working, open, and supports the type of protocols you need can be a lesson in persistence, trial, and error. However if you do succeed in discovering a pool of working public proxies, there are still inherent dangers of using them. First off, you don't know who the server belongs to or what activities are going on elsewhere on the server. Sending sensitive requests or data through a public proxy is a bad idea. It is fairly easy for a proxy server to capture any information you send through it or that it sends back to you. If you choose the public proxy method, make sure you never send any transaction through that might compromise you or anyone else in case disreputable people are made aware of the data.

A less risky scenario for proxy data scraping is to rent a rotating proxy connection that cycles through a large number of private IP addresses. Several companies offer this and claim to delete all web traffic logs, which allows you to anonymously harvest the web with minimal threat of reprisal. Such large-scale anonymous proxy solutions often carry a fairly hefty setup fee to get you going, though.

The other advantage is that companies which own such networks can often help you design and implement a custom proxy data scraping program instead of trying to work with a generic scraping bot. After performing a simple Google search, I quickly found one company (www.ScrapeGoat.com) that provides anonymous proxy server access for data scraping purposes. Or, according to their website, if you want to make your life even easier, ScrapeGoat can extract the data for you and deliver it in a variety of different formats, often before you could even finish configuring your off-the-shelf data scraping program.

Whichever path you choose for your proxy data scraping needs, don't let a few simple tricks thwart you from accessing all the wonderful information stored on the world wide web!

Source: http://ezinearticles.com/?Assuring-Scraping-Success-with-Proxy-Data-Scraping&id=248993

Friday, 12 June 2015

Web Scraping : Data Mining vs Screen-Scraping

Data mining isn't screen-scraping. I know that some people in the room may disagree with that statement, but they're actually two almost completely different concepts.

In a nutshell, you might state it this way: screen-scraping allows you to get information, where data mining allows you to analyze information. That's a pretty big simplification, so I'll elaborate a bit.

The term "screen-scraping" comes from the old mainframe terminal days where people worked on computers with green and black screens containing only text. Screen-scraping was used to extract characters from the screens so that they could be analyzed. Fast-forwarding to the web world of today, screen-scraping now most commonly refers to extracting information from web sites. That is, computer programs can "crawl" or "spider" through web sites, pulling out data. People often do this to build things like comparison shopping engines, archive web pages, or simply download text to a spreadsheet so that it can be filtered and analyzed.

Data mining, on the other hand, is defined by Wikipedia as the "practice of automatically searching large stores of data for patterns." In other words, you already have the data, and you're now analyzing it to learn useful things about it. Data mining often involves lots of complex algorithms based on statistical methods. It has nothing to do with how you got the data in the first place. In data mining you only care about analyzing what's already there.

The difficulty is that people who don't know the term "screen-scraping" will try Googling for anything that resembles it. We include a number of these terms on our web site to help such folks; for example, we created pages entitled Text Data Mining, Automated Data Collection, Web Site Data Extraction, and even Web Site Ripper (I suppose "scraping" is sort of like "ripping"). So it presents a bit of a problem – we don't necessarily want to perpetuate a misconception (i.e., screen-scraping = data mining), but we also have to use terminology that people will actually use.

Source: http://ezinearticles.com/?Data-Mining-vs-Screen-Scraping&id=146813

Wednesday, 3 June 2015

Getting Data from the Web Scraping

You’ve tried everything else, and you haven’t managed to get your hands on the data you want. You’ve found the data on the web, but, alas — no download options are available and copy-paste has failed you. Fear not, there may still be a way to get the data out. For example you can:

•    Get data from web-based APIs, such as interfaces provided by online databases and many modern web applications (including Twitter, Facebook and many others). This is a fantastic way to access government or commercial data, as well as data from social media sites.

•    Extract data from PDFs. This is very difficult, as PDF is a language for printers and does not retain much information on the structure of the data that is displayed within a document. Extracting information from PDFs is beyond the scope of this book, but there are some tools and tutorials that may help you do it.

•    Screen scrape web sites. During screen scraping, you’re extracting structured content from a normal web page with the help of a scraping utility or by writing a small piece of code. While this method is very powerful and can be used in many places, it requires a bit of understanding about how the web works.

With all those great technical options, don't forget the simple options: often it is worth spending some time searching for a file with machine-readable data or calling the institution that holds the data you want.

In this chapter we walk through a very basic example of scraping data from an HTML web page.

What is machine-readable data?

The goal for most of these methods is to get access to machine-readable data. Machine-readable data is created for processing by a computer, rather than for presentation to a human user. The structure of such data reflects the information it contains, not the way it will eventually be displayed. Examples of easily machine-readable formats include CSV, XML, JSON and Excel files, while formats like Word documents, HTML pages and PDF files are more concerned with the visual layout of the information. PDF, for example, is a language which talks directly to your printer; it is concerned with the position of lines and dots on a page, rather than with distinguishable characters.

Scraping web sites: what for?

Everyone has done this: you go to a web site, see an interesting table and try to copy it over to Excel so you can add some numbers up or store it for later. Yet this often does not really work, or the information you want is spread across a large number of web sites. Copying by hand can quickly become very tedious, so it makes sense to use a bit of code to do it.

The advantage of scraping is that you can do it with virtually any web site — from weather forecasts to government spending, even if that site does not have an API for raw data access.

What you can and cannot scrape

There are, of course, limits to what can be scraped. Some factors that make it harder to scrape a site include:

•    Badly formatted HTML code with little or no structural information e.g. older government websites.

•    Authentication systems that are supposed to prevent automatic access e.g. CAPTCHA codes and paywalls.

•    Session-based systems that use browser cookies to keep track of what the user has been doing.

•    A lack of complete item listings and possibilities for wildcard search.

•    Blocking of bulk access by the server administrators.

Another set of limitations are legal barriers: some countries recognize database rights, which may limit your right to re-use information that has been published online. Sometimes, you can choose to ignore the license and do it anyway — depending on your jurisdiction, you may have special rights as a journalist. Scraping freely available Government data should be fine, but you may wish to double check before you publish. Commercial organizations — and certain NGOs — react with less tolerance and may try to claim that you’re “sabotaging” their systems. Other information may infringe the privacy of individuals and thereby violate data privacy laws or professional ethics.

Tools that help you scrape

There are many programs that can be used to extract bulk information from a web site, including browser extensions and some web services. Depending on your browser, tools like Readability (which helps extract text from a page) or DownThemAll (which allows you to download many files at once) will help you automate some tedious tasks, while Chrome’s Scraper extension was explicitly built to extract tables from web sites. Developer extensions like FireBug (for Firefox, the same thing is already included in Chrome, Safari and IE) let you track exactly how a web site is structured and what communications happen between your browser and the server.

ScraperWiki is a web site that allows you to code scrapers in a number of different programming languages, including Python, Ruby and PHP. If you want to get started with scraping without the hassle of setting up a programming environment on your computer, this is the way to go. Other web services, such as Google Spreadsheets and Yahoo! Pipes also allow you to perform some extraction from other web sites.

How does a web scraper work?

Web scrapers are usually small pieces of code written in a programming language such as Python, Ruby or PHP. Choosing the right language is largely a question of which community you have access to: if there is someone in your newsroom or city already working with one of these languages, then it makes sense to adopt the same language.

While some of the click-and-point scraping tools mentioned before may be helpful to get started, the real complexity involved in scraping a web site is in addressing the right pages and the right elements within these pages to extract the desired information. These tasks aren’t about programming, but understanding the structure of the web site and database.

When displaying a web site, your browser will almost always make use of two technologies: HTTP, a way for it to communicate with the server and to request specific resources, such as documents, images or videos; and HTML, the language in which web sites are composed.

The anatomy of a web page

Any HTML page is structured as a hierarchy of boxes (which are defined by HTML “tags”). A large box will contain many smaller ones — for example a table that has many smaller divisions: rows and cells. There are many types of tags that perform different functions — some produce boxes, others tables, images or links. Tags can also have additional properties (e.g. they can be unique identifiers) and can belong to groups called ‘classes’, which makes it possible to target and capture individual elements within a document. Selecting the appropriate elements this way and extracting their content is the key to writing a scraper.

Viewing the elements in a web page: everything can be broken up into boxes within boxes.

To scrape web pages, you’ll need to learn a bit about the different types of elements that can be in an HTML document. For example, the <table> element wraps a whole table, which has <tr> (table row) elements for its rows, which in turn contain <td> (table data) for each cell. The most common element type you will encounter is <div>, which can basically mean any block of content. The easiest way to get a feel for these elements is by using the developer toolbar in your browser: they will allow you to hover over any part of a web page and see what the underlying code is.

Tags work like book ends, marking the start and the end of a unit. For example <em> signifies the start of an italicized or emphasized piece of text and </em> signifies the end of that section. Easy.

Figure 57. The International Atomic Energy Agency’s (IAEA) portal (news.iaea.org)

An example: scraping nuclear incidents with Python

NEWS is the International Atomic Energy Agency’s (IAEA) portal on world-wide radiation incidents (and a strong contender for membership in the Weird Title Club!). The web page lists incidents in a simple, blog-like site that can be easily scraped.

To start, create a new Python scraper on ScraperWiki and you will be presented with a text area that is mostly empty, except for some scaffolding code. In another browser window, open the IAEA site and open the developer toolbar in your browser. In the “Elements” view, try to find the HTML element for one of the news item titles. Your browser’s developer toolbar helps you connect elements on the web page with the underlying HTML code.

Investigating this page will reveal that the titles are <h4> elements within a <table>. Each event is a <tr> row, which also contains a description and a date. If we want to extract the titles of all events, we should find a way to select each row in the table sequentially, while fetching all the text within the title elements.

In order to turn this process into code, we need to make ourselves aware of all the steps involved. To get a feeling for the kind of steps required, let's play a simple game: in your ScraperWiki window, try to write up individual instructions for yourself, for each thing you are going to do while writing this scraper, like steps in a recipe (prefix each line with a hash sign to tell Python that this is not real computer code). For example:

# Look for all rows in the table

# Unicorn must not overflow on left side.

Try to be as precise as you can and don’t assume that the program knows anything about the page you’re attempting to scrape.

Once you’ve written down some pseudo-code, let’s compare this to the essential code for our first scraper:

import scraperwiki

from lxml import html

In this first section, we’re importing existing functionality from libraries — snippets of pre-written code. scraperwiki will give us the ability to download web sites, while lxml is a tool for the structured analysis of HTML documents. Good news: if you are writing a Python scraper with ScraperWiki, these two lines will always be the same.

url = "http://www-news.iaea.org/EventList.aspx"

doc_text = scraperwiki.scrape(url)

doc = html.fromstring(doc_text)

Next, the code makes a name (variable): url, and assigns the URL of the IAEA page as its value. This tells the scraper that this thing exists and we want to pay attention to it. Note that the URL itself is in quotes as it is not part of the program code but a string, a sequence of characters.

We then use the url variable as input to a function, scraperwiki.scrape. A function will provide some defined job — in this case it’ll download a web page. When it’s finished, it’ll assign its output to another variable, doc_text. doc_text will now hold the actual text of the website — not the visual form you see in your browser, but the source code, including all the tags. Since this form is not very easy to parse, we’ll use another function, html.fromstring, to generate a special representation where we can easily address elements, the so-called document object model (DOM).
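
The remaining lines of the scraper – the ones the following paragraphs refer to as lines 6 to 9 – are not reproduced in this excerpt. Based on the description below, a minimal reconstruction looks something like this:

for row in doc.cssselect("#tblEvents tr"):        # every row of the events table
    link_in_header = row.cssselect("h4 a").pop()  # the single title link inside the row's <h4>
    event_title = link_in_header.text             # the text of that link
    print(event_title)                            # write it to the ScraperWiki console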

In this final step, we use the DOM to find each row in our table and extract the event’s title from its header. Two new concepts are used: the for loop and element selection (.cssselect). The for loop essentially does what its name implies; it will traverse a list of items, assigning each a temporary alias (row in this case) and then run any indented instructions for each item.

The other new concept, element selection, is making use of a special language to find elements in the document. CSS selectors are normally used to add layout information to HTML elements, and they can be used to precisely pick an element out of a page. In this case (line 6) we're selecting #tblEvents tr, which will match each <tr> within the table element with the ID tblEvents (the hash simply signifies ID). Note that this will return a list of <tr> elements.

On the next line (line 7) we apply another selector to find any <a> (which is a hyperlink) within an <h4> (a title). Here we only want to look at a single element (there's just one title per row), so we have to pop it off the top of the list returned by our selector with the .pop() function.

Note that some elements in the DOM contain actual text, i.e. text that is not part of any markup language, which we can access using the [element].text syntax seen on line 8. Finally, in line 9, we're printing that text to the ScraperWiki console. If you hit run in your scraper, the smaller window should now start listing the event names from the IAEA web site.

You can now see a basic scraper operating: it downloads the web page, transforms it into the DOM form and then allows you to pick and extract certain content. Given this skeleton, you can try and solve some of the remaining problems using the ScraperWiki and Python documentation:

•    Can you find the address for the link in each event’s title?

•    Can you select the small box that contains the date and place by using its CSS class name and extract the element’s text?

•    ScraperWiki offers a small database to each scraper so you can store the results; copy the relevant example from their docs and adapt it so it will save the event titles, links and dates.

•    The event list has many pages; can you scrape multiple pages to get historic events as well?

As you're trying to solve these challenges, have a look around ScraperWiki: there are many useful examples in the existing scrapers – and quite often, the data is pretty exciting, too. This way, you don't need to start your scraper from scratch: just choose one that is similar, fork it and adapt it to your problem.

Source: http://datajournalismhandbook.org/1.0/en/getting_data_3.html

Friday, 29 May 2015

Web Scraping Services - A trending technique in data science!!!

Web scraping is trending as an emerging technique in data science and is becoming an integral part of many businesses – sometimes whole companies are formed around web scraping. Web scraping and the extraction of relevant data give businesses an insight into market trends, competition, potential customers, business performance and more. So what actually is web scraping, and where is it used? Let us explore web scraping, web data extraction, web mining/data mining and screen scraping in detail.

What is Web Scraping?

Web data scraping is a technique for extracting unstructured data from websites and transforming that data into structured data that can be stored and analysed in a database. Web scraping is also known as web data extraction, web harvesting or screen scraping.

Whatever you can see on the web can be extracted. Extracting targeted information from websites helps you to take effective decisions in your business.

Web scraping is a form of data mining. The overall goal of the web scraping process is to extract information from websites and transform it into an understandable structure such as a spreadsheet, database or CSV file. Data like item pricing, stock pricing, different reports, market pricing, product details and business leads can be gathered via web scraping efforts.

There are countless uses and potential scenarios, either business oriented or non-profit. Public institutions, companies and organizations, entrepreneurs, professionals etc. generate an enormous amount of information/data every day.

Uses of Web Scraping:

The following are some of the uses of web scraping:

•    Collecting data from real estate listings

•    Collecting retailer site data on a daily basis

•    Extracting offers and discounts from websites

•    Scraping job postings

•    Monitoring competitors' prices

•    Gathering leads from online business directories – directory scraping

•    Keyword research

•    Gathering targeted emails for email marketing – email scraping

•    And many more.

There are various techniques used for data gathering as listed below:

•    Human copy-and-paste – takes a lot of time when the data is huge

•    Programming a custom web scraper to suit your needs

•    Using web scraping software available in the market

Are you in search of a web data scraping expert or specialist? Then you are in the right place. We are a team of web scraping experts who can extract data from websites and structure the unstructured but useful data to uncover patterns, helping businesses make decisions that increase sales, widen the customer base and ultimately lead the business towards growth and success.

We have expertise in all the web scraping techniques used in providing web scraping services: scraping data from complex AJAX-enabled websites, bypassing CAPTCHAs, forming anonymous HTTP requests and more.

This kind of web scraping is legal since the data is publicly and freely available on the web. Smart WebTech can probably help you to achieve your scraping-based project goals. We would be more than happy to hear from you.

Source: http://webdata-scraping.com/web-scraping-trending-technique-in-data-science/

Tuesday, 26 May 2015

Endorsing web scraping

With more than 200 projects delivered, we stand firmly for new challenges every day. We have served more than 60 clients and won 86% repeat business, as customer delight is at our core. Successive Softwares was approached by a client with a very exclusive set of requirements. For their project they required customised data mining, in real time, to offer profitable information to their customers. The requirement was to scrape stock exchange data in real time so that end users could make easier marketing decisions. This was an ambitious task for us because it required processing a huge amount of data on a routine basis. We welcomed it as a chance to evolve and do something aside from classic web application development.

We started with mock-ups, pursuing the very first step of our IMPART Framework (Innovative Mock-up based Prototypes Analyzed to develop Reengineered Technology). Our team of experts thought through all the potential requirements as a flow and materialised it flawlessly in our mock-up. It was a strenuous task, but our excitement to do something others had not yet thought of filled the team with confidence and energy, and things began to roll out perfectly. We presented our mock-up and statistics to the client and, as we expected, the client chose us, impressed with the effort.

We started gathering requirements from the client side and began to formulate a design for the flow. The project required real-time monitoring of the stock exchange, including prices and market turnover, which then had to be presented as graphs. The front-end part was an easy deal; we were already adept at playing with data the way required. The intractable task was to get the data. We researched and found that it could be achieved either with an API or with web scraping, and we went with the latter because of the limitations of the API.

Web scraping is a compelling technique to get the required information straight out of the web page. A lack of documentation and not much forbearance forced us to make a slow start, but we kept all the requirements clear and knew that we were heading in the right direction. We divided the scraping process into bits of different but related tasks. First we needed to find the data to be captured; some of the problems we faced were pagination and the use of AJAX, but by examining the endpoints in the URL and the requests made when data is drawn, we surmounted these problems easily.

After targeting our data we focused on an HTML parser which could extract data from all the targets. Using PHP we developed a parser extracting all the information and saving it in the database in a structured way. Once the required data was at our end, we easily manipulated it into tables and charts, using Highstock for that. The entire client side was developed in PHP with the Zend framework, and we used MySQL 5.7 on the server side.

During the whole development cycle our QA team ensured we were delivering a quality product following all standards. We kept our client in the loop during the whole process, keeping them informed about every step. The client was also reassured as they watched their project grow from scratch into a fully fledged website. The process followed a strict timeline, releasing regular builds and implementing new improvements. We lived up to our client's expectations and delivered a product just as they had visualised it.

Source: http://www.successivesoftwares.com/endorsing-web-scraping/

Monday, 25 May 2015

What you need to know about web scraping: How to understand, identify, and sometimes stop

NB: This is a guest article by Rami Essaid, co-founder and CEO of Distil Networks.

Here’s the thing about web scraping in the travel industry: everyone knows it exists but few know the details.

Details like how does web scraping happen and how will I know? Is web scraping just part of doing business online, or can it be stopped? And lastly, if web scraping can be stopped, should it always be stopped?

These questions and the challenge of web scraping are relevant to every player in the travel industry. Travel suppliers, OTAs and meta search sites are all being scraped. We have the data to prove it; over 30% of travel industry website visitors are web scrapers.

Google Analytics, and most other analytics tools do not automatically remove web scraper traffic, also called “bot” traffic, from your reports – so how would you know this non-human and potentially harmful traffic exists? You have to look for it.

This is a good time to note that I am CEO of a bot-blocking company called Distil Networks, and we serve the travel industry as well as digital publishers and eCommerce sites to protect against web scraping and data theft – we’re on a mission to make the web more secure.

So I am admittedly biased, but will do my best to provide an educational account of what we’ve learned to be true about web scraping in travel – and why this is an issue every travel company should at the very least be knowledgeable about.

Overall, I see an alarming lack of awareness around the prevalence of web scraping and bots in travel, and I see confusion around what to do about it. As we talk this through I’ll explain what these “bots” are, how to find them and how to manage them to better protect and leverage your travel business.

What are bots, web scrapers and site indexers? Which are good and which are bad?

The jargon around web scraping is confusing – bots, web scrapers, data extractors, price scrapers, site indexers and more – what’s the difference? Allow me to quickly clarify.

–> Bots: This is a general term that refers to non-human traffic, or robot traffic that is computer generated. Bots are essentially a line of code or a program that is created to perform specific tasks on a large scale.  Bots can include web scrapers, site indexers and fraud bots. Bots can be good or bad.

–> Web Scraper: (web harvesting or web data extraction) is a computer software technique of extracting information from websites (source, Wikipedia). Web scrapers are usually bad.

If your travel website is being scraped, it is most likely that your competitors are collecting competitive intelligence on your prices. Some companies are even built to scrape and report on competitive prices as a service. This is difficult to prove, but based on a recent Distil Networks study, prices seem to be the main target. You can see more details of the study and infographic here.

One case study is Ryanair. They have been particularly unhappy about web scraping and won a lawsuit against a German company in 2008, incorporated Captcha in 2011 to stop new scrapers, and when Captcha wasn’t totally effective and Cheaptickets was still scraping, they took to the courts once again.

So Ryanair is doing what seems to be a consistent job of fending off web scrapers – at least after the scraping is performed. Unfortunately, the amount of time and energy that goes into identifying and stopping web scraping after the fact is very high, and usually this means the damage has been done.

This type of web scraping is bad because:

    Your competition is likely collecting your price data for competitive intelligence.

    Other travel companies are collecting your flights for resale without your consent.

    Identifying this type of web scraping requires a lot of time and energy, and stopping them generally requires a lot more.

Web scrapers are sometimes good

Sometimes a web scraper is a potential partner in disguise.

Meta search sites like Hipmunk sometimes get their start by scraping travel site data. Once they have enough data and enough traffic to be valuable they go to suppliers and OTAs with a partnership agreement. I'm naming Hipmunk because the company is one of the few to fess up to site scraping, and one of the few who claim to have quickly stopped scraping when asked.

I’d wager that Hipmunk and others use(d) web scraping because it’s easy, and getting a decision maker at a major travel supplier on the phone is not easy, and finding legitimate channels to acquire supplier data is most definitely not easy.

I’m not saying you should allow this type of site scraping – you shouldn’t. But you should acknowledge the opportunity and create a proper channel for data sharing. And when you send your cease and desist notices to tell scrapers to stop their dirty work, also consider including a note for potential partners and indicate proper channels to request data access.

–> Site Indexer: Good.

Google, Bing and other search sites send site indexer bots all over the web to scour and prioritize content. You want to ensure your strategy includes site indexer access. Bing has long indexed travel suppliers and provided inventory links directly in search results, and recently Google has followed suit.

–> Fraud Bot: Always bad.

Fraud bots look for vulnerabilities and take advantage of your systems; these are the pesky and expensive hackers that game websites by falsely filling in forms, clicking ads, and looking for other vulnerabilities on your site. Reviews sections are a common attack vector for these types of bots.

How to identify and block bad bots and web scrapers

Now that you know the difference between good and bad web scrapers and bots, how do you identify them and how do you stop the bad ones? The first thing to do is incorporate bot-identification into your website security program. There are a number of ways to do this.

In-house

When building an in house solution, it is important to understand that fighting off bots is an arms race. Every day web scraping technology evolves and new bots are written. To have an effective solution, you need a dynamic strategy that is always adapting.

When considering in-house solutions, here are a few common tactics:

    CAPTCHAs – Completely Automated Public Turing tests to tell Computers and Humans Apart (CAPTCHAs) exist to ensure that user input has not been generated by a computer. This has been the most common method deployed because it is simple to integrate and can be effective, at least at first. The problem is that CAPTCHAs can be beaten with a little work and, more importantly, they are a nuisance to end users that can lead to a loss of business.

    Rate Limiting – Advanced scraping utilities are very adept at mimicking normal browsing behaviour, but most hastily written scripts are not. Bots will follow links and make web requests at a much more frequent, and consistent, rate than normal human users. Limiting IPs that make several requests per second will catch basic bot behaviour (a short code sketch of this idea appears after the provider examples below).

    IP Blacklists - Subscribing to lists of known botnets & anonymous proxies and uploading them to your firewall access control list will give you a baseline of protection. A good number of scrapers employ botnets and Tor nodes to hide their true location and identity. Always maintain an active blacklist that contains the IP addresses of known scrapers and botnets as well as Tor nodes.

    Add-on Modules – Many companies already own hardware that offers some layer of security. Now, many of those hardware providers are also offering additional modules to try and combat bot attacks. As many companies move more of their services off premise, leveraging cloud hosting and CDN providers, the market share for this type of solution is shrinking.

    It is also important to note that these types of solutions are a good baseline but should not be expected to stop all bots. After all, this is not the core competency of the hardware you are buying, but a mere plugin.

Some example providers are:

    Imperva SecureSphere – Imperva offers Web Application Firewalls, or WAFs. This is an appliance that applies a set of rules to an HTTP connection. Generally, these rules cover common attacks such as Cross-site Scripting (XSS) and SQL Injection. By customizing the rules to your application, many attacks can be identified and blocked. The effort to perform this customization can be significant and needs to be maintained as the application is modified.

    F5 – ASM – F5 offers many modules on their BigIP load balancers, one of which is the ASM. This module adds WAF functionality directly into the load balancer. Additionally, F5 has added policy-based web application security protection.
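
To illustrate the rate-limiting tactic mentioned above, here is a minimal Python sketch. The one-second window and five-request budget are arbitrary assumptions; a production system would enforce this at the firewall or load balancer rather than in application code:

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 1.0          # look at the last second of traffic (assumed value)
MAX_REQUESTS = 5              # more than this per window looks like a bot (assumed value)

recent = defaultdict(deque)   # ip -> timestamps of recent requests

def allow_request(ip):
    """Return False if this IP has exceeded the per-second request budget."""
    now = time.time()
    q = recent[ip]
    q.append(now)
    while q and now - q[0] > WINDOW_SECONDS:   # drop timestamps outside the window
        q.popleft()
    return len(q) <= MAX_REQUESTS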

Software-as-a-service

There are website security software options that include, and sometimes specialize in web scraping protection. This type of solution, from my perspective, is the most effective path.

The SaaS model allows someone else to manage the problem for you and respond with more efficiency even as new threats evolve.  Again, I’m admittedly biased as I co-founded Distil Networks.

When shopping for a SaaS solution to protect against web scraping, you should consider some of the following factors:

•    Does the provider update new threats and rules in real time?

•    How does the solution block suspected non-human visitors?

•    Which types of proactive blocking techniques, such as code injections, does the provider deploy?

•    Which of the reactive techniques, such as rate limiting, are used?

•    Does the solution look at all of your traffic or a snapshot?

•    Can the solution block bots before they reach your infrastructure – and your data?

•    What kind of latency does this solution introduce?

I hope you now have a clearer understanding of web scraping and why it has become so prevalent in travel, and even more important, what you should do to protect and leverage these occurrences.

Source: http://www.tnooz.com/article/what-you-need-to-know-about-web-scraping-how-to-understand-identify-and-sometimes-stop/

Saturday, 23 May 2015

Roles of Data Mining in Predicting, Tracking, and Containing the Ebola Outbreak

One of the most diverse continents on earth, Africa astounds the world with its vast savannas and great deserts and with its ancient architecture and modern cities, but Africa also has its share of tragedies and woes.

First identified in 1976 near the Ebola River in the Democratic Republic of Congo, Ebola Hemorrhagic Fever, a deadly zoonotic disease caused by the Ebola virus, has been spreading in West Africa like a wildfire, engulfing everything in its way and creating widespread panic.

What has added insult to injury is the fact that the region has long endured the severe consequences of civil wars and social conflicts, and diseases like malaria, HIV/AIDS, yellow fever, cholera etc. have remained endemic to the region for a long time, causing tens of thousands of deaths every year.

Reportedly, Ebola has already killed at least 2,296 people, and there are about 3,685 confirmed cases of infection. The mortality rate has been swinging between 50% and 90%, depending on the quality of care and nutrition. According to the WHO, the disease is likely to infect as many as 20,000 people before it is finally brought under control.

Crisis of Data

When it comes to healthcare management, clinical data is one of the key components. The value of data becomes even more urgent in an emergency situation like that in West Africa. The more relevant data you have, the bigger the picture you can create for taking aggressive measures. To use Peter Drucker's words, "What gets measured gets managed."

Factual data is a precondition for the doctors and health science experts working in the field to measure and manage the situation. Data helps them to assess their successes or failures and reorient their actions. One of the important reasons why the fight against the Ebola outbreak is turning into a losing battle is the insufficiency of data. Recently, Scientific American magazine wrote:

Right now, there are not even enough beds for sick patients nor enough data coming in to help track cases. Surveillance and tracking of those who were possibly exposed to Ebola remain inadequate.

In Science magazine, Gretchen Vogel suggests that the death toll of Ebola patients could be much higher than it is currently estimated. She says, “Exactly how many unrecorded Ebola deaths have occurred will never be known. Health officials are keeping track of suspected and probable cases, many of which are people who died before they could be tested.” Greg Slabodkin voices similar concerns in Health Data Management and points at the need of an integrated global biosurveillance system.

The absence of reliable and actionable data has badly hampered the efforts of combatting Ebola and providing proper medical care to the victims. CDC Director Dr. Tom Frieden describes it as a “fog-of-war situation”.

Data Mining: Bots Were the First to Warn

On the flip side, however, the situation is not completely bleak and desperate. Even if Big Data technologies have fallen short in predicting, tracking, and containing the epidemic, mainly due to the lack of data from the ground, they have not entirely failed. Data scientists and healthcare experts the world over are making concerted efforts to understand, track, and defeat the Ebola virus – some on the ground and some in their labs.

The increasing collaboration among biomedical specialists, geneticists, virologists, and IT experts has definitely helped slow the transmission of the virulent disease dubbed “the plague of the modern day”. Médecins Sans Frontières and HealthMap.org are excellent examples in this regard.

    “By deploying bots and crawlers and by using advanced machine learning algorithms, the Boston-based global infectious disease surveillance system, HealthMap was able to predict and raise concerns about the spread of a mysterious hemorrhagic fever in West Africa nine days earlier than WHO did.”

Run by a team of 45 researchers, epidemiologists, and software developers at Boston Children’s Hospital, HealthMap mines data from search engine queries, social media platforms, health information sites, news reports and crowd-sourced information to track the transmission of the disease and provides an up-to-date timeline report with an interactive map, making it easier for the international health agencies to devise more effective action plans.
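As a rough illustration of the kind of keyword monitoring such a system performs (this is not HealthMap's actual code; the source URLs and keywords below are placeholder assumptions), a minimal Python sketch might count disease-related mentions across a set of news pages:

    import re
    import requests
    from collections import Counter

    # Hypothetical pages to monitor -- placeholders, not HealthMap's real sources.
    SOURCES = ["https://example.com/health-news", "https://example.com/outbreak-reports"]
    KEYWORDS = ["ebola", "hemorrhagic fever", "haemorrhagic fever"]

    def count_mentions(urls, keywords):
        """Count keyword mentions across pages -- a crude signal of disease-related chatter."""
        counts = Counter()
        for url in urls:
            try:
                text = requests.get(url, timeout=10).text.lower()
            except requests.RequestException:
                continue  # skip sources that are unreachable
            for kw in keywords:
                counts[kw] += len(re.findall(re.escape(kw), text))
        return counts

    if __name__ == "__main__":
        print(count_mentions(SOURCES, KEYWORDS))

A real system would add geolocation, deduplication, and machine-learned relevance scoring on top of raw counts like these.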

HealthMap serves as a good example of how crucial Big Data and data mining technologies could be for handling a healthcare emergency with fact-based and data-driven decisions.

Ebola Data

In their letter to The Lancet, research scientist Rashid Ansumana and his colleagues working on Ebola in Sierra Leone highlighted the need to develop epidemic surveillance systems “by adopting new data-sharing technologies.” They wrote, “Emerging technologies can help early warning systems, outbreak response, and communication between health-care providers, wildlife and veterinary professionals, local and national health authorities, and international health agencies.”

Data-Driven Initiatives to Control the Outbreak

The era of the systematic use of data for making better epidemiological predictions and finding effective healthcare solutions began with Google Flu Trends in 2008, and the rapidly developing tools, technologies, and practices in Big Data have since increased the role of data in healthcare management.

There are a number of data-driven undertakings in progress that have helped counter the raging spread of Ebola. Brockmann Lab, run by Professor Dirk Brockmann and his colleagues, for example, has created a computer model for studying correlations and probabilities in the explosion of new cases of infection.

Figure: World air-traffic transportation and relative import risk (Source: Brockmann Lab)

By applying computational and statistical models, they predict which areas, cities, or regions in the world are at risk of becoming the next Ebola hotspots. Similarly, Alessandro Vespignani – a network scientist, statistical physicist, and Northeastern professor – has been using human mobility network data to track cases of Ebola infection and dissemination.
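A heavily simplified illustration of the underlying idea (not the Brockmann Lab model itself; the airports and passenger counts below are invented for the example) is to treat a destination's share of the outbreak region's outbound air traffic as its relative import risk:

    # Illustrative simplification: if an outbreak starts at one airport, approximate the
    # relative import risk at each destination as that destination's share of the
    # outbreak airport's outbound passenger traffic. The numbers are made up.
    outbound_traffic = {  # passengers per day from the outbreak airport
        "CDG": 1200, "LHR": 900, "JFK": 600, "ACC": 300,
    }

    total = sum(outbound_traffic.values())
    relative_import_risk = {dest: n / total for dest, n in outbound_traffic.items()}

    for dest, risk in sorted(relative_import_risk.items(), key=lambda kv: -kv[1]):
        print(f"{dest}: {risk:.2%}")

The published models are far richer, layering epidemic dynamics on top of the full multi-hop air-traffic network, but the intuition is the same: where the traffic goes, the risk goes.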

The Swedish NGO Flowminder Foundation has been aggregating, mining, and analyzing anonymized mobile phone location data and is developing national mobility estimates for West Africa to help the local and international agencies to combat the disease.

Meanwhile, innovations such as Epi Info VHF, a software tool providing case management, contact tracing, analysis, and reporting for Ebola and other hemorrhagic fever outbreaks, and the OpenStreetMap project, which supplies location information and spatial data for the affected areas, have further helped guide intervention initiatives.

However, with all the optimism about the growing role of Big Data and data mining, we also need to be mindful of their limitations. As Newsweek aptly puts it: “While no media-trawling bot could ever replace national and international health agencies, such tools may be starting to help fill in some of the most gaping holes in real-time knowledge.”

Source: http://www.grepsr.com/blog/data-mining-tracking-ebola-outbreak/

Wednesday, 20 May 2015

Hard-Scraped Hardwood Flooring: Restoration of History

Throughout history, hardwood flooring has undergone dramatic changes, from the meticulous hand-scraped, polished floors of the majestic plantations of the Deep South to modern-day technology providing maintenance-free wood flooring designed for comfort and appearance. The hand-scraped hardwood floors of the South conveyed charm, with an old rustic nature and character often associated with that era. Today, hand-scraped hardwood flooring is being revitalized and used in upscale homes and places of business to restore the old country charm that had once faded into oblivion.

As the name implies, hand-scraped flooring involves retexturing the top layer of the flooring material by various methods in an attempt to mimic the rustic appearance of the flooring of yesteryear. Depending on the degree of texture required, hand-scraping hardwood material is often accomplished by highly skilled craftsmen with specialized tools and years of experience perfecting the procedure. When properly done, hand-scraped hardwood floors add texture, richness, and a uniqueness not offered by any similar hardwood flooring product.

Rooted in history, these floors are available with finished or unfinished surfaces. The majority of individuals selecting hand-scraped hardwood flooring choose a prefinished floor to reduce the cost per square foot of installation and finishing labor, allowing budget guidelines to bend, not break. As expected, hand-scraped flooring is expensive: depending on the grade and finish selected, it can range from $15 to $40 per square foot and beyond for the material alone. Preparation of the material is labor intensive, adding dramatically to the overall cost per square foot. The recommended professional installation can, and often does, increase the cost per square foot as well, placing this method of hardwood flooring well out of reach of the average hardwood floor purchaser.

With numerous hand-scraped finishes available, each finish is designed to bring out a different appearance, making every floor a one-of-a-kind work of art. These finish selections include:

• Time worn aged, dark coloring stain application bringing out grain characteristics

• Wire brushed, providing a highlighted "grainy" effect with obvious rough texture

• Hand sculpted, smoother distressed uniform appearance

• French Bleed, staining of edges and side joints with a much darker stain to give a bleeding effect to the wood

• Hand Hewn or Rough Sawn, with visible and noticeable saw marks

Regardless of the selection made, scraped flooring cannot be compared with any other available flooring material for durability, strength, and visual appearance. Limited only by imagination and creativity, several wood species can be used to create unusual floor patterns, highlighting the main focal points of personal libraries and art collections.

The precise process used in the creation of scraped floors projects a custom look with deep color and subtle warm highlights. With natural light reflecting off this type of floor, its beauty and depth radiate in a fashion that fills the room with solitude and serenity, encompassing all who enter. Hand-scraped hardwood floors speak of the past: a time of dissent, a time of war and ambiguity towards other races, and the bloodshed endured so that all men could be treated as equals. More than exquisite flooring, hand-scraped hardwood flooring is the restoration of history.

Source: http://ezinearticles.com/?Hard-Scraped-Hardwood-Flooring:-Restoration-of-History&id=6333218

Sunday, 17 May 2015

Introducing ScrapeShield: Discover, Defend & Deter Content Scraping

If you're a publisher, whether an individual blogger or a major media outlet, you've undoubtedly experienced content scraping. Search the web for an article you've published or other original content you've created, and you may find it copied and republished on some other random website. Often the site will be full of ads. And, sometimes, it will even rank higher in search results than your original work.

While you may envision an army of individuals copying and pasting your content on their sites, the truth is content scraping is typically an automated process with bots that grab original content and then republish it without human intervention onto link farm sites. CloudFlare has blocked many of these bots automatically in the past, but we decided it was time to do something to more actively stop them.

Introducing ScrapeShield

ScrapeShield is an app created by the CloudFlare team. It incorporates several existing CloudFlare features like email obfuscation and hotlink protection that serve to protect from content scraping and adds a number of new features as well. Because we believe every publisher of original content should be able to understand and control how their work is used, we're providing ScrapeShield free for every CloudFlare user.

Detect, Defend & Deter

ScrapeShield has different elements to help you detect when your content is scraped, defend your site against content scrapers, and even deter content scrapers from targeting you in the first place. If you enable ScrapeShield, CloudFlare will automatically insert invisible tracking beacons in your content. When automated bots scrape your content, they pull the beacons along with them. CloudFlare detects these beacons when they ping from sites that aren't your own. You can access your ScrapeShield control panel to see where your content is being republished. Not only is this useful in showing scraping, but you can also see users who are reading your content through proxy services like Flipboard or Pulse.
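CloudFlare has not published how the beacons are implemented, but a minimal sketch of the general idea might look like the following; the domains, URL scheme, and helper names are hypothetical. The gist: embed an invisible, uniquely tokenised image in each page, then flag any beacon request whose Referer is not one of your own domains.

    # Illustrative only -- not ScrapeShield's actual internals.
    import uuid
    from urllib.parse import urlparse

    MY_DOMAINS = {"example.com", "www.example.com"}  # assumed publisher domains

    def make_beacon_tag():
        """Generate a unique token and the invisible beacon markup to embed in a page."""
        token = uuid.uuid4().hex
        tag = (f'<img src="https://beacons.example.net/b/{token}.gif" '
               f'width="1" height="1" style="display:none">')
        return token, tag

    def beacon_hit_looks_scraped(referer):
        """Called when the beacon URL is requested; True if the referring site isn't ours."""
        host = urlparse(referer or "").hostname or ""
        return host not in MY_DOMAINS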

The data from the content beacons is fed back into CloudFlare's protection system. As CloudFlare identifies content scraping bots, we automatically prevent them from accessing your site. Just like Project Honey Pot, the original inspiration for CloudFlare, used traps to detect when spammers were harvesting email addresses, CloudFlare now uses data from ScrapeShield to identify content scrapers and keep them off publishers' sites.

Maze

We didn't want to just stop scrapers from attacking sites on CloudFlare, we also wanted to tie up their resources so they couldn't harm the rest of the web. To do this, we created Maze. Maze routes known content scrapers who are visiting ScrapeShield-protected sites into a virtual labyrinth of gibberish and gobbledygook. We dynamically throttle the bandwidth and speed so that instead of the pages loading as fast as possible, the connection is held open and the scrapers' resources are tied up.
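Purely as an illustration of the concept rather than CloudFlare's implementation, a toy "maze" could be a tiny web app that serves deliberately slow pages of random text linking only to more random pages. Flask and the five-second delay below are assumptions made for the sketch.

    import random
    import string
    import time
    from flask import Flask

    app = Flask(__name__)

    def gibberish(n_words):
        """Produce n_words of random lowercase nonsense."""
        return " ".join(
            "".join(random.choices(string.ascii_lowercase, k=random.randint(3, 10)))
            for _ in range(n_words)
        )

    @app.route("/maze/<token>")
    def maze(token):
        time.sleep(5)  # hold the connection open to tie up the bot's resources
        links = "".join(
            f'<a href="/maze/{random.randint(0, 10**9)}">{gibberish(3)}</a><br>'
            for _ in range(10)
        )
        return f"<html><body><p>{gibberish(200)}</p>{links}</body></html>"

    if __name__ == "__main__":
        app.run()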

We use excess resources on the CloudFlare network to generate Maze, and it doesn't consume any of our publishers' resources or add any additional load to their sites. What's beautiful about the system is that the only way that content scrapers can be sure they're avoiding Maze is to avoid CloudFlare's IP addresses entirely. For any content scrapers who may be reading this, here's a helpful list of all of our IPs so you can make sure to stay away.

No Pinning

Finally, with the rise of sites like Pinterest, innocent content scraping may become even more prolific. While many sites welcome having their images pinned, we wanted to make it easy to opt out. ScrapeShield includes an option to add the no-pinning meta tag to your site to prevent your images from being pinned. As other similar services add opt-out mechanisms, expect that we'll include an easy way for you to use them right from the ScrapeShield interface.

The health of the web depends on publishers of original content getting credit for their creations. CloudFlare is committed to building a better web, and we're extremely excited about ScrapeShield as a new tool to help publishers do exactly that.

Source: https://blog.cloudflare.com/introducing-scrapeshield-discover-defend-dete/

Wednesday, 13 May 2015

Kimono Is A Smarter Web Scraper That Lets You “API-ify” The Web, No Code Required

A new Y Combinator-backed startup called Kimono wants to make it easier to access data from the unstructured web with a point-and-click tool that can extract information from webpages that don’t have an API available. And for non-developers, Kimono plans to eventually allow anyone to track data without needing to understand APIs at all.

This sort of smarter “web scraper” idea has been tried before, and has always struggled to find more than a niche audience. Previous attempts with similar services like Dapper or Needlebase, for example, folded. Yahoo Pipes still chugs along, but it’s fair to say that the service has long since stopped being a priority for its parent company.

But Kimono’s founders believe that the issue at hand is largely timing.

“Companies more and more are realizing there’s a lot of value in opening up some of their data sets via APIs to allow developers to build these ecosystems of interesting apps and visualizations that people will share and drive up awareness of the company,” says Kimono co-founder Pratap Ranade. (He also delves into this subject deeper in a Forbes piece here). But often, companies don’t know how to begin in terms of what data to open up, or how. Kimono could inform them.

Plus, adds Ranade, Kimono is materially different from earlier efforts like Dapper or Needlebase, because it’s outputting to APIs and is starting off by focusing on the developer user base, with an expansion to non-technical users planned for the future. (Meanwhile, older competitors were often the other way around).

The company itself is only a month old, and was built by former Columbia grad school companions Ranade and Ryan Rowe. Both left grad school to work elsewhere, with Rowe off to Frog Design and Ranade at McKinsey. But over the nearly half-dozen years they continued their career paths separately, the two stayed in touch and worked on various small projects together.

One of those was Airpapa.com, a website that told you which movies were showing on your flights. This, as it turned out, gave them the idea for Kimono. To get the data they needed for the site, they had to scrape data from several publicly available websites.

“The whole process of cleaning that [data] up, extracting it on a schedule…it was kind of a painful process,” explains Rowe. “We spent most of our time doing that, and very little time building the website itself,” he says. At the same time, while Rowe was at Frog, he realized that the company had a lot of non-technical designers who needed access to data to make interesting design decisions, but who weren’t equipped to go out and get the data for themselves.

With Kimono, the end goal is to simplify data extraction so that anyone can manage it. After signing up, you install a bookmarklet in your browser, which, when clicked, puts the website into a special state that allows you to point to the items you want to track. For example, if you were trying to track movie times, you might click on the movie titles and showtimes. Then Kimono’s learning algorithm will build a data model involving the items you’ve selected.

That data can be tracked in real time and extracted in a variety of ways, including to Excel as a .CSV file, to RSS in the form of email alerts, or for developers as a RESTful API that returns JSON. Kimono also offers “Kimonoblocks,” which lets you drop the data as an embed on a webpage, and it offers a simple mobile app builder, which lets you turn the data into a mobile web application.
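For developers, consuming such an endpoint is ordinary JSON-over-HTTP. The sketch below is hedged: the endpoint URL, API key parameter, and "results" response shape are assumptions made for illustration, not guaranteed details of Kimono's API.

    import requests

    API_URL = "https://www.kimonolabs.com/api/EXAMPLE_ID"  # placeholder endpoint
    params = {"apikey": "YOUR_API_KEY"}                    # placeholder key

    resp = requests.get(API_URL, params=params, timeout=10)
    resp.raise_for_status()
    data = resp.json()

    # Walk whatever collections the response contains and print each record.
    for collection, records in data.get("results", {}).items():
        for record in records:
            print(collection, record)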

For developer users, the company is currently working on an API editor, which would allow you to combine multiple APIs into one.

So far, the team says, they’ve been “very pleasantly surprised” by the number of sign-ups, which have reached ten thousand. And even though the service is only a month old, they’ve seen active users in the thousands.

Initially, they’ve found traction with hardware hackers who have done fun things like making an airhorn blow every time someone funds their Kickstarter campaign, for instance, as well as with those who have used Kimono for visualization purposes, or monitoring the exchange rates of various cryptocurrencies like Bitcoin and dogecoin. Others still are monitoring data that’s later spit back out as a Twitter bot.

Kimono APIs are now making over 100,000 calls every week, and usage is growing by over 50 percent per week. The company also put out an unofficial “Sochi Olympics API” to showcase what the platform can do.

The current business model is freemium based, with pricing that kicks in for higher-frequency usage at scale.

The Mountain View-based company is a team of just the two founders for now, and has initial investment from YC, YC VC and SV Angel.

Source: http://techcrunch.com/2014/02/18/kimono-is-a-smarter-web-scraper-that-lets-you-api-ify-the-web-no-code-required/

Sunday, 3 May 2015

Web Data Scraping - Scrape Business Data in no time

The Internet has evolved into one of the largest repositories of information for your business. You can design intelligent business processes to access a whole host of relevant information sources that will help you strategize, implement and deliver effective business objectives. Leveraging web scraping tools is one such methodology that many businesses have adopted. Let us take a look at some of the ways it helps you easily scrape data relevant to your business.

Scraping for Business Information

Web data scraping is a technique employed by many organizations. It involves using tools that help businesses extract unstructured data and convert it into usable business information. The focus of most scraping initiatives revolves around the organization’s need to glean the following information:

•    Competitor analysis to help them structure and strategize effectively

•    Price comparisons to price their products competitively

•    Customer feedback to enhance their product portfolio and provide customers with a better brand experience

•    Market dynamics to help them identify areas of opportunity and threat

Using Scraping Tools

The abundance of information available on the Internet that helps you build a productive business strategy can be easily extracted and leveraged to benefit your business. Tools have been designed with intuitive interfaces and intelligent algorithms to help further this end.

Website data scraping tools are built for compatibility with a wide variety of applications so that they can explore a huge range of information sources. These tools are fully automated and offer drag-and-drop functionality, ensuring users get the benefits of speed and convenience.

Data extraction tools are not only adept at extracting data, but are also well equipped to combine relevant statistics from several platforms such as YouTube, Twitter, and Google Analytics. This helps businesses analyse trends and plan strategies accordingly.

Challenges of the Data Scraping Process

Just as there is no dearth of data to be collected from the Web, there is also an abundance of web scraping tools to execute the data collection process. However, before you implement a tool you need to be sure it can actually collect the data you need. Some of the challenges most businesses face owing to a wrong choice of tool include the following:

•    Run-of-the-mill extraction tools are unable to scale up sufficiently in order to capture large volumes of data

•    Some tools are also unable to establish compatibility with most data sources and therefore do not provide a holistic data collection approach

•    Some tools cannot automatically detect updates made to a data source and therefore end up providing inaccurate data.

In light of all this, it is essential that you identify the right tool for your needs and select one built on up-to-date technology that helps you achieve the following:

•    Ensure that you are able to access the appropriate data that you want

•    Help you structure it in the format you want

•    Provide quick and easy access to all available data sources no matter how complex

•    Run accurately and reliably to help you churn out usable information.

Source: http://scraping-solutions.blogspot.in/2014_07_01_archive.html

Tuesday, 28 April 2015

Web Scraping – Effective Way of Improving Market Presence

Web scraping is a technique that is fast making its presence felt in the world of the internet through its sheer effectiveness. It is a technique that uses software to crawl the internet and gather up all the relevant and important information you need for your products.

The information gathered by web scraping can be used for various things such as data integration, web mashups, online price comparison and much more. Web scraping uses sophisticated software that crawls the internet and gathers up all related information for the entity you are looking for. The information is gathered in an automated, systematic, and very structured way, which allows for easy understanding of the gathered information. Though this is one of the best ways to extract data, there are quite a few things one must be aware of before getting into web scraping.

Being aware of the following things puts you in a better position not only to get the best deal, but also to negotiate properly.

•    For data mining, the first thing one should be very sure of is the kind of data wanted. One has to define properly what kind of data is needed and what its purpose will be. For instance, if you wish to take a closer look at your competitors, it would be wise to let the data scraping service provider know who your competitors are; this allows them to gather better information. Similarly, if you are looking to win new customers, contact data from existing players in the respective industry would be helpful.

•    One should also be aware of the structure in which they want the data. A simple data structure has the entity name in each row and the entity's properties in the cells of that row; however, one can also opt for a chart-based structure. Apart from the above, there is just one more thing to keep in mind while using data mining services: the frequency of extraction. At times a one-time data extraction is sufficient, whereas at other times periodic extractions or regular reports are required.

If you are aware of all the above points, then you are well placed to go ahead and take the help of website data scraping. Knowing these points lets you know exactly what to ask of your vendor and how to assess the resulting quote. One can make the most of data extraction services with the help of either web scraping or web crawling services.

Source: https://3idatascraping.wordpress.com/2014/01/07/web-scraping-effective-way-of-improving-market-presence/

Sunday, 26 April 2015

Scraping the Bottom of the Barrel - The Perils of Online Article Marketing

Many online article marketers so desperately wish to succeed that they want to dump corporate life and work for themselves out of their homes. They decide they are going to create an online money-making website. Therefore, they look around to see what everyone else is doing, watch the methods others use to attract online buyers, and then mimic their marketing, their strategies, and their business models.

Still, if you are copying what other, less ethical people are doing in online article marketing – those who are scraping the bottom of the barrel and using false advertising and misrepresentations – then all you are really doing is perpetuating distrust on the Internet. Therefore, you are hurting everyone, including people like me. You must realize that people like me don't appreciate that.

Let me give you a few examples of some of the things going on out there, things being done by people who are ethically challenged. Far too many people write articles and then, in their byline, send the Internet surfer or reader of the article to a website that has a squeeze page. The squeeze page has no real information on it; rather, it asks for their name and e-mail address.

If the would-be Internet surfer is unwise enough to type in their name and email address, they will be spammed by e-mail, receiving various hard-sell marketing pieces. Then, once the Internet surfer does put in their e-mail address, the website grants them access and takes them to the page with information about what is being sold, or the online marketing "make you a millionaire" scheme.

Generally, these are five-page sales letters, with tons of testimonials from people you've never heard of, who may not actually exist, and all sorts of unsubstantiated claims about how much money you will make if you give them $39.35 by way of PayPal for this limited offer, "Now!" And they will send you an e-book with a strategic plan for how you can duplicate what they are doing. The reality is that whatever they are doing is questionable to begin with.

If you are going to do online article marketing please don't scrape the bottom of the barrel, there's just too much competition down there from what I can see. Please consider all this.

Source: http://ezinearticles.com/?Scraping-the-Bottom-of-the-Barrel---The-Perils-of-Online-Article-Marketing&id=2710103

Wednesday, 22 April 2015

How to Properly Scrape Windows During The Cleaning Process

Removing ordinary dirt such as dust, fingerprints, and oil from windows seems simple enough. However, you may sometimes find stubborn caked-on dirt or debris on your windows that cannot be removed by standard window cleaning techniques such as scrubbing or using a squeegee. The best way to remove caked-on dirt from your windows is to scrape it off. Nonetheless, you have to be extra careful when scraping your windows, because they can easily be scratched and damaged. Here are a number of rules that you need to follow when scraping windows.

Rule No. 1: It is recommended that you use a professional window scraper to remove caked-on dirt and debris from your windows. This type of scraper is specially made for use on glass, and it comes with certain features that can prevent scratching and other kinds of damage.

Rule No. 2: It is important to inspect your window scraper before using it. Take a look at the blade of the scraper and make sure that it is not rusted. Also, it must not be bent or chipped off at the corners. If you are not certain whether the blade is in a good enough condition, you should just play it safe by using a new blade.

Rule No. 3: When you are working with a window scraper, always use forward plow-like scraping motions. Scrape forward and lift the scraper off the glass, and then scrape forward again. Try not to slide the scraper backwards, because you may trap debris under the blade when you do so. Consequently, the scraper may scratch the glass.

Rule No. 4: Be extra cautious when you are using a window scraper on tempered glass. Tempered glass may have raised imperfections, which make it more vulnerable to scratches. To find out if the window that you are scraping is made of tempered glass, you have to look for a label in one of its corners.

Window Scraping Procedures

Before you start scraping, you have to wet your window with soapy water first. Then, find out how the window scraper works by testing it in a corner. Scrape on the same spot three or four times in forward motion. If you find that the scraper is moving smoothly and not scratching the glass, you can continue to work on the rest of the window. On the other hand, if you feel as if the scraper is sliding on sandpaper, you have to stop scraping. This indicates that the glass may be flawed and have raised imperfections, and scraping will result in scratches.

After you have ascertained that it is safe to scrape your window, start working along the edges. It is best that you start scraping from the middle of an edge, moving towards the corners. Work in a one or two inch pattern, until all the edges of the glass are properly scraped. After that, scrape the rest of the window in a straight pattern of four or five inches, working from top to bottom. If you find that the window is beginning to dry while you are working, wet it with soapy water again.

Source: http://ezinearticles.com/?How-to-Properly-Scrape-Windows-During-The-Cleaning-Process&id=6592930

Sunday, 19 April 2015

What is HTML Scraping and how it works

There are many reasons why there may be a requirement to pull data or information from other sites, and usually the process begins after checking whether the site has an official API. Very few people are aware of the structured data that every website already exposes automatically: we are basically talking about pulling data right from the HTML, also referred to as HTML scraping. This is an awesome way of gleaning data and information from third-party websites.

Any webpage content that can be viewed can be scraped without any trouble. If a website provides any way for a visitor's browser to download content and render it in a structured manner, then that same content can be accessed programmatically. HTML scraping works remarkably well.
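As a concrete illustration (the URL and CSS selector below are placeholders, not a real target), a minimal HTML-scraping script in Python with requests and BeautifulSoup looks like this:

    import requests
    from bs4 import BeautifulSoup

    url = "https://example.com/listings"  # placeholder page
    resp = requests.get(url, headers={"User-Agent": "my-scraper/0.1"}, timeout=10)
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "html.parser")
    # Pull every linked heading out of the page -- adjust the selector to the real markup.
    for item in soup.select("h2 a"):
        print(item.get_text(strip=True), item.get("href"))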

Before indulging in HTML scraping, inspect the browser's network traffic. Site owners have a couple of tricks up their sleeve to thwart this kind of access, but the majority of them can be worked around.

Before moving on to how HTML scraping works, we must understand the reasons behind it. Why is scraping needed? Once you have a satisfactory answer to this question, you can start looking for RSS or API feeds or other traditional forms of structured data. It is also important to understand that, to most companies, the website itself matters more than the API.

The most important reason is that companies maintain the websites their many visitors see far more carefully than they safeguard structured data feeds. This was seen publicly with Twitter when it clamped down on its developer ecosystem. API feeds often change or move without any prior warning. Sometimes this is deliberate, but mostly such problems erupt because there is no authority or team in the organization that maintains or takes care of the structured data, so if a feed gets severely mangled or goes offline it is rarely noticed. If the website itself has issues or stops working, on the other hand, the ball is firmly in the owner's court and it gets dealt with without losing any time.

Rate limiting is another factor that needs thought, and on public websites it virtually doesn’t exist. Besides the occasional sign-up page or captcha, most business websites fail to build defenses against unwarranted automated access. A single website can often be scraped for four hours straight without anyone noticing. You are unlikely to be treated as a DDoS attack unless you make many concurrent requests; in the logs you will look like just an avid visitor or an enthusiast, and that only if anyone is looking.

Another factor in favour of HTML scraping is that you can easily access a website anonymously. A website administrator has only a few ways to track visitor behavior, which is beneficial if you want to gather data privately. APIs usually require registration to obtain a key, and that key has to be sent with every request. With simple, straightforward HTTP requests, the visitor stays anonymous apart from cookies and an IP address, both of which can be controlled or spoofed.

HTML scraping is universally available; there is no need to wait for a site to open up an API or to contact anyone in the organization. You simply spend some time browsing the website at a leisurely pace until you find the data you want, and then work out the basic patterns for accessing it.

Now you need to don the hat of a professional scraper and simply dive in. Initially, it may take some time to figure out how the data is structured and how it can be accessed, just as you would when reading an API. Unlike with APIs there is no documentation, so you need to be a little smarter about it and use some clever tricks.

Some of the most useful tricks are:

Data Fetching


The first thing required is data fetching. Begin by finding endpoints: the URLs that return the data you need. If you are already sure about the data you want and how it should be structured to match your requirements, you are after a particular subset of the site, and you can find it by browsing the site with its own navigation tools.

GET Parameter

Pay attention to the URLs and watch how they change as you click between sections and how those sections divide into subsections. Another option, before you start, is to go straight to the site's search functionality: type in a few terms and watch how the URL changes based on what you search for. You will probably see a GET parameter such as q= that changes with your search term. Remove the GET parameters that are not being used from the URL until only the ones needed to load the data are left. The query string always begins with a “?”.
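A short sketch of what this looks like in practice, with a hypothetical search endpoint and parameter names, using the requests library to build the query string:

    import requests

    # Hypothetical search endpoint: only the 'q' and 'page' parameters turn out to matter.
    base = "https://example.com/search"
    resp = requests.get(base, params={"q": "widgets", "page": 1}, timeout=10)

    print(resp.url)          # e.g. https://example.com/search?q=widgets&page=1
    print(resp.status_code)  # confirm the stripped-down URL still returns the data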

By now you should be coming across the data that you want to see and access, but sometimes there are pagination issues to deal with. Because of these, you may not be able to see the data in its entirety; many APIs likewise block single huge requests to avoid slamming the database. Often, clicking to the next page adds an offset parameter to the URL that controls which records are visible on the page. All these steps will help you succeed in HTML scraping.
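A hedged sketch of walking such an offset-based pagination scheme (the endpoint, selector, and page size are assumptions for illustration):

    import requests
    from bs4 import BeautifulSoup

    base = "https://example.com/results"  # placeholder endpoint
    offset, page_size = 0, 20             # assumed pagination scheme

    while True:
        resp = requests.get(base, params={"offset": offset}, timeout=10)
        soup = BeautifulSoup(resp.text, "html.parser")
        rows = soup.select("div.result")  # placeholder selector for one record
        if not rows:
            break                         # an empty page means we've run out of data
        for row in rows:
            print(row.get_text(strip=True))
        offset += page_size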

Source: https://www.promptcloud.com/blog/what-is-html-scraping-and-how-it-works/

Wednesday, 8 April 2015

Thoughts on scraping SERPs and APIs

Google says that scraping keyword rankings is against their policy from what I've read. Bummer. We compile a lot of reports, and manual finding and entry was a pain. Enter Moz! We still manually check and compare, but it's nice having that tool. I'm confused now, though, about best practices for getting SERPs in an automated way. Here are my questions:

  1.     Is it against policy to get SERPs via an automated method? If that is the case, isn't Moz breaking this policy with its awesome keyword tracker?
  2.     If it's not, and we wanted to grab that kind of data, how would we do it? Right now, Moz's API doesn't offer this data. I thought Raven Tools at one point offered this, but they don't now from what I've read. Are there any APIs out there from which we can grab this data and do what we want with it (let's say, build our own dashboard)?

Thanks for any clarification and input!

Source: http://moz.com/community/q/thoughts-on-scraping-serps-and-apis

Sunday, 5 April 2015

Some Traps to know and avoid in Web Scraping

In the present day and age, web scraping comes across as a handy tool in the right hands. In essence, web scraping means quickly crawling the web for specific information, using pre-written programs. Scraping efforts are designed to crawl and analyze the data of entire websites and save the parts that are needed. Many industries have successfully used web scraping to create massive banks of relevant, actionable data that they use on a daily basis to further their business interests and provide better service to customers. This is the age of Big Data, and web scraping is one of the ways in which businesses can tap into this huge data repository and come up with relevant information that aids them in every way.

Web scraping, however, does come with its own share of problems and roadblocks. With every passing day, a growing number of websites are trying to actively minimize the incidence of scraping and protect their own data to stay afloat amid today's immense competition. Several other complications can arise, along with traps that can slow you down during your web scraping pursuits. Knowing about these traps and how to avoid them can be of great help if you want to successfully accomplish your web scraping goals and get the amount of data that you require.

Complications in Web Scraping

Over time, various complications have arisen in the field of web scraping. Many websites have started to get paranoid about data duplication and data security problems and have begun to protect their data in many ways. Some websites do not agree with the moral and ethical implications of web scraping and do not want their content to be scraped. There are many places where website owners can set traps and roadblocks to slow down or stop web scraping activities. Major search engines also have systems in place to discourage scraping of search engine results. Last but not least, many websites and web services announce a blanket ban on web scraping in their terms and conditions, potentially leading to legal issues in the event of any scraping.

Here are some of the most common complications that you might face during your web scraping efforts, of which you should be particularly aware:

•    Some locations on the internet might discourage web scraping to prevent data duplication or data theft.

•    Many websites have in place a number of different traps to detect and ban web scraping tools and programs.

•    Certain websites make it clear in their terms and conditions that they consider web scraping an infringement of their privacy and might even consider legal redress.

•    In a number of locations, simple measures are implemented to prevent non-human traffic to websites, making it difficult for web scraping tools to go on collecting data at a fast pace.

To surmount these difficulties, you need a deeper and more insightful understanding of the way web scraping works and of the attitude of website owners towards web scraping efforts. Most major issues can be circumvented or quietly avoided if you maintain good working practices during your web scraping efforts and understand the mentality of the people whose sites you are scraping.

Common Problems

With automated scraping, you might face a number of common problems. The behavior of web scraping programs or spiders presents a certain picture to the target website. It then uses this behavior to distinguish between human users and web scraping spiders. Depending on that information, a website may or may not employ particular web scraping traps to stop your efforts. Some of the commonly employed traps are –

Crawling Pattern Checks – Some websites detect scraping activities by analyzing crawling patterns. Web scraping robots follow a distinct crawling pattern which incorporates repetitive tasks like visiting links and copying content. By carefully analyzing these patterns, websites can determine that they are being caused by a web scraping robot and not a human user, and can take preventive measures.

Honeypots – Some websites embed honeypots in their webpages to detect and block web scraping activities. These can be links that are not visible to human users, disguised in a certain way. Since your web crawler program does not operate the way a human user does, it may try to follow such a link and scrape information from it. As a result, the website can detect the scraping effort and block the source IP addresses.
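One partial defence, sketched below with BeautifulSoup, is to skip links hidden with inline styles. This is only a heuristic and will not catch every honeypot (for example, links hidden via external CSS classes).

    from bs4 import BeautifulSoup

    def visible_links(html):
        """Return hrefs of links a human could plausibly see, skipping likely honeypots."""
        soup = BeautifulSoup(html, "html.parser")
        links = []
        for a in soup.find_all("a", href=True):
            style = (a.get("style") or "").replace(" ", "").lower()
            if "display:none" in style or "visibility:hidden" in style:
                continue  # hidden inline style -- likely a trap
            if a.get("rel") and "nofollow" in a.get("rel"):
                continue  # optionally respect nofollow as a politeness measure
            links.append(a["href"])
        return links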

Policies – Some websites make it absolutely apparent in their terms and conditions that they are particularly averse to web scraping activities on their content. This can act as a deterrent and make you vulnerable against possible ethical and legal implications.

Infinite Loops – Your web scraping program can be tricked into visiting the same URL again and again by using certain URL building techniques.
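A simple safeguard against such loops is to normalize every URL and keep a set of pages already visited, as in this minimal sketch:

    from urllib.parse import urldefrag, urljoin

    visited = set()

    def should_visit(base_url, href):
        """Resolve a link against the current page and return it only if it's new."""
        url, _fragment = urldefrag(urljoin(base_url, href))  # resolve and drop #fragments
        if url in visited:
            return None   # already crawled (or queued) -- skip to avoid loops
        visited.add(url)
        return url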

These traps in web scraping can prove to be detrimental to your efforts and you need to find innovative and effective ways to surpass these problems. Learning some web crawler tips to avoid traps and judiciously using them is a great way of making sure that your web scraping requirements are met without any hassle.

What you can do

The first and foremost rule of thumb about web scraping is that you have to make your efforts as inconspicuous as possible. This way you will not arouse suspicion and negative behavior from your target websites. To this end, you need a well-designed web scraping program with a human touch. Such a program can operate in flexible ways so as to not alert website owners through the usual traffic criteria used to spot scraping tools.

Some of the measures that you can implement to ensure that you steer clear of common web scraping traps are –

•    The first thing that you need to do is to ascertain if a particular website that you are trying to scrape has any particular dislike towards web scraping tools. If you see any indication in their terms and conditions, tread cautiously and stop scraping their website if you receive any notification regarding their lack of approval. Being polite and honest can help you get away with a lot.

•    Try and minimize the load on every single website that you visit for scraping. Putting a high load on websites can alert them towards your intentions and often might cause them to develop a negative attitude. To decrease the overall load on a particular website, there are many techniques that you can employ.

•    Start by caching the pages that you have already crawled to ensure that you do not have to load them again.

•    Also store the URLs of crawled pages.

•    Take things slow and do not flood the website with multiple parallel requests that put a strain on their resources.

•    Handle your scraping in gentle phases and take only the content you require.

•    Your scraping spider should be able to diversify its actions, change its crawling pattern and present a polymorphic front to websites, so as not to cause an alarm and put them on the defensive.

•    Arrive at an optimum crawling speed, so as to not tax the resources and bandwidth of the target website. Use auto throttling mechanisms to optimize web traffic and put random breaks in between page requests, with the lowest possible number of concurrent requests that you can work with.

•    Use multiple IP addresses for your scraping efforts, or take advantage of proxy servers and VPN services. This will help to minimize the danger of getting trapped and blacklisted by a website.

•    Be prepared to understand and respect the express wishes and policies of a website regarding web scraping by taking a good look at its ‘robots.txt’ file. This file contains clear instructions on the exact pages you are allowed to crawl and the requisite intervals between page requests. It might also specify that you use a pre-determined user agent identification string that classifies you as a scraping bot. Adhering to these instructions minimizes the chance of getting on the bad side of website owners and risking bans. (A minimal robots.txt check is sketched after this list.)
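As promised above, here is a minimal robots.txt check using Python's standard urllib.robotparser; the bot name, URLs, and five-second fallback delay are assumptions for the sketch.

    import time
    import urllib.robotparser

    USER_AGENT = "my-polite-bot"  # assumed bot name
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    url = "https://example.com/some/page"
    if rp.can_fetch(USER_AGENT, url):
        delay = rp.crawl_delay(USER_AGENT) or 5  # fall back to a 5-second pause
        time.sleep(delay)
        # ... fetch and parse the page here ...
    else:
        print("robots.txt disallows this page; skipping")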

Use an advanced tool for web scraping which can store and check data, URLs and patterns. Whether your web scraping needs are confined to one domain or spread over many, you need to appreciate that many website owners do not take kindly to scraping. The trick here is to ensure that you maintain industry best practices while extracting data from websites. This prevents any incident of misunderstanding, and allows you a clear pathway to most of the data sources that you want to leverage for your requirements.

Hope this article helps in understanding the different traps and roadblocks that you might face during your web scraping endeavors, and in figuring out smart, sensible ways to work around them so that your experience remains smooth. This way, you can keep receiving the important information that you need through web scraping. Following these basic guidelines can help you avoid getting banned or blacklisted and stay in the good books of website owners, allowing you to continue your web scraping activities unencumbered.

Source: https://www.promptcloud.com/blog/some-traps-to-avoid-in-web-scraping/