Hotfrog.com Data Scraping: April 2015

Tuesday, 28 April 2015

Web Scraping – Effective Way of Improving Market Presence

Web scraping is a technique that is fast making its presence felt in the world of internet by its sheer weight of being effective. It is a technique that uses software to crawl through the internet and gather up all the relevant and important information that one would need for their products.

The information gathered by the web scraping can be used for various things such as data integration, web mashup, online comparison of price and much more. Web scraping uses sophisticated software that crawls through the internet and gathers up all related information for the entity that you are looking for. The information that is gathered up is an automated, systematic, and very structured way. This allows for easy understanding of the gathered information. Though this is one of the best ways for data extraction there are quite a few things that one must be aware of before getting into web scraping.

Being aware of the following things keep you at a better position not only leverage the best deal, but also to negotiate properly.

• For data mining the first thing that one should be very sure of is the kind of data they want. One has to define properly what kind of data they want and also what would be the purpose of the same. For an instance if you wish to get a closer look at your competitors, it would be a wise to let the data scraping service providers know who your competitors are. This would allow them to gather better information. Similarly if you are looking for getting new customers getting contact data from existing players in the respective industry would be helpful.

• One should also be aware of the structure in which they want the data. A simple data structure has the entity name in the row and the property of the entity is kept in the cells of the rows. However, one can also opt for data structure in chart. Apart from the above, there is just one more thing that one needs to keep in mind while using the data mining services; it is the number of data extraction. At times a onetime data extraction would be sufficient whereas at other times periodic extractions or general reports are required.

If you are aware of all the above points, then you are very much inline of going ahead and taking the help of scrape website data. Knowing the above points would allow you to know what exactly to ask from your vendor and likewise quote. One can make the most of the data extraction services with the help of either the web scraping or web crawling services.

Source: https://3idatascraping.wordpress.com/2014/01/07/web-scraping-effective-way-of-improving-market-presence/

Sunday, 26 April 2015

Scraping the Bottom of the Barrel - The Perils of Online Article Marketing

Many online article marketers so desperately wish to succeed, they want to dump corporate life and work for themselves out of their home. They decide they are going to create an online money making website. Therefore, they look around to see what everyone else is doing, and watch the methods others use to attract online buyers, and then they mimic their marketing, their strategies, and their business models.

Still, if you are copying what other people (less ethical people) are doing in online article marketing, those which are scraping the bottom of the barrel and using false advertising and misrepresentations, then all you are really doing is perpetuating distrust on the Internet. Therefore, you are hurting everyone, including people like me. You must realize that people like me don't appreciate that.

Let me give you a few examples of some of the things going on out there, thing that are being done by people who are ethically challenged. Far too many people write articles and then on their byline they send the Internet surfer or reader of the article to a website that has a squeeze page. The squeeze page has no real information on it, rather it asks for their name and e-mail address.

If the would-be Internet surfer is unwise enough to type in their name and email address they will be spammed by e-mail, receiving various hard-sell marketing pieces. Then, if the Internet Surfer does decide to put in their e-mail address, the website grants them access and then takes them to the page with information about what they are selling, or their online marketing "make you a millionaire" scheme.

Generally, these are five page sales letters, with tons of testimonials of people you've never heard of, and may not actually exist, and all sorts of unsubstantiated earnings claims of how much money you will make if you give them $39.35 by way of PayPal, for this limited offer "Now!" And they will send you an E-book with a strategic plan of how you can duplicate what they are doing. The reality is whatever they are doing is questionable to begin with.

If you are going to do online article marketing please don't scrape the bottom of the barrel, there's just too much competition down there from what I can see. Please consider all this.

Source: http://ezinearticles.com/?Scraping-the-Bottom-of-the-Barrel---The-Perils-of-Online-Article-Marketing&id=2710103

Wednesday, 22 April 2015

How to Properly Scrape Windows During The Cleaning Process

Removing ordinary dirt such as dust, fingerprints, and oil from windows seem simple enough. However, sometimes, you may find stubborn caked-on dirt or debris on your windows that cannot be removed by standard window cleaning techniques such as scrubbing or using a squeegee. The best way to remove caked-on dirt on your windows is to scrape it off. Nonetheless, you have to be extra careful when you are scraping your windows, because they can be easily scratched and damaged. Here are a number of rules that you need to follow when you are scraping windows.

Rule No. 1: It is recommended that you use a professional window scraper to remove caked-on dirt and debris from your windows. This type of scraper is specially made for use on glass, and it comes with certain features that can prevent scratching and other kinds of damage.

Rule No. 2: It is important to inspect your window scraper before using it. Take a look at the blade of the scraper and make sure that it is not rusted. Also, it must not be bent or chipped off at the corners. If you are not certain whether the blade is in a good enough condition, you should just play it safe by using a new blade.

Rule No. 3: When you are working with a window scraper, always use forward plow-like scraping motions. Scrape forward and lift the scraper off the glass, and then scrape forward again. Try not to slide the scraper backwards, because you may trap debris under the blade when you do so. Consequently, the scraper may scratch the glass.

Rule No. 4: Be extra cautious when you are using a window scraper on tempered glass. Tempered glass may have raised imperfections, which make it more vulnerable to scratches. To find out if the window that you are scraping is made of tempered glass, you have to look for a label in one of its corners.

Window Scraping Procedures

Before you start scraping, you have to wet your window with soapy water first. Then, find out how the window scraper works by testing it in a corner. Scrape on the same spot three or four times in forward motion. If you find that the scraper is moving smoothly and not scratching the glass, you can continue to work on the rest of the window. On the other hand, if you feel as if the scraper is sliding on sandpaper, you have to stop scraping. This indicates that the glass may be flawed and have raised imperfections, and scraping will result in scratches.

After you have ascertained that it is safe to scrape your window, start working along the edges. It is best that you start scraping from the middle of an edge, moving towards the corners. Work in a one or two inch pattern, until all the edges of the glass are properly scraped. After that, scrape the rest of the window in a straight pattern of four or five inches, working from top to bottom. If you find that the window is beginning to dry while you are working, wet it with soapy water again.

Source: http://ezinearticles.com/?How-to-Properly-Scrape-Windows-During-The-Cleaning-Process&id=6592930

Sunday, 19 April 2015

What is HTML Scraping and how it works

There are many reasons why there may be a requirement to pull data or information from other sites, and usually the process begins after checking whether the site has an official API. There are very few people who are aware about the presence of structured data that is supported by every website automatically. We are basically talking about pulling data right from the HTML, also referred to as HTML scraping. This is an awesome way of gleaning data and information from third party websites.

Any webpage content that can be viewed can be scraped without any trouble. If there is any way provided by the website to the browser of the visitor to download content and use the same in a highly structured manner, in that case, accessing of the content programmatically is possible. HTML scraping works in an amazing manner.

Before indulging in HTML scraping, one can inspect the browser for network traffic. Site owners have a couple of tricks up their sleeve to thwart this access, but majority of them can be worked around.

Before moving on to how HTML scraping works, we must understand the reasons behind the same. Why is scraping needed? Once you get a satisfactory answer to this question, you can start looking for RSS or API feeds or various other traditional structured data forms. It is significant to understand that when compared with APIs, websites are more significant.

The most important advantage of the same is the maintenance of their websites where a lot of visitors visit rather than safeguarding structured data feeds. With Tweeter, the same has been publicly seen when it clamps down on the developer ecosystem. Many times, API feeds change or move without any prior warning. Many times, it can also be a deliberate attempt, but mostly, such issues or problems erupt as there is no authority or an organization that maintains or takes care of the structured data. It is rarely noticed, if the same gets severely mangled or goes offline. In case the website has certain issues or the website no longer works, the problem is more in the form of a ball in your court requiring dealing with the same without losing any time. api-comic-image

Rate limiting is another factor that needs a lot of thinking and in case of public websites, it virtually doesn’t exist. Besides some occasional sign up pages or captchas, many business websites fail to create and built defenses against any unwarranted automated access. Many times, a single website can be scraped for four hours straight without anyone noticing. There are chances that you would not be viewed under DDOS attack unless concurrent requests are being made by you. You will be seen just as an avid visitor or an enthusiast in the logs, that too, in case anyone is looking.

Another factor in HTML scraping is that one can easily access any website anonymously. Behavior tracking can be done with a few ways by the administrator of the website and this turns out to be beneficial if you want to privately gather the data. Many times, registration is imperative with APIs in order to get key and with any request being sent, this key also needs to be sent. But, in case of simple and straightforward HTTP requests, the visitor can stay anonymous besides cookies and IP address, which can again be spoofed.

The availability of HTML scraping is universal and there is no need to wait for the opening of the site for an API or for contacting anyone in the organization. One simply needs to spend some time and browse websites at a leisurely pace until the data you want is available and then find out the basic patterns to access the same.

Now you need to don a hat of a professional scraper and simply dive in. Initially, it may take some time to work up figuring out the way the data have been structured and the way it can be accessed just as we read APIs. If there is no documentation unlike APIs, you need to be a little more smart about it and use clever tricks.

Some of the most used tricks are

Data Fetching

The first thing that is required is data fetching. Find endpoints to begin with, that is the URLs that can help in returning the data that is required. If you are pretty sure about the data and the way it should be structured so as to match your requirements, you will require a particular subset for the same and later you can indulge in site browsing using the navigation tools.

GET Parameter

The URLs must be paid attention to and see the way it changes as you indulge in clicking between the sections and the way they divide into various subsections. Before starting, the other option that can be used is to straight away go to the search functionality of the site. Certain terms can be typed and the URL needs to be focused again for watching the changes on the basis of what is being searched. A GET parameter will be probably seen like q which changes on the basis of the search term used by you. Other GET parameters that are not being used can be removed from the URL until only the ones that are needed are left for data loading. Before a query string, there must always be a “?” beginning.

Now the time has come when you would have started to come across the data that you would like to see and want to access, but sometimes, there may be certain pagination issues that require to be dealt with. Due to these issues, you may not be able to see the data in its entirety. Single requests are kept away by many APIs as well from database slamming. Many times, clicking the next page can add some offset parameter that helps in data visibility on the page. All these steps will help you succeed in HTML scraping.

Source: https://www.promptcloud.com/blog/what-is-html-scraping-and-how-it-works/

Wednesday, 8 April 2015

Thoughts on scraping SERPs and APIs

Google says that scraping keyword rankings is against their policy from what I've read. Bummer. We comprise a lot of reports and manual finding and entry was a pain. Enter Moz! We still manually check and compare, but it's nice having that tool. I'm confused now though about practices and getting SERPs in an automated way. Here are my questions

Is it against policy to get SERPs from an automated method? If that is the case, isn't Moz breaking this policy with it's awesome keyword tracker?
If it's not, and we wanted to grab that kind of data, how would we do it? Right now, Moz's API doesn't offer this data. I thought Raven Tools at one point offered this, but they don't now from what I've read. Are there any APIs out there that we can grab this data and do what we want with it? (let's day build our own dashboard)?

Thanks for any clarification and input!

Source: http://moz.com/community/q/thoughts-on-scraping-serps-and-apis

Sunday, 5 April 2015

Some Traps to know and avoid in Web Scraping

In the present day and age, web scraping comes across as a handy tool in the right hands. In essence, web scraping means quickly crawling the web for specific information, using pre-written programs. Scraping efforts are designed to crawl and analyze the data of entire websites, and saving the parts that are needed. Many industries have successfully used web scraping to create massive banks of relevant, actionable data that they use on a daily basis to further their business interests and provide better service to customers. This is the age of the Big Data, and web scraping is one of the ways in which businesses can tap into this huge data repository and come up with relevant information that aids them in every way.

Web scraping, however, does come with its own share of problems and roadblocks. With every passing day, a growing number of websites are trying to actively minimize the instance of scraping and protect their own data to stay afloat in today’s situation of immense competition. There are several other complications which might arise and several traps that can slow you down during your web scraping pursuits. Knowing about these traps and how to avoid them can be of great help if you want to successfully accomplish your web scraping goals and get the amount of data that you require.

Complications in Web Scraping

Over time, various complications have risen in the field of web scraping. Many websites have started to get paranoid about data duplication and data security problems and have begun to protect their data in many ways. Some websites are not generally agreeable to the moral and ethical implications of web scraping, and do not want their content to be scraped. There are many places where website owners can set traps and roadblocks to slow down or stop web scraping activities. Major search engines also have a system in place to discourage scraping of search engine results. Last but not the least, many websites and web services announce a blanket ban on web scraping and say the same in their terms and conditions, potentially leading to legal issues in the event of any scraping.

Here are some of the most common complications that you might face during your web scraping efforts which you should be particularly aware about –

•    Some locations on the intranet might discourage web scraping to prevent data duplication or data theft.

•    Many websites have in place a number of different traps to detect and ban web scraping tools and programs.

•    Certain websites make it clear in their terms and conditions that they consider web scraping an infringement of their privacy and might even consider legal redress.

•    In a number of locations, simple measures are implemented to prevent non-human traffic to websites, making it difficult for web scraping tools to go on collecting data at a fast pace.

To surmount these difficulties, you need a deeper and more insightful understanding of the way web scraping works and also the attitude of website owners towards web scraping efforts. Most major issues can be subverted or quietly avoided if you maintain good working practice during your web scraping efforts and understand the mentality of the people whose sites you are scraping.

Web Crawling Services & Web Scraping Services

Common Problems

With automated scraping, you might face a number of common problems. The behavior of web scraping programs or spiders presents a certain picture to the target website. It then uses this behavior to distinguish between human users and web scraping spiders. Depending on that information, a website may or may not employ particular web scraping traps to stop your efforts. Some of the commonly employed traps are –

Crawling Pattern Checks – Some websites detect scraping activities by analyzing crawling patterns. Web scraping robots follow a distinct crawling pattern which incorporates repetitive tasks like visiting links and copying content. By carefully analyzing these patterns, websites can determine that they are being caused by a web scraping robot and not a human user, and can take preventive measures.

Honeypots – Some websites have honeypots in their webpages to detect and block web scraping activities. These can be in the form of links that are not visible to human users, being disguised in a certain way. Since your web crawler program does not operate the way a human user does, it can try and scrape information from that link. As a result, the website can detect the scraping effort and block the source IP addresses.

Policies – Some websites make it absolutely apparent in their terms and conditions that they are particularly averse to web scraping activities on their content. This can act as a deterrent and make you vulnerable against possible ethical and legal implications.

Infinite Loops – Your web scraping program can be tricked into visiting the same URL again and again by using certain URL building techniques.

These traps in web scraping can prove to be detrimental to your efforts and you need to find innovative and effective ways to surpass these problems. Learning some web crawler tips to avoid traps and judiciously using them is a great way of making sure that your web scraping requirements are met without any hassle.

What you can do

The first and foremost rule of thumb about web scraping is that you have to make your efforts as inconspicuous as possible. This way you will not arouse suspicion and negative behavior from your target websites. To this end, you need a well-designed web scraping program with a human touch. Such a program can operate in flexible ways so as to not alert website owners through the usual traffic criteria used to spot scraping tools.

Web scraping for ecommerce data extraction

Some of the measures that you can implement to ensure that you steer clear of common web scraping traps are –

•    The first thing that you need to do is to ascertain if a particular website that you are trying to scrape has any particular dislike towards web scraping tools. If you see any indication in their terms and conditions, tread cautiously and stop scraping their website if you receive any notification regarding their lack of approval. Being polite and honest can help you get away with a lot.

•    Try and minimize the load on every single website that you visit for scraping. Putting a high load on websites can alert them towards your intentions and often might cause them to develop a negative attitude. To decrease the overall load on a particular website, there are many techniques that you can employ.

•    Start by caching the pages that you have already crawled to ensure that you do not have to load them again.

•    Also store the URLs of crawled pages.

•    Take things slow and do not flood the website with multiple parallel requests that put a strain on their resources.

•    Handle your scraping in gentle phases and take only the content you require.

•    Your scraping spider should be able to diversify its actions, change its crawling pattern and present a polymorphic front to websites, so as not to cause an alarm and put them on the defensive.

•    Arrive at an optimum crawling speed, so as to not tax the resources and bandwidth of the target website. Use auto throttling mechanisms to optimize web traffic and put random breaks in between page requests, with the lowest possible number of concurrent requests that you can work with.

•    Use multiple IP addresses for your scraping efforts, or take advantage of proxy servers and VPN services. This will help to minimize the danger of getting trapped and blacklisted by a website.

•    Be prepared to understand the respect the express wishes and policies of a website regarding web scraping by taking a good look at the target ‘robots.txt’ file. This file contains clear instructions on the exact pages that you are allowed to crawl, and the requisite intervals between page requests. It might also specify that you use a pre-determined user agent identification string that classifies you as a scraping bot. adhering to these instructions minimizes the chance of getting on the bad side of website owners and risking bans.

Use an advanced tool for web scraping which can store and check data, URLs and patterns. Whether your web scraping needs are confined to one domain or spread over many, you need to appreciate that many website owners do not take kindly to scraping. The trick here is to ensure that you maintain industry best practices while extracting data from websites. This prevents any incident of misunderstanding, and allows you a clear pathway to most of the data sources that you want to leverage for your requirements.

Hope this article helps in understanding the different traps and roadblocks that you might face during your web scraping endeavors. This will help you in figuring out smart, sensible ways to work around them and make sure that your experience remains smooth. This way, you can keep receiving the important information that you need with web scraping. Following these basic guidelines can help you prevent getting banned or blacklisted and stay in the good books of website owners. This will allow you continue with your web scraping activities unencumbered.

Source:https://www.promptcloud.com/blog/some-traps-to-avoid-in-web-scraping/