reading notes for code fellows
Web scraping is interesting in that I can definitely see how it would be useful in general, but I will have to see how we end up using it in class.
Web scraping is a technique for automatically accessing and extracting large amounts of information from a website. Whenever you plan to use web scraping, it is important to read through a website’s terms and conditions to know how you can legally use the data, and to be sure you aren’t downloading the data too quickly, since doing so could break the site and potentially get you blocked. In preparation to scrape a site, you need to find out which HTML tags contain the information you want, then import requests, urllib.request, time, and BeautifulSoup from bs4. You then use requests.get(url) to retrieve the site data; if you get back a 200 response, your request was successful. After this, you can use BeautifulSoup to parse the data and make it easier to read, and use find_all to get the data within the tags you identified earlier. Once you have this data, you can extract what you want from the links, and you should add a time.sleep(1) to pause the code for a second after each request so you avoid being flagged as a spammer.
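Here is a minimal sketch of that flow, assuming a hypothetical URL and that the data we want lives in ordinary anchor tags:

```python
import time
import requests
from bs4 import BeautifulSoup

url = 'https://example.com/listings'  # hypothetical URL, swap in the real site

response = requests.get(url)
if response.status_code == 200:  # 200 means the request succeeded
    # parse the raw HTML so it is easy to search
    soup = BeautifulSoup(response.text, 'html.parser')
    # find_all grabs every matching tag; 'a' is just an assumed target tag here
    for link in soup.find_all('a'):
        print(link.get('href'))

time.sleep(1)  # pause for a second so we don't get flagged as a spammer
```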
Web scraping is used to extract data from websites; it can directly access the world wide web using the hypertext transfer protocol or a web browser. It can be performed manually, but is usually done by an automated process and is a form of copying in which specific data is gathered from the web for later analysis. The process involves fetching and extracting data from a site, and a major portion of this is web crawling, or fetching pages for later processing. Most often, web scraping is used for contact scraping, web indexing, web mining, data mining, online price change monitoring, product review scraping, gathering real estate listings, weather data monitoring, website change detection, research, tracking online presence, web mashups, and web data integration. Many sites prevent web scraping by disallowing bots from viewing their pages, which in turn has prompted systems that attempt to simulate human browsing in order to gather web page content for offline parsing.
Some sites have a robots.txt file. It is best practice to follow it while scraping, as it spells out specific rules for good behavior. If it contains lines such as User-agent: * and Disallow: /, the site does not want to be scraped. However, since many sites that don’t allow scraping by bots still want to be indexed by Google, you can theoretically scrape their data if you need it, provided you can keep the site from realizing you are using a scraper. One way to keep from being blocked is to crawl slowly and not slam the server with requests. You can put random sleep calls in between requests to appear more human, and use auto-throttling mechanisms that slow the crawling speed based on the load on the spider and the site. Try not to follow the same crawling pattern every time; sites can detect spiders by the way they move if left on default settings, so you should incorporate random clicks and mouse movements to appear more human. A site you’re scraping can see your IP address, and multiple requests from the same IP can get you blocked, so you should use proxies to make it harder to detect that the requests are coming from the same place. You should also set a user-agent that you can change, since the user-agent tells the site what browser you are using, and too many requests from the same one will also get you blocked. Don’t send cookies unless the scraper depends on them to function. You should also add some functionality to detect honeypot traps, which are links invisible to human users that will only be hit by bots. Finally, avoid scraping a site that requires a login, as each visit requires the scraper to send identifying data and makes it easier to detect.
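As a rough sketch of a few of those habits (the site, path, and user-agent string here are made up), you could check robots.txt before each request, send your own user-agent, and sleep a random amount of time between requests:

```python
import random
import time
import requests
from urllib.robotparser import RobotFileParser

BASE_URL = 'https://example.com'         # hypothetical site
USER_AGENT = 'my-reading-notes-bot/0.1'  # assumed identifier; change or rotate as needed

# read the site's robots.txt so we can respect its rules
robots = RobotFileParser(BASE_URL + '/robots.txt')
robots.read()

def polite_get(path):
    url = BASE_URL + path
    if not robots.can_fetch(USER_AGENT, url):
        return None  # the site asked bots not to touch this path
    response = requests.get(url, headers={'User-Agent': USER_AGENT})
    # random pause between requests so the pattern looks less mechanical
    time.sleep(random.uniform(1, 4))
    return response

page = polite_get('/listings')  # '/listings' is a made-up path for illustration
```

Proxies and honeypot detection would go beyond this, but the same wrapper is a natural place to add them.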
You can tell if you have been blocked or banned if you see captcha pages, unusual delays, or frequent error messages.
Would you have a web scraper constantly running on a backend server rather than in an application, since applications turn on and off?
This sort of thing is exactly what I was trying to learn to do a few years ago, when a friend and I were doing what we could to hunt down a Switch for her while they were all sold out. I knew tools like this existed and that I could theoretically create one, but I was never sure how and couldn’t find anything that explained it. This is excellent for checking when things come back in stock or for figuring out whether a new posting somewhere is in a price range that seems legitimate.