
Parsing Resources with Python


I want to note that work on this article is not yet finished. If you have comments or additions, you are welcome to leave them in the comments.

Important


Always check first whether the site offers its own API; RSS/Atom feeds can also be useful.

Requirements

We will use two additional libraries for Python.

Requests

We will use the requests library instead of urllib2, since it is superior to urllib2 in every respect. I could argue this at length, but it seems to me the library's own page says it all in one paragraph:

The urllib2 library gives us most of what we need to work with HTTP, but the API leaves much to be desired. It was built for a different time and a different web, and it requires an incredible amount of work even for simple tasks.

LXML

LXML is an XML/HTML parser; we will use it to extract the information we need from web pages. Some prefer BeautifulSoup, but I am comfortable writing my own XPath expressions and use lxml for maximum speed.

Crawling and Parsing

Crawling is ordinary parsing, except that the link to the next page is obtained from the page you are currently on (a search robot, for example, visits a site and collects all its links, follows them, collects the remaining links, follows those, and so on). Crawling is used to build large databases.

Examples

  • Crawling categories on the ebay site (by the way, they have an API)
  • Crawling contact data from telephone directories

Saving visited addresses

Visited addresses should be stored so that they do not have to be parsed more than once. For up to roughly 50,000 addresses, I would recommend using a set (the built-in data type). You can simply check whether an address is already in the set before adding it to the parsing queue.
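A minimal sketch of that check, using only the standard library (the example URLs are made up):

```python
import collections

# URLs discovered so far, in the order we will process them
queue = collections.deque()
# Every address we have ever queued; checked before enqueueing
seen = set()

def enqueue(url):
    """Queue a URL for parsing unless it has been seen already."""
    if url not in seen:
        seen.add(url)
        queue.append(url)

enqueue('http://example.com/a')
enqueue('http://example.com/b')
enqueue('http://example.com/a')  # duplicate, silently ignored

print(len(queue))  # 2
```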

If you are going to parse a large site, the previous method will not help: the set of links will simply grow too large and use too much memory. The solution is a Bloom filter. Without going deep into the details, this filter stores visited addresses and answers the question "Have I already been at this address?" correctly with very high probability, while using a very small amount of memory. pybloomfiltermmap is a very good implementation of a Bloom filter in Python.
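Purely to illustrate the idea (a real crawler should use a proper implementation such as pybloomfiltermmap), here is a toy Bloom filter built on hashlib; the bit-array size and hash count below are arbitrary:

```python
import hashlib

class ToyBloomFilter:
    """Illustrative Bloom filter: k hash positions over a bit array."""

    def __init__(self, size_bits=10000, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive k positions by salting the item with the hash index
        for i in range(self.num_hashes):
            digest = hashlib.sha256(('%d:%s' % (i, item)).encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every position's bit is set; may rarely give a
        # false positive, never a false negative
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = ToyBloomFilter()
bf.add('http://example.com/page1')
print('http://example.com/page1' in bf)  # True
print('http://example.com/other' in bf)  # False (with overwhelming probability)
```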

Important: Be sure to normalize the URL before adding it to the filter. Depending on the site, you may need to strip all parameters from the URL. You are unlikely to want to store the same address several times.
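A possible normalization sketch using the standard library's urllib.parse; exactly which parts to strip depends on the site:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Drop the query string and fragment, lowercase scheme and host,
    and strip a trailing slash from the path."""
    parts = urlsplit(url)
    path = parts.path.rstrip('/') or '/'
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, '', ''))

print(normalize_url('HTTP://Example.com/Shop/?utm_source=feed#top'))
# http://example.com/Shop
```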

CSS selectors

If you have experience with JavaScript, you may find it more convenient to make selections from the DOM using CSS selectors rather than XPath. The lxml library can help you here: its cssselect integration provides a css_to_xpath method. Here is a quick guide: http://lxml.de/cssselect.html

I would also recommend PyQuery , as this is also a very common tool.

Extract the main content block

Sometimes you only need to analyze the main content of a page, for example, to count its words. In that case you can exclude the navigation menu or the sidebar.

You can use readability-lxml for this purpose. Its main task is to extract the main content from a page (the extracted content may still contain HTML). Continuing the word-count example, you can strip the HTML with clean_html from lxml.html.clean, split the result on whitespace, and count.
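readability-lxml and clean_html do the heavy lifting in practice; as a simplified stand-in, here is a sketch that strips tags with a regex and counts the remaining words (the sample HTML is made up, and a regex is no substitute for a real HTML cleaner):

```python
import re

def count_words(html_fragment):
    """Replace tags with spaces (clean_html from lxml is far more robust),
    then split on whitespace and count."""
    text = re.sub(r'<[^>]+>', ' ', html_fragment)
    return len(text.split())

content = '<div><h1>Title here</h1><p>Some body text.</p></div>'
print(count_words(content))  # 5
```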

Parallelism

Everyone wants to get data as quickly as possible, but by firing 200 concurrent requests you will only anger the host's owner. Do not do this. Do everyone a favor and allow no more than 5 simultaneous requests.

To run concurrent requests, use the grequests library. A couple of lines are enough to parallelize your code:

import grequests

# urls is an iterable of address strings
rs = (grequests.get(u) for u in urls)
responses = grequests.map(rs, size=5)  # at most 5 concurrent requests

Robots.txt

Every guide out there says you should follow the rules in the robots.txt file. But here we will go against the grain. The fact is that these rules are often very restrictive and do not reflect the site's real limits. Most often this file is generated by the framework's defaults or by some search engine optimization plugin.

- Advertisement -

The usual exception is when you are writing a general-purpose crawler, like Google's or Bing's. In that case I would advise paying attention to robots.txt to spare your crawler unnecessary delays. And if the host's owner does not want some content indexed, then you should not index it.
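The standard library's urllib.robotparser can read these rules for you. A small sketch, using a made-up robots.txt parsed from a string rather than fetched over HTTP:

```python
from urllib import robotparser

# A made-up robots.txt for illustration
rules = """
User-agent: *
Disallow: /private/
Crawl-delay: 10
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch('mybot', 'http://example.com/public/page.html'))   # True
print(rp.can_fetch('mybot', 'http://example.com/private/data.html'))  # False
print(rp.crawl_delay('mybot'))  # 10
```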

How not to be noticed?

You do not always want your actions to be noticed. In fact, this is fairly easy to achieve. To start, you should rotate the User-Agent value randomly. Your IP address will not change, but that happens often on the web anyway: most large organizations, such as universities, route Internet access for their entire local network through a single IP address.
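A minimal sketch of rotating the User-Agent; the strings below are illustrative samples, not a curated list:

```python
import random

# A few sample desktop User-Agent strings (illustrative only)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
]

def random_headers():
    """Build request headers with a randomly chosen User-Agent."""
    return {'User-Agent': random.choice(USER_AGENTS)}

print(random_headers()['User-Agent'])
```

The resulting dict can then be passed to requests via the headers parameter, e.g. requests.get(url, headers=random_headers()).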

The next tool is a proxy server. There are plenty of inexpensive proxy servers. If your queries come from a hundred different addresses, with a constantly changing User-Agent, and all of it from different data centers, then you will be hard to track. My Private Proxy is one good provider.

It got to the point that, by the time the hosts blocked the proxy servers I was using, the proxies themselves had already rotated to new IP addresses.

Inherit from the Response class

I would always advise subclassing the requests.models.Response class, because it contains very useful methods.

Frequent problems

Installing LXML

Problems installing lxml are quite common, especially on Linux. You simply need to install the dependencies first:

apt-get install libxml2-dev libxslt-dev python-dev lib32z1-dev

and after that you can run pip install lxml as usual.

How to parse sites with Ajax

Ajax usually simplifies parsing. Using the Network tab in Google Chrome, you can inspect the Ajax requests as well as the responses to them, which usually come as JSON. It turns out you often do not need to parse the page at all: you can just send a request directly to the same endpoint. If that does not work, you should drive the browser itself. Read the next section for more information.
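Once you have the JSON response body, the standard json module is all you need; the payload below is a made-up example of what such an endpoint might return:

```python
import json

# A made-up JSON payload, as an Ajax endpoint might return it
payload = '{"items": [{"title": "First"}, {"title": "Second"}], "page": 1}'
data = json.loads(payload)

# The parsed data is plain Python dicts and lists
print([item['title'] for item in data['items']])  # ['First', 'Second']
print(data['page'])  # 1
```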

The site serves different content to the parser

  1. Check the page source and make sure the data you are looking for is actually there. The content may have been generated by JavaScript, which our parser does not execute. In that case there is only one way out: PhantomJS. You can control it from Python using selenium or splinter.
  2. Sites often serve different content to different browsers. Compare the User-Agent settings of your browser and of your parser.
  3. Sometimes sites detect your geographic location and change their content based on it. If you run your script from a third-party server, this can affect the response you receive.

Pagination

Sometimes you have to parse data that is split across pages. Fortunately, this is also quite easy to handle. Look closely at the URL: as a rule, there is a page or offset parameter responsible for the page number. Here is an example of driving such a parameter:

import requests

base_url = "http://example.com/something/?page=%s"
for url in [base_url % i for i in range(10)]:
    r = requests.get(url)

Error 403 / Rate limiting

The solution is simple: just back off from the site for a while. Be careful with the number of concurrent requests. Do you really need the data within the hour, or can it wait a bit?

If you really do need it, then turn to proxy servers. Just build a simple proxy manager that ensures each proxy server is used no more often than once per interval X. See the sections "How not to be noticed?" and "Parallelism".
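A sketch of such a manager: it always hands out the least recently used proxy and sleeps if that proxy was used too recently. The proxy addresses and the interval are made up for illustration:

```python
import time

class ProxyManager:
    """Rotate proxies, using each at most once per `interval` seconds."""

    def __init__(self, proxies, interval=10.0):
        self.interval = interval
        # Timestamp of each proxy's last use (0.0 = never used)
        self.last_used = {p: 0.0 for p in proxies}

    def get(self):
        """Return an available proxy, sleeping until one is free if needed."""
        proxy = min(self.last_used, key=self.last_used.get)
        wait = self.last_used[proxy] + self.interval - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        self.last_used[proxy] = time.monotonic()
        return proxy

manager = ProxyManager(['http://10.0.0.1:8080', 'http://10.0.0.2:8080'],
                       interval=0.1)
# Consecutive calls cycle through the proxies
print(manager.get())
print(manager.get())
```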

Code Examples

On their own, the following examples are not of much use, but they make handy references.

Example of a simple query

There is not much to show in this example. Everything you need is in the documentation for the Response class.

import requests

response = requests.get('http://google.com/')

# Response
print(response.status_code)      # status code
print(response.headers)          # response headers
print(response.content)          # response body

# Request
print(response.request.headers)  # headers sent with the request

Request with proxy

import requests

proxy = {'http':  'http://102.32.33.1:8080',
         'https': 'http://102.32.33.1:4444'}
response = requests.get('http://google.com/', proxies=proxy)

Parsing the contents of the response

It is fairly easy to parse the retrieved HTML with lxml. Once the document has been converted into an element tree, you can use XPath to extract the data.

import requests
from lxml import html

response = requests.get('http://google.com/')

# Convert the document body into an element tree (DOM)
parsed_body = html.fromstring(response.text)

# Run XPath queries against the element tree
print(parsed_body.xpath('//title/text()'))  # page title
print(parsed_body.xpath('//a/@href'))       # href attribute of every link

Download all images from the page

The following script downloads all images and stores them in downloaded_images/. Just remember to create that directory first.

import sys
from urllib.parse import urljoin

import requests
from lxml import html

response = requests.get('http://imgur.com/')
parsed_body = html.fromstring(response.text)

# Extract the image links
images = parsed_body.xpath('//img/@src')
if not images:
    sys.exit("Found no images")

# Convert all relative links to absolute
images = [urljoin(response.url, url) for url in images]
print('Found %s images' % len(images))

# Download only the first 10, writing in binary mode
for url in images[:10]:
    r = requests.get(url)
    with open('downloaded_images/%s' % url.split('/')[-1], 'wb') as f:
        f.write(r.content)

Single-threaded crawler

This crawler only demonstrates the basic principle of construction: it pops URLs from the left of a queue and parses pages as they are discovered. The script does nothing useful; it simply prints each page's title.

import collections
from urllib.parse import urljoin

import requests
from lxml import html

STARTING_URL = 'http://google.com/'

urls_queue = collections.deque()
urls_queue.append(STARTING_URL)
found_urls = set()
found_urls.add(STARTING_URL)

while len(urls_queue):
    url = urls_queue.popleft()

    response = requests.get(url)
    parsed_body = html.fromstring(response.content)

    # Print the page title
    print(parsed_body.xpath('//title/text()'))

    # Collect all links on the page, converted to absolute URLs
    links = [urljoin(response.url, link)
             for link in parsed_body.xpath('//a/@href')]
    for link in links:
        # Queue the link only if it is new and uses HTTP/HTTPS
        if link not in found_urls and link.startswith('http'):
            found_urls.add(link)
            urls_queue.append(link)

