Making an offline file system copy of a Plone site using WebDAV

You might want to create an offline copy of a Plone site because:

  • You are traveling and want to have all the files on the site with you (e.g. PDFs for reading)
  • You are taking a site down and making the final back-up
  • You just want to feel how cool Plone is

Plone supports WebDAV.

Creating a browsable offline file system copy of a Plone site is a matter of:

  • Enabling WebDAV
  • Logging in to the site via WebDAV. On OSX, use Cyberduck; Finder (OSX's own file browser) works but may have issues. You might need Zope admin privileges for certain operations.
  • Dragging and dropping the Plone site to your hard disk

The WebDAV copy process works smoothly:

  • The folder and page structure stays intact
  • Files are copied as-is (think PDFs and Word documents)
  • Images are copied as-is
  • Pages (HTML) are converted to special files, which are still readable in a plain-text editor

Note that some special folders (acl_users, reference_catalog, etc.) might be exposed through WebDAV, but they are not really copyable. Just ignore these during the copy process.

You can also use WebDAV to mass-upload files and images to your image bank instead of uploading them manually through the web interface.
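
Because WebDAV uploads are plain HTTP PUT requests, the mass upload can also be scripted. Below is a minimal, hypothetical Python 2 sketch using only the standard library; the host, port, folder path and credentials are assumptions that depend on your Zope WebDAV configuration:

import os
import base64
import httplib

def webdav_put(host, port, remote_path, local_file, username, password):
    """ Upload one local file to a Plone folder over WebDAV (HTTP PUT). """
    auth = base64.b64encode("%s:%s" % (username, password))
    headers = {"Authorization": "Basic " + auth}
    body = open(local_file, "rb").read()
    conn = httplib.HTTPConnection(host, port)
    conn.request("PUT", remote_path, body, headers)
    response = conn.getresponse()
    print local_file, response.status, response.reason
    conn.close()

# Upload every file in a local "images" folder to an assumed /Plone/images folder
for name in os.listdir("images"):
    webdav_put("localhost", 8080, "/Plone/images/" + name,
               os.path.join("images", name), "admin", "secret")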


RFC: Simple Internet Question Asking Protocol (for human beings)

This is my version 0.1 attempt to teach the world how one should ask questions in Internet discussions in the simplest possible way. To keep it simple, I try to keep this short. This post sprouts from my frustration with people's inability to form questions one could easily answer.

Assumptions

If you want to ask a question on a forum, on IRC (chat) or on a mailing list:

  1. Assume people are busy
  2. Assume that people want to help you, even though they are busy, since they volunteer to participate in the community discussion and thus must care about the community

To make it a win-win situation, you, as the asker, are responsible for making the process of asking and answering the question as easy as possible. Form your question in such a way that it is as easy as possible for readers to place themselves in your situation and think how they themselves would solve it (Mikko's rule of empathy).

The less time it takes to understand your situation, the more likely people are to contribute their time.

Question process

Thus, I propose that you always follow these three simple steps when asking a question:

  1. Before asking the question tell what you already know
  2. Describe the problem
  3. Ask what you do not know yet

Then wait patiently for the answer (the busy part).

Pitfalls

These issues often stem from the fact that the person asking the question is not familiar with text-based communication where people’s time (bandwidth) is limited and the lack of body gestures often leads to misinterpretations.

  1. Do not ask yes/no questions. You are skipping steps #1 and #3.
  2. Do not saturate the bandwidth: do not repeat yourself or otherwise flood the medium. If people are busy, repeating yourself does not make them any less busy. You are breaking assumption #1.
  3. Do not try to pull excessive attention to yourself – do not highlight your question with “PLEASE HELP !!!!”. Even if it is a matter of life and death for you, it is not for the other people, who are dealing with their own matters of life and death. You are breaking assumption #2.

Example

Q: Is it possible to fly me to the Moon? A: Yes

Q: I am an evil super-villain whose plan to take over the world failed. Now I must escape. I am looking for methods to take me to the Moon or to orbit, where national laws do not apply. I am not sure whether I should use a shuttle or a rocket. Where could I obtain such a vehicle?

A: The US of A just retired a reliable space shuttle that you could use. But if I were you, I'd consider an underwater base instead, as they will become cheaper in the long run, since you can more easily produce breathable oxygen.


Enable PHP log output (error_log) on XAMPP on OSX

If you are using XAMPP to develop PHP software (WordPress, Joomla!) on OSX, you might want to get some advanced logging output from your code. PHP provides a nice error_log() function, but it is silent by default. Here are short instructions on how to enable it and follow the log.

Use your favorite editor to edit the php.ini file at /Applications/XAMPP/etc/php.ini – sudo privileges are needed; Smultron handles this out of the box.

Change lines:

log_errors = Off
;error_log = filename

To:

log_errors = On
error_log = /tmp/php.log

Restart Apache using the XAMPP control application found in Finder -> Applications.

Now use the following UNIX command to see a continuous log flow in your terminal:

tail -f /tmp/php.log

See also the earlier article about XAMPP and file permissions.


Everyone loves and hates console.log()

console.log() is the best friend of every Javascript junkie. However, the lack of it isn't. The console.log() function is only available in WebKit-based browsers and in Firefox with Firebug. It's the infamous situation: someone leaves a console.log() call in the Javascript code, doesn't notice its presence, commits the file, and suddenly all Javascript on the production server stops working for Internet Explorer users…

There are several approaches to tackling the missing console.log() problem.

Use dummy placeholder if console is missing

This snippet defines a dummy console.log if the real one is missing (you need to repeat this for console.error etc.):

// Ignore console on platforms where it is not available
if (typeof(window["console"]) == "undefined") {
    console = {};
    console.log = function(a) {};
}

Pros

  • Easy

Cons

  • Need to add to every Javascript file
  • Messes with global namespace

Use module specific log function

This makes your code a little bit uglier and more Java-like. Each Javascript module declares its own log() function, which checks for the existence of console.log() and outputs there if it's present.

// Module namespace object
var mfabrik = mfabrik || {};

mfabrik.log = function(x) {
    // Output only if the browser provides a console
    if (typeof console != "undefined" && console.log) {
        console.log(x);
    }
};

mfabrik.log("My log messages");

Pros

  • Easy to hook in other logging backends
  • You can disable all logging output with one if

Cons

  • Not as natural to write as console.log()
  • Need to add to every Javascript module

Preprocess Javascript files

Plone (Kukit / KSS) uses this approach. All debug Javascript is hidden behind conditional comment markers and is filtered out when the JS files are bundled for production deployment. (The preprocessing code is written in Python, for those who are interested in it.)

if (_USE_BASE2) {
 // Base2 legacy version: matchAll has to be used
 // Base2 recent version: querySelectorAll has to be used
 var _USE_BASE2_LEGACY = (typeof(base2.DOM.Document.querySelectorAll) == 'undefined');
 if (! _USE_BASE2_LEGACY) {
 ;;;     kukit.log('Using cssQuery from base2.');
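
As an illustration of the idea (this is not the actual KSS packer code), a minimal Python preprocessor could simply drop every line prefixed with ;;; when building the production bundle:

# A minimal sketch of a ;;;-style debug line stripper, illustrating the
# approach only; it is not the actual KSS / Kukit preprocessing code.
import re

DEBUG_LINE = re.compile(r"^\s*;;;")

def strip_debug_lines(javascript):
    """ Return Javascript source with all ;;;-prefixed debug lines removed. """
    lines = javascript.splitlines()
    return "\n".join(line for line in lines if not DEBUG_LINE.match(line)) + "\n"

def bundle(filenames, output="bundle.js"):
    """ Concatenate Javascript files into one production bundle without debug lines. """
    out = open(output, "w")
    for filename in filenames:
        out.write(strip_debug_lines(open(filename).read()))
    out.close()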

Pros

  • Makes production Javascript files lighter
  • Makes production Javascript files more professional – you do not deliver logging statements intended for internal purposes to your site visitors

Cons

  • Complex – preprocessing is required

Commit hooks

You can use Subversion and Git commit hooks to check that committed JS files do not contain console.log. For example, Plone repositories do this for the Python statement import pdb ; pdb.set_trace() (a forgotten pdb breakpoint).
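
As an example of the idea, a client-side Git pre-commit hook could scan the staged changes for console.log(). Below is a hypothetical sketch; it is not an official hook, just one way to do the check:

#!/usr/bin/env python
# A hypothetical Git pre-commit hook rejecting commits that add console.log()
# calls. Save as .git/hooks/pre-commit and make it executable.
import subprocess
import sys

# Look only at lines being added in the staged changes
diff = subprocess.Popen(["git", "diff", "--cached", "--unified=0"],
                        stdout=subprocess.PIPE).communicate()[0]

offending = [line for line in diff.splitlines()
             if line.startswith("+") and "console.log" in line]

if offending:
    print "Commit rejected, console.log() found in staged changes:"
    for line in offending:
        print "    " + line
    sys.exit(1)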

Pros

  • Very robust approach – you cannot commit code with console.log()

Cons

  • Prevents also legitimate use of console.log()
  • Github, for example, offers no way to push client-side commit hooks to repository cloners. This means every developer must install the commit hooks manually, and everything you need to do manually makes the process error-prone.

Other approaches?

Please tell us!


Mirroring App Engine production data to development server using appcfg.py

Google App Engine provides some remote API functionality out of the box. One of the remote API features is downloading data from the production server. After downloading, you can upload the data to your development server, effectively mirroring the content of the production server to your local development server. This is very useful if you are working on a CMS, sites, etc. where you want to test new layouts or views locally against old data before putting them into production.

First, enable the remote API in app.yaml:

- url: /remote_api
  script: $PYTHON_LIB/google/appengine/ext/remote_api/handler.py
  login: admin

Note: Using the builtins app.yaml directive didn't work for me for some reason, so I had to specify the remote API URI manually.

After this you should be able to download data. Here I am using a global appcfg.py installation on OSX. Below is the command and sample output.

appcfg.py -e yourgoogleaccount@gmail.com download_data --url=http://yourappid.appspot.com/remote_api --filename=data.sqlite3
...
Downloading data records.
[INFO    ] Logging to bulkloader-log-20110313.222523
[INFO    ] Throttling transfers:
[INFO    ] Bandwidth: 250000 bytes/second
[INFO    ] HTTP connections: 8/second
[INFO    ] Entities inserted/fetched/modified: 20/second
[INFO    ] Batch Size: 10
...
[INFO    ] Have 1803 entities, 0 previously transferred
[INFO    ] 1803 entities (972883 bytes) transferred in 91.0 seconds

data.sqlite3 is your production database dump in SQLite 3 binary format (used internally by the development server).

If you have the sqlite3 command-line tool installed, you can explore the data dump:

sqlite3 data.sqlite3
SQLite version 3.7.5
Enter ".help" for instructions
Enter SQL statements terminated with a ";"

sqlite> .tables
Apps                                   your-app!Model1!Entities
IdSeq                                  your-app!Model1!EntitiesByProperty
Namespaces                             your-app!Model2!Entities
bulkloader_database_signature          your-app!Model2!EntitiesByProperty
your-app!!Entities                     result
your-app!!EntitiesByProperty

Now you can upload data.

Note: Even though dev_appserver.py has the option --use_sqlite, it looks like it cannot directly use the database file produced by download_data. You cannot just swap database files; you need to upload the downloaded data to the development server.

Start your development server:

dev_appserver.py .

In another terminal, go to the folder containing the downloaded data.sqlite3 and give the command:

appcfg.py upload_data --url http://localhost:8080/remote_api --file=data.sqlite3 --application=yourappid

It will ask you for credentials, but it seems that any username and password are accepted by the local development server.

Now you can log in to your local development server admin interface and use the Data Viewer to ensure your data got copied over:

http://localhost:8080/_ah/admin


Google App Engine: issues with dynamic instances and DeadlineExceededErrors

Dynamic instances and processing time

This Google App Engine feature came as a surprise to me, though it makes perfect sense: your site is slow if it has low traffic.

Google App Engine runs your Python code on instances. By default, instances are dynamic: they are shut down if they do not receive enough traffic (requests per minute). Thus, when your application gets individual hits only now and then, App Engine must restart an instance for each hit.

When this happens, you see the following in the App Engine console logs for every request on low-volume traffic:

This request caused a new process to be started for your application,
and thus caused your application code to be loaded for the first time.

Adding 500–2000 milliseconds of start-up delay on top of the normal processing time is not always acceptable. Google's own recommendation was that each page should be served within 200 milliseconds.

There are three ways to mitigate this issue:

  • Use the App Engine premium feature “Always On” ($0.30/day), which keeps your instance always running
  • Use a cron job or similar to keep your instance alive (polling once a minute seems to do the job; a sketch of such a handler follows below)
  • Optimize your imports and split your code into several modules with a light set of imports, so that start-up is fast (modules are imported only once)
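
As a sketch of the keep-alive approach above: a trivial handler mapped to an assumed /keepalive URL in app.yaml, requested every minute from cron.yaml (the URL and handler name are made up for this example):

# A hypothetical keep-alive handler. Map it to /keepalive in app.yaml and
# add a cron.yaml job ("schedule: every 1 minutes") that requests it, so
# the instance stays warm.
from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app

class KeepAliveHandler(webapp.RequestHandler):
    """ Do as little work as possible; the point is only to keep the instance alive. """

    def get(self):
        self.response.out.write("ok")

application = webapp.WSGIApplication([("/keepalive", KeepAliveHandler)])

def main():
    run_wsgi_app(application)

if __name__ == "__main__":
    main()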

We are using the Zabbix software to monitor our sites (side note: I don't recommend Zabbix as a first choice of monitoring software, as it is very difficult to use and has a bad user experience, alienating both sysadmins and developers). This is what we had before the optimizations – App Engine was starting a new process for every request:

… and this is the output we got after the optimizations:

Here is the corresponding diagram after the optimizations, from the App Engine dashboard itself. These processing times are without network latency. As far as I know, Google does not expose the endpoints of App Engine hosting, so you don't know from which part of the world your responses are served. By comparing this diagram to the one above, you can see how Internet traffic affects your App Engine application.

The PITA of dying instances

For some reason, App Engine instances sometimes misbehave. This causes HTTP requests to die ungracefully.

Normally this is not a problem, as you lose a few page loads now and then. People are used to “Internet grade” service and can hit the refresh button if they have problems opening a page.

However, if you are monitoring your site and it raises an unnecessary alarm in the middle of the night, waking up your bastard operator from Hell, he will be very angry the next morning and tell you to migrate the crappy software from unreliable Python / App Engine to more reliable PHP servers :(

This is what you see in App Engine logs:

A serious problem was encountered with the process that handled this request, causing it to exit.
This is likely to cause a new process to be used for the next request to your application.
If you see this message frequently, you may be throwing exceptions during the initialization of your application. (Error code 104)

After digging in deeper, you see that the problem is instantiating a new object in the database, exceeding the 30-second hard limit for processing an HTTP request:

2011-03-09 05:06:20.794 / 500 30094ms 86cpu_ms 40api_cpu_ms
0kb Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;
.NET CLR 2.0.50727),gzip(gfe),gzip(gfe),gzip(gfe)

<class 'google.appengine.runtime.DeadlineExceededError'>:
Traceback (most recent call last):
  File "/base/data/home/apps/mfabrikkampagne/1.347249742610459821/main.py", line 494, in main
    run_wsgi_app(application)
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/util.py", line 97, in run_wsgi_app
    run_bare_wsgi_app(add_wsgi_middleware(application))
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/util.py", line 115, in run_bare_wsgi_app
    result = application(env, _start_response)
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/__init__.py", line 515, in __call__
    handler.get(*groups)
  File "/base/data/home/apps/mfabrikkampagne/1.347249742610459821/main.py", line 296, in get
    try: self.session = Session()

So it looks like there is a temporary hiccup in Google App Engine's Datastore (BigTable?). In the example above the error comes from gaeutilities' Session model, but it could be any other model.

It is possible to catch DeadlineExceededError and temporarily work around it, as shown in the App Engine documentation.
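
A minimal sketch of that pattern in a webapp handler (do_expensive_stuff() below is a made-up placeholder for whatever is slow in your request):

# A minimal sketch of catching DeadlineExceededError, roughly following the
# pattern shown in the App Engine documentation.
from google.appengine.runtime import DeadlineExceededError
from google.appengine.ext import webapp

def do_expensive_stuff():
    # Hypothetical placeholder for the slow part of the request,
    # e.g. instantiating a Session object in the datastore
    pass

class MainHandler(webapp.RequestHandler):

    def get(self):
        try:
            do_expensive_stuff()
            self.response.out.write("done")
        except DeadlineExceededError:
            # We ran over the 30 second limit. Return a degraded response
            # instead of letting the request die with error code 104.
            self.response.clear()
            self.response.set_status(500)
            self.response.out.write("Request took too long, please try again.")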

The best way to handle this situation is to adjust your monitoring software – Zabbix in our case. Zabbix allows you to configure triggers so that they do not alarm on every bad item state change. Instead, you can use the min() function and trigger the alarm only after the trigger condition has failed every time during a monitoring period. Just make sure that the trigger period is at least twice as long as the update interval of your web scenario: this way Zabbix logs at least two item state changes and allows one of them to be a failure.

For example, if the update interval of the web scenario is 60 seconds, the trigger function must check that the minimum of the failure item is 1 during 2*60 seconds plus some buffer = 150 seconds:

{xxx.fi:web.test.fail[de.mfabrik.com].min(150)}=1

This will allow one failed response before the alarm is triggered.


Visual Studio and Microsoft go Python

Microsoft Technical Computing Group has released a beta version of its Python integration for Visual Studio.

This is, indeed, an interesting development, as it clearly shows that Python has reached a new level of programming language maturity. Receiving this much attention from mighty Microsoft means that Python is no longer a mere prospect in the club of enterprise solutions.

Python Tools for Visual Studio is not focused only on Microsoft's own .NET run-time: even Jython and PyPy are partially supported, claims the spec sheet. It also looks like some kind of cloud integration is on its way – maybe Microsoft wants to challenge Google App Engine by providing even better cloud development tools?

There also seems to be more information coming at PyCon…


How to render a portlet in Plone

It’s easy :) It took me only two years to figure this out.

Below is an example of how to render a portlet in Plone programmatically. This is useful when you want special page layouts and need to include portlet output from another part of the site.

  • The portlet machinery uses Zope's adapter pattern extensively. This allows you to override things based on the content context, HTTP request, etc.
  • A portlet is assigned to some context in some portlet manager
  • We can dig these assignments up by portlet assignment id (not user visible) or by portlet type (portlet assignment interface)
  • Each portlet has its own overridable renderer class

All this makes everything flexible, though still not flexible enough for some use cases (e.g. blacklisting portlets). The downside is that accessing things through many abstraction layers and plug-in points (adaptations) is a little cumbersome.

Here is sample code for digging up a portlet and calling its renderer:

        import Acquisition
        from zope.component import getUtility, getMultiAdapter, queryMultiAdapter
        from plone.portlets.interfaces import IPortletRetriever, IPortletManager, IPortletRenderer

        def get_portlet_manager(column):
            """ Return one of default Plone portlet managers.

            @param column: "plone.leftcolumn" or "plone.rightcolumn"

            @return: plone.portlets.interfaces.IPortletManagerRenderer instance
            """
            manager = getUtility(IPortletManager, name=column)
            return manager

        def render_portlet(context, request, view, manager, interface):
            """ Render a portlet defined in external location.

            .. note ::

                Portlets can be identified by id (not user visible)
                or interface (portlet class). This method supports look up
                by interface and will return the first matching portlet with this interface.

            @param context: Content item reference where portlet appear

            @param manager: IPortletManagerRenderer instance

            @param view: Current view or None if not available

            @param interface: Marker interface class we use to identify the portlet. E.g. IFacebookPortlet 

            @return: Rendered portlet HTML as a string, or empty string if portlet not found
            """    

            retriever = getMultiAdapter((context, manager), IPortletRetriever)

            portlets = retriever.getPortlets()

            assignment = None

            for portlet in portlets:

                # portlet is {'category': 'context', 'assignment': , 'name': u'facebook-like-box', 'key': '/isleofback/sisalto/huvit-ja-harrasteet
                # Identify portlet by interface provided by assignment
                if interface.providedBy(portlet["assignment"]):
                    assignment = portlet["assignment"]
                    break

            if assignment is None:
                # Did not find a portlet
                return ""

            # A special type of content provider, IPortletRenderer, knows how to
            # render each type of portlet. The IPortletRenderer should be a multi-adapter
            # from (context, request, view, portlet manager, data provider).

            renderer = queryMultiAdapter((context, request, view, manager, assignment), IPortletRenderer)

            if renderer is None:
                raise RuntimeError("No portlet renderer found for portlet assignment:" + str(assignment))

            # Make sure we have a working acquisition chain
            renderer = renderer.__of__(context)

            renderer.update()
            # Does not check visibility here... force render always
            html = renderer.render()

            return html

This is how you integrate it to your view class:

    def render_slope_info(self):
        """ Render a portlet from another page in-line to this page 

        Does not render other portlets in the same portlet manager.
        """
        context = self.context.aq_inner
        request = self.request
        view = self

        column = "isleofback.app.frontpageportlets"

        # Our custom interface marking a portlet
        from isleofback.app.portlets.slopeinfo import ISlopeInfo

        manager = get_portlet_manager(column)

        html = render_portlet(context, request, view, manager, ISlopeInfo)
        return html

…and this is how you call your view helper method from a TAL page template:

        <div tal:replace="structure view/render_slope_info" />


Lazily load elements becoming visible using jQuery

It is a useful trick to lazily load comments or similar elements at the bottom of the page: some elements can be loaded only when they are scrolled into view.

  • Not all users are interested in the information, and they do not necessarily read the article long enough to see it
  • By lazily loading such elements you can speed up the initial page load time
  • You save bandwidth
  • If you use AJAX for the dynamic elements of the page, you can more easily cache your pages in a static page cache (Varnish) even if the pages contain personalized bits

For example, Disqus does this (see the comments in the jQuery API documentation).

You can achieve this with the in-view plug-in for jQuery.

Below is an example for Plone that triggers loading of productappreciation_view when our placeholder div tag becomes visible.

...
<head>
  <script type="text/javascript" tal:attributes="src string:${portal_url}/++resource++your.app/in-view.js"></script>
</head>
...
<div id="comment-placeholder">

 <!-- Display spinning AJAX indicator gif until our AJAX call completes -->

 <p>
 <!-- Image is in Products.CMFPlone/skins/plone_images -->
 <img tal:attributes="src string:${context/@@plone_portal_state/portal_url}/spinner.gif" /> Loading comments
 </p>

 <!-- Hidden link to a view URL which will render the view containing the snippet for comments -->                       
 <a rel="nofollow" style="display:none" tal:attributes="href string:${context/absolute_url}/productappreciation_view" />

 <script>

 jq(document).ready(function() {

   // http://remysharp.com/2009/01/26/element-in-view-event-plugin/                                        
   jq("#comment-placeholder").bind("inview", function() {

     // This function is executed when the placeholder becomes visible

     // Extract URL from HTML page
     var commentURL = jq("#comment-placeholder a").attr("href");

     if (commentURL) {
     // Trigger AJAX call
       jq("#comment-placeholder").load(commentURL);
     }

   });                                     

 });     
 </script>
</div>
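
On the server side, productappreciation_view can be an ordinary browser view that renders only the comments fragment. Below is a minimal sketch; the class name and template file are hypothetical, and the view still needs a browser:page ZCML registration under the name productappreciation_view:

# A hypothetical server-side counterpart for the snippet above: a browser
# view returning only the comments fragment, suitable for jQuery .load().
from Products.Five.browser import BrowserView
from Products.Five.browser.pagetemplatefile import ViewPageTemplateFile

class ProductAppreciationView(BrowserView):
    """ Render the comments as a stand-alone HTML fragment (no main_template). """

    index = ViewPageTemplateFile("templates/productappreciation.pt")

    def __call__(self):
        return self.index()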


Installing and using Scrapy web crawler to search text on multiple sites

Here is a little script that uses Scrapy, a web crawling framework for Python, to search sites for occurrences of certain texts, including link content and PDFs. This is handy for cases where you need to find links violating a usage policy, trademarks which are not allowed, or just to see where your template output is being used. Our Scrapy example differs from a normal search engine in that it checks the HTML source code itself: you can also search for CSS classes, link targets and other elements which may be invisible to normal search engines.

Scrapy comes with a command-line tool and a project skeleton generator. You need to generate your own Scrapy project, to which you can then add your own spider classes.

Install Scrapy using Distribute (or setuptools):

easy_install Scrapy

Create project code skeleton:

scrapy startproject myscraper

Add your spider class skeleton by creating a file myscraper/spiders/spiders.py:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):
    """ Crawl through web sites you specify """

    name = "mycrawler"

    # Stay within these domains when crawling
    allowed_domains = ["www.mysite.com"]

    start_urls = [
        "http://www.mysite.com/",
    ]

    # Follow every link found on the crawled pages
    rules = [
        Rule(SgmlLinkExtractor(), follow=True)
    ]

Start Scrapy to check that it crawls properly. Run the following in the top-level directory:

scrapy crawl mycrawler

You should see output like:

2011-03-08 15:25:52+0200 [scrapy] INFO: Scrapy 0.12.0.2538 started (bot: myscraper)
2011-03-08 15:25:52+0200 [scrapy] DEBUG: Enabled extensions: TelnetConsole, SpiderContext, WebService, CoreStats, MemoryUsage, CloseSpider
2011-03-08 15:25:52+0200 [scrapy] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware

You can hit CTRL+C to interrupt scrapy.

Then let's enhance the spider a bit to search for blacklisted strings, with optional whitelisting, in myscraper/spiders/spiders.py. We also use the pyPdf library to look inside PDF files:

"""

        A sample crawler for seeking a text on sites.

"""

import StringIO

from functools import partial

from scrapy.http import Request

from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from scrapy.item import Item

def find_all_substrings(string, sub):
    """

http://code.activestate.com/recipes/499314-find-all-indices-of-a-substring-in-a-given-string/

    """
    import re
    starts = [match.start() for match in re.finditer(re.escape(sub), string)]
    return starts

class MySpider(CrawlSpider):
    """ Crawl through web sites you specify """

    name = "mycrawler"

    # Stay within these domains when crawling
    allowed_domains = ["www.mysite.com", "www.mysite2.com", "intranet.mysite.com"]

    start_urls = [
        "http://www.mysite.com/",
        "http://www.mysite2.com/",
        "http://intranet.mysite.com/"
    ]

    # Add our callback which will be called for every found link
    rules = [
        Rule(SgmlLinkExtractor(), follow=True, callback="check_violations")
    ]

    # How many pages crawled? XXX: Was not sure if CrawlSpider is a singleton class
    crawl_count = 0

    # How many text matches we have found
    violations = 0

    def get_pdf_text(self, response):
        """ Peek inside PDF to check possible violations.

        @return: PDF content as a searchable plain-text string
        """

        try:
                from pyPdf import PdfFileReader
        except ImportError:
                print "Needed: easy_install pyPdf"
                raise 

        stream = StringIO.StringIO(response.body)
        reader = PdfFileReader(stream)

        text = u""

        if reader.getDocumentInfo().title:
                # Title is optional, may be None
                text += reader.getDocumentInfo().title

        for page in reader.pages:
                # XXX: Does handle unicode properly?
                text += page.extractText()

        return text                                      

    def check_violations(self, response):
        """ Check a server response page (file) for possible violations """

        # Do some user visible status reporting
        self.__class__.crawl_count += 1

        crawl_count = self.__class__.crawl_count
        if crawl_count % 100 == 0:
                # Print some progress output
                print "Crawled %d pages" % crawl_count

        # Entries which are not allowed to appear in content.
        # These are case-sensitive
        blacklist = ["meat", "ham" ]

        # Entries which are allowed to appear. They are usually
        # non-human-visible data, like CSS classes, and may not be interesting business-wise
        exceptions_after = [ "meatball",
                             "hamming",
                             "hamburg"
                     ]

        # These are preceding contexts in which our match is allowed
        exceptions_before = [
                "bushmeat",
                "honeybaked ham"
        ]

        url = response.url

        # Check response content type to identify what kind of payload this link target is
        ct = response.headers.get("content-type", "").lower()
        if "pdf" in ct:
                # Assume a PDF file
                data = self.get_pdf_text(response)
        else:
                # Assume it's HTML
                data = response.body

        # Go through our search goals to identify any "bad" text on the page
        for tag in blacklist:

                substrings = find_all_substrings(data, tag)

                # Check entries against the exception list for "allowed" special cases
                for pos in substrings:
                        ok = False
                        for exception in exceptions_after:
                                sample = data[pos:pos+len(exception)]
                                if sample == exception:
                                        #print "Was whitelisted special case:" + sample
                                        ok = True
                                        break

                        for exception in exceptions_before:
                                sample = data[pos - len(exception) + len(tag): pos+len(tag) ]
                                #print "For %s got sample %s" % (exception, sample)
                                if sample == exception:
                                        #print "Was whitelisted special case:" + sample
                                        ok = True
                                        break
                        if not ok:
                                self.__class__.violations += 1
                                print "Violation number %d" % self.__class__.violations
                                print "URL %s" % url
                                print "Violating text:" + tag
                                print "Position:" + str(pos)
                                piece = data[pos-40:pos+40].encode("utf-8")
                                print "Sample text around position:" + piece.replace("\n", " ")
                                print "------"

        # We are not actually storing any data, return dummy item
        return Item()

    def _requests_to_follow(self, response):

        if getattr(response, "encoding", None) != None:
                # Server does not set encoding for binary files
                # Do not try to follow links in
                # binary data, as this will break Scrapy
                return CrawlSpider._requests_to_follow(self, response)
        else:
                return []

Let's tune down the logging output level, so we get only relevant data in the output. In myscraper/settings.py add:

LOG_LEVEL="INFO"

Now you can run the crawler and pipe the output to a text file:

scrapy crawl mycrawler > violations.txt
