
Harvesting data from websites using WebKit and PyQt4 - part 2

Extracting data from parsed HTML code of a website with WebKit and PyQt4

In this tutorial we will create parsers that extract ad URLs from the parsed page source (the DOM tree). Flash ads commonly receive a "clickTag" variable, which holds the URL the Flash ad redirects to when clicked. For advertising companies whose JS widgets serve text or image ads, we have to analyze the parsed HTML code to find a pattern that allows the URLs to be extracted with a regular expression. For example, the AdTaily.com widget inserts HTML code like this:
<a style="position: relative; font-weight: normal; text-align: left; background-image: none; background-repeat: initial; background-attachment: initial; -webkit-background-clip: initial; -webkit-background-origin: initial; background-color: initial; padding-left: 0px; padding-right: 0px; padding-top: 0px; padding-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; margin-bottom: 10px; display: block; width: 125px; height: 125px; background-position: initial initial; " href="http://www.megiteam.pl/" title="Hosting nowych technologii" rel="nofollow" target="_blank">
This highly distinctive set of styles means we can easily extract the URL, and the pattern matches only links that come from the AdTaily widget.

AdTaily Parser
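
As a rough illustration of the regex approach described above, the sketch below looks for anchors whose inline style contains the "display: block; width: 125px; height: 125px;" fragment seen in the sample and captures the href that follows. The function name extract_adtaily_urls and the choice of that particular style fragment as the fingerprint are assumptions made for this example; the real widget markup may differ or change over time, so the pattern would need checking against the live page source.

```python
import re

# Assumed fingerprint: the style fragment taken from the sample anchor above.
# The attribute order (style first, then href) also follows that sample.
ADTAILY_LINK = re.compile(
    r'<a\s+style="[^"]*display:\s*block;\s*width:\s*125px;\s*height:\s*125px;[^"]*"\s+'
    r'href="([^"]+)"',
    re.IGNORECASE,
)


def extract_adtaily_urls(html):
    """Return the href of every anchor that matches the assumed AdTaily style."""
    return ADTAILY_LINK.findall(html)
```

The html argument would be the parsed page source obtained from WebKit as a plain Python string. Ordinary links on the page are ignored because they lack the style fragment.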

Flash ads - clickTag
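
For Flash ads, one possible sketch is to search the parsed page source for a clickTag=... pair, whether it appears in a flashvars value or in the query string of the SWF address, and URL-decode the value. The function name extract_clicktag_urls is hypothetical, and the code assumes the value is URL-encoded and ends at the next "&", quote or whitespace, which will not hold for every ad server.

```python
import re

try:
    from urllib.parse import unquote  # Python 3
except ImportError:
    from urllib import unquote  # Python 2

# Case-insensitive because ad servers spell the variable clickTag, clickTAG, etc.
# Assumption: the value stops at the next '&', quote, whitespace or angle bracket.
CLICKTAG = re.compile(r'clickTag=([^&"\'\s<>]+)', re.IGNORECASE)


def extract_clicktag_urls(html):
    """Return the URL-decoded clickTag values found in the page source."""
    return [unquote(value) for value in CLICKTAG.findall(html)]
```

A clickTag value that is not URL-encoded and contains its own "&" would be cut short by this pattern, so the results are best treated as candidates to verify.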

Saving data to a database

In Qt, databases are handled by the QtSql module. We can use it in PyQt4, but its API is not compatible with the Python DB API (PEP 249) that standard Python database modules follow.
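
If DB API compatibility matters more than staying inside Qt, one alternative is a DB API module such as sqlite3 from the standard library. The sketch below is only an illustration of that route, not necessarily the storage used in this tutorial; the table name ad_urls, its columns and the function save_urls are assumptions made for the example.

```python
import sqlite3


def save_urls(connection, page_url, ad_urls):
    """Store each (page_url, ad_url) pair once; duplicates are silently skipped."""
    # Hypothetical schema: one row per ad URL found on a given page.
    connection.execute(
        "CREATE TABLE IF NOT EXISTS ad_urls ("
        "page_url TEXT NOT NULL, "
        "ad_url TEXT NOT NULL, "
        "UNIQUE (page_url, ad_url))"
    )
    # Parameter placeholders let sqlite3 handle quoting of the URL strings.
    connection.executemany(
        "INSERT OR IGNORE INTO ad_urls (page_url, ad_url) VALUES (?, ?)",
        [(page_url, ad_url) for ad_url in ad_urls],
    )
    connection.commit()
```

A connection would be opened with sqlite3.connect() on a file path (or ":memory:" for a quick experiment) and passed in together with the list of URLs returned by a parser.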

Source code


PyQt and GUI, 10 November 2009
