Harvesting data from websites using WebKit and PyQt4 - part 2
Extracting data from parsed HTML code of a website with WebKit and PyQt4
In this tutorial we will create parsers that extract ad URLs from the parsed page source (the DOM tree). Flash ads commonly receive a "clickTag" variable, which holds the URL the flash ad redirects to. Advertising companies that use text or images as ads in their JS widgets need a different approach: we have to analyze the parsed HTML code to find a pattern that allows regex extraction of the URLs. The AdTaily.com widget, for example, emits HTML code with a very distinctive set of styles, which means we can easily extract the URLs (and only the URLs that come from the AdTaily widget).
- In the folder where my app is located I've created parsers.py for the parsers.
- Below is the AdTaily parser. It gets the full HTML of a web page and extracts the data from it with a regular expression:
- Import parsers in run.py and modify the loadFinished method:
- We pass the rendered HTML code of the loaded page to get_adtaily and print the list of URLs it finds. If you run the app on a web page that contains the AdTaily widget (like www.python.rk.edu.pl), you should see the URLs of the ads from this widget.
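As a rough illustration of the regex approach described above, here is a minimal sketch of what a get_adtaily function in parsers.py could look like. The widget's real markup is not reproduced here, so the style string ADTAILY_STYLE is a made-up placeholder and the pattern is only an assumption about the general shape of such a parser; swap in the styles you actually see in the rendered HTML.

```python
import re

# HYPOTHETICAL marker: stands in for the distinctive style string that the
# real widget puts on its ad links. Replace it with what you find in the
# rendered HTML of the page you are harvesting.
ADTAILY_STYLE = 'style="display: block; border: 0 none;"'

# An <a> tag whose href is followed, inside the same tag, by the marker
# style. Group 1 captures the URL.
ADTAILY_LINK = re.compile(
    r'<a\s[^>]*?href="([^"]+)"[^>]*?' + re.escape(ADTAILY_STYLE),
    re.IGNORECASE,
)


def get_adtaily(html):
    """Return the list of ad URLs found in the rendered HTML string."""
    return ADTAILY_LINK.findall(html)


# Small demo: one link carries the marker style, the other does not.
sample = (
    '<a href="http://example.com/ad1" ' + ADTAILY_STYLE + '>ad</a>'
    '<a href="http://example.com/other">not an ad</a>'
)
print(get_adtaily(sample))
```

Because the pattern requires the style marker inside the same tag as the href, ordinary links on the page are skipped. If loadFinished hands over a QString rather than a Python string, it probably needs converting before it is passed to the parser.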
Flash ads - clickTag
- For flash ads, "clickTag" (or "click", "gaadlink", etc.) is used to pass the redirect URL. The URL goes through a click-counting script that redirects to the final site. The embed tag may look like this:
- The parser for "clickTag" will look like this:
- You have to check how URLs are passed to flash ads on the sites you want to harvest. (unquote comes from urllib; in Python 3 it lives in urllib.parse.)
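A minimal sketch of such a clickTag parser is shown below. The function name get_clicktag, the sample embed tag and the exact pattern are assumptions made for illustration; real sites may pass the URL differently, so treat this as a starting point.

```python
import re

try:
    from urllib import unquote  # Python 2
except ImportError:
    from urllib.parse import unquote  # Python 3

# Parameter names that may carry the redirect URL (the three named above).
CLICK_PARAMS = ("clickTag", "click", "gaadlink")

# name=value, where the value runs up to the next '&', quote, '>' or space.
CLICKTAG = re.compile(
    r'\b(?:%s)=([^&"\'\s>]+)' % "|".join(CLICK_PARAMS),
    re.IGNORECASE,
)


def get_clicktag(html):
    """Return the decoded redirect URLs passed to flash ads in the HTML."""
    return [unquote(value) for value in CLICKTAG.findall(html)]


# Hypothetical embed tag with a URL-encoded clickTag value.
sample = (
    '<embed src="http://ads.example.com/banner.swf'
    '?clickTag=http%3A%2F%2Fads.example.com%2Fclick%3Fid%3D1" '
    'type="application/x-shockwave-flash">'
)
print(get_clicktag(sample))
```

The value is URL-encoded inside the tag, which is why unquote is applied to each match before it is returned.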
Saving data to a database
In Qt, databases are handled by the QtSql component. We can use it in PyQt4, but its API is not compatible with the standard DB-API used by Python database modules.
- Import the module:
from PyQt4.QtSql import *
- Connect to the database in __init__: we select the driver (SQLite), then specify the database name and connect.
- We also need a table for the data. This one will do:
You can create this table using the command-line tools for SQLite:
- We have the database, so we can start inserting data: we create the object query = QSqlQuery(self.db) and execute a query with its exec_ method. The URL will be saved in the DB; if an error occurs, we will see the error message.
- Our example can be improved by blocking duplicate links for the current day: In this snippet we select some data. To iterate over the results use query.next(), which returns True if it has moved to another row (use while query.next() for many rows). In our case we select one row. The method query.value(index) returns the result data as a QVariant object, so we have to convert it to a string or an int.
- In the third part we will create a second app that loads the saved URLs and updates their entries with the final site title and URL.
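To recap the SQL side of the database steps above, here is a small sketch of the table, the insert and the per-day duplicate check. Note that it deliberately uses Python's standard sqlite3 module instead of QtSql, so that it runs without Qt; the table name ads, its columns and the helper save_url are hypothetical and are not taken from the app.

```python
import sqlite3
from datetime import date

# HYPOTHETICAL schema, used only for this illustration.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS ads (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT NOT NULL,
    added TEXT NOT NULL
)
"""

COUNT_FOR_DAY = "SELECT COUNT(*) FROM ads WHERE url = ? AND added = ?"
INSERT_AD = "INSERT INTO ads (url, added) VALUES (?, ?)"


def save_url(conn, url, day=None):
    """Insert url unless it is already stored for the given day.

    Returns True if a row was inserted, False if it was a duplicate.
    """
    day = (day or date.today()).isoformat()
    # COUNT(*) always yields exactly one row holding one integer.
    (count,) = conn.execute(COUNT_FOR_DAY, (url, day)).fetchone()
    if count:
        return False
    conn.execute(INSERT_AD, (url, day))
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")  # in-memory database, for the demo only
conn.execute(CREATE_TABLE)
print(save_url(conn, "http://example.com/ad1"))  # first save for today
print(save_url(conn, "http://example.com/ad1"))  # same URL, same day
```

With QtSql the same statements should carry over: QSqlQuery offers prepare(), addBindValue() and exec_() for the parameterized queries, and the count can be read with query.next() followed by query.value(0), converted from QVariant as described above.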