Harvesting data from websites using WebKit and PyQt4 - part 3
Gathering final data for ad URLs (final page title and URL) using PyQt4 with the QtSql module.
In this tutorial we will make a second app that gets the final data for ad URLs (title and URL). The application has to load the URLs from the database and then update each entry.
Ad URL Parser
The GUI looks like this:
- "Start" (startButton, pushButton widget), which starts loading the URLs
- progressBar - shows how many URLs are done
- webView - widget for loading pages
Convert the UI file to a Python module:
pyuic4 parser.ui > parser.py
Then create run.py with the skeleton code. It will start the app, but won't do anything yet. We have to add a webpage loader, which brings some new things:
- I've created a class called FakeBrowser, which inherits from QWebPage and overrides the userAgentForUrl method. In __init__ I use this class instead of the standard QWebPage. Doing so changes the user agent of the browser (it's good to mark this app as a bot, as no one likes fake ad clicks).
- In __init__ we also fetch the URIs to load and put them in self.URIs
- Method __getNextUrl returns the last element of that list, if one exists
- Method loadFinished runs after a page is loaded. We get the final page title and URL and update the DB row (using the ID from self.nexturl). Note: you may not get a title for every URL (the ad URL may be invalid, inactive, etc.)
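
The queue-and-update flow from the list above can be tried out without Qt. The following is only a minimal sketch under stated assumptions, not the tutorial's run.py: it swaps QtSql for Python's built-in sqlite3 module, and the table name `ads`, its columns (`id`, `url`, `title`, `final_url`), the class name `AdUrlQueue` and the method names are all hypothetical. In the real app, `load_finished` would be a slot connected to the web view's loadFinished signal, and the title and URL would come from the loaded page instead of being passed in.

```python
import sqlite3


class AdUrlQueue(object):
    """Qt-free sketch of the load/update loop (all names are hypothetical)."""

    def __init__(self, connection):
        self.db = connection
        # Fetch (id, url) pairs that still have no final data.
        cursor = self.db.execute(
            "SELECT id, url FROM ads WHERE final_url IS NULL ORDER BY id")
        self.uris = cursor.fetchall()
        self.total = len(self.uris)
        self.done = 0
        self.nexturl = None

    def get_next_url(self):
        # Take the last element of the list, or None when nothing is left.
        self.nexturl = self.uris.pop() if self.uris else None
        return self.nexturl

    def load_finished(self, title, final_url):
        # Stand-in for the loadFinished slot: store the result for the
        # row identified by self.nexturl, then move on to the next URL.
        ad_id = self.nexturl[0]
        self.db.execute(
            "UPDATE ads SET title = ?, final_url = ? WHERE id = ?",
            (title, final_url, ad_id))
        self.db.commit()
        self.done += 1
        return self.get_next_url()

    def progress(self):
        # Percentage suitable for a progress bar.
        return int(100 * self.done / self.total) if self.total else 100


def demo():
    # In-memory database with two made-up ad URLs.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE ads (id INTEGER PRIMARY KEY, url TEXT, "
               "title TEXT, final_url TEXT)")
    db.executemany("INSERT INTO ads (url) VALUES (?)",
                   [("http://example.com/ad1",), ("http://example.com/ad2",)])
    queue = AdUrlQueue(db)
    current = queue.get_next_url()
    while current is not None:
        ad_id, url = current
        # Pretend the browser followed the redirects and loaded a page.
        current = queue.load_finished("Title %d" % ad_id, url + "/landing")
    return db.execute(
        "SELECT id, title, final_url FROM ads ORDER BY id").fetchall()
```

Calling `demo()` processes both rows, last one first, and returns the updated table contents. Popping from the end of the list keeps "get the next URL" cheap and makes an empty list the natural stop condition.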