Pulling from other scrapers, not returning everything

Good afternoon, folks.

I just built a scraper to pull together the output of a bunch of other scrapers, all with the exact same column structure.

It works, but it isn’t pulling everything: at least 400 rows are being discarded. The primary key is unique.
Any ideas why?


It is basically the example from the morph documentation:

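$apikey = getenv('MORPH_API_KEY');
$query = "select * from 'data'";

// Fetch the rows of the first source scraper and save them.
$url = "https://api.morph.io/lowndsy/PA-v6/data.json?key=" . urlencode($apikey) . "&query=" . urlencode($query);
$js = json_decode(file_get_contents($url), true);
foreach ($js as $line)
    scraperwiki::save(array('prikey'), $line);

// Same again for the second source scraper.
$url = "https://api.morph.io/lowndsy/PA-v7/data.json?key=" . urlencode($apikey) . "&query=" . urlencode($query);
$js = json_decode(file_get_contents($url), true);
foreach ($js as $line)
    scraperwiki::save(array('prikey'), $line);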



I’m not sure where “at least 400 rows” comes from.

I’m seeing 526 rows in https://morph.io/lowndsy/PA-v7 and 393 in https://morph.io/lowndsy/PA-v6, 919 in total. https://morph.io/lowndsy/PA_v1 mentions 396 and 526, for 922 in total, and the SQLite database has 820 rows, so that’s only about 100 missing?

I notice that the prikey field on PA-v6 looks weird: it contains strings like “18-06-2019 15:14AppleiPhone 8 dealsUpfront costFreeTotal cost£936.00Contract length24 monthsEE via Fonehouse”, which looks like the whole row rather than a hash of the row. I’m wondering if that could somehow be related?

I can’t see anything else that’s obviously wrong.

Sorry, those things you noticed were just me messing around to try to fix it after I made that post. I also cut down the number of scrapers it tries to retrieve and added a cumulative count to the output. At the moment it fetches two scrapers and expects 919 entries, but only commits 819 to the database.

It can’t be an issue with the actual data, because it works fine when I empty the db and fetch only one scraper, and the data came directly out of Morph anyway.

I tried modifying it to fetch one scraper at a time and noticed something odd going on with primary keys.

When I fetch just one scraper at a time, the first run commits everything (obviously), but when I change the target to the next scraper and re-run, it commits some rows and says the rest have updated existing records. The primary keys are totally unique, so that shouldn’t be possible.
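
One way to test whether the keys really are unique across sources is to compare the key values of the two result sets before saving anything. This is only a minimal sketch, assuming each source has already been decoded into an array of associative rows; the helper name find_shared_keys and the sample rows are made up for illustration.

```php
<?php
// Hypothetical helper: return the values of $column that occur in both row sets.
function find_shared_keys(array $rowsA, array $rowsB, $column = 'prikey')
{
    $keysA = array_column($rowsA, $column);
    $keysB = array_column($rowsB, $column);
    // array_intersect keeps the values of $keysA that also appear in $keysB.
    return array_values(array_unique(array_intersect($keysA, $keysB)));
}

// Made-up sample rows standing in for the decoded JSON of two scrapers.
$v6 = array(
    array('id' => 1, 'prikey' => 'aaa'),
    array('id' => 2, 'prikey' => 'bbb'),
);
$v7 = array(
    array('id' => 1, 'prikey' => 'ccc'),
    array('id' => 2, 'prikey' => 'bbb'),
);

print_r(find_shared_keys($v6, $v7));        // prikey values present in both sources
print_r(find_shared_keys($v6, $v7, 'id'));  // the same check on another column
```

Any value this reports would be overwritten rather than added when the second source is saved. The column argument lets the same check run against any other column that might be acting as a key, such as id.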

I understand why they are required but in this case it would be better if I didn’t have to declare a primary key at all.

Edit: My primary key is an MD5 hash of lots of key details (you saw the unhashed version earlier). Could some of the MD5s contain characters that Morph doesn’t like?
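
On the character question: PHP's md5() returns a 32-character string of lowercase hexadecimal digits, so a hashed key should contain nothing but 0-9 and a-f. A quick sketch for confirming that on a set of keys follows; the helper name is_md5_hex and the sample values are made up for illustration.

```php
<?php
// Hypothetical helper: true if $key has the shape of an md5() hex digest.
function is_md5_hex($key)
{
    return is_string($key) && preg_match('/\A[0-9a-f]{32}\z/', $key) === 1;
}

$keys = array(
    md5('18-06-2019 15:14' . 'Apple iPhone 8'),  // made-up key material, hashed
    '18-06-2019 15:14AppleiPhone 8 deals',       // an unhashed key
);

foreach ($keys as $key) {
    echo $key, ' => ', is_md5_hex($key) ? 'hex digest' : 'not a hex digest', PHP_EOL;
}
```

If every stored key passes this check, unusual characters in the hash are unlikely to be the cause.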

I’m not very familiar with PHP, or with the PHP scraperwiki library.

Looking at the code I’m wondering if it’s possible that some of the MD5 hashes are getting interpreted as DateTime values.

The next thing that catches my eye is the string interpolation used to create the SQL query here.

To get more detail about what’s happening, you can call R::debug(), which makes the SQL library print debugging info about what it’s doing. I’ve pasted some sample output below, and there is a sample run at https://morph.io/jamezpolley/PA_v1.

Morph will only capture the first 10,000 lines of output; it looks like this is generating about 25-30 lines of output for every record, so you might have to get a bit creative to get useful output that shows the records which are being skipped. If you can do anything to narrow it down (maybe look at the records that make it through and drop those from the source?) that should help.

 SELECT name FROM sqlite_master
 			WHERE type='table' AND name!='sqlite_sequence';<br>
 resultset: 1 rows<br>
 INSERT or REPLACE INTO data (id, datestamp, page, phone, position, dealfeatures, dealheadlines, data, monthly, upfront, total, contract, networkbrand, fullcode, prikey, notes) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)<br>
     [0] => 1350
     [1] => 19-06-2019 09:33
     [2] => https://www.uswitch.com/mobiles/samsung-galaxy-note-9-deals/?variant=128gb-black&data=30000&monthly_cost=0-40&upfront_cost=0-150&sort_by=monthly_cost&resellers=true&networks=ee
     [3] => SamsungGalaxy Note 9 deals
     [4] => 6
     [5] => Upfront costFreeTotal cost£804.00Contract length24 months£60 cashback 
     [6] =>  30GB data £ 36.00 per month 
     [7] =>  30GB data 
     [8] => £ 36.00 per month 
     [9] => Upfront costFree
     [10] => Total cost£804.00
     [11] => Contract length24 months
     [12] => EE via Buy Mobile Phones
     [13] => EE via Buy Mobile PhonesSee Deal 30GB data £ 36.00 per month Upfront costFreeTotal cost£804.00Contract length24 months£60 cashback 
     [14] => 9ba71fa19951ae2a2e23d2007a96b6ee
     [15] => 
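
To narrow down which records are being skipped, one option is to diff the keys the sources returned against the keys that actually ended up in the combined table. This is a rough sketch, assuming both sides are available as arrays of rows (for example, the decoded source JSON and the rows selected from the destination table); missing_keys is a made-up name and the sample data is invented.

```php
<?php
// Hypothetical helper: keys present in the source rows but absent from the saved rows.
function missing_keys(array $sourceRows, array $savedRows, $column = 'prikey')
{
    $expected = array_column($sourceRows, $column);
    $saved    = array_column($savedRows, $column);
    return array_values(array_diff($expected, $saved));
}

// Made-up sample data.
$source = array(
    array('prikey' => 'aaa', 'phone' => 'A'),
    array('prikey' => 'bbb', 'phone' => 'B'),
    array('prikey' => 'ccc', 'phone' => 'C'),
);
$saved = array(
    array('prikey' => 'aaa', 'phone' => 'A'),
    array('prikey' => 'ccc', 'phone' => 'C'),
);

foreach (missing_keys($source, $saved) as $key) {
    echo "missing: $key", PHP_EOL;  // only the skipped keys are printed
}
```

Printing only the missing keys, rather than full debug output for every record, should keep the log well under the 10,000-line limit.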

Hi Steve,

Did you make any progress on this?