With https://morph.io/tmtmtmtm/nepal-ca-members, data in Nepali is being lost when it’s written to the database. As you can see from the console, the script is handling it correctly, but ScraperWiki.save_sqlite seems to be losing it, as it doesn’t get stored in the database (I checked by downloading the entire file, in case it was just a display error).
I’ve certainly had to jump through interesting hoops in the past to get Ruby to process text in some encodings correctly. As far as I can remember, though, once I got it processing correctly in Ruby-land, I didn’t have to do anything special with it in morph/scraperwiki/sqlite-space. Perhaps there’s an option somewhere that I should be setting but can’t find?
I can see the characters in your SQLite DB as well. So the characters are being collected; they’re just not displayed when the data gets rendered in Ruby as CSV or JSON, or on the live site. This is a bug in morph.io, but the data is there in the scraper.
The reason you see it in the database and not in the morph.io interface is because of this magic we do to handle invalid UTF-8 data in the SQLite databases.
So that begs the question, why isn’t this being stored as UTF-8? After quite a bit of investigation, it appears that open-uri and the other libraries I’ve tried, such as mechanize, do not open the Google CSV file as UTF-8. They open it as ASCII-8BIT, which then flows through to CSV and down to the SQLite database.
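As a rough illustration of the symptom (not the actual scraper code), the sketch below simulates a response body whose bytes are valid UTF-8 but which Ruby has tagged as ASCII-8BIT, and then relabels it with `String#force_encoding`. The helper name `relabel_as_utf8` and the sample string are made up for the example, and it assumes the fetched bytes really are UTF-8.

```ruby
# Simulate a body read without a charset: the bytes are valid UTF-8,
# but the string is tagged ASCII-8BIT (binary).
raw = "नेपाल".dup.force_encoding(Encoding::ASCII_8BIT)

# Hypothetical helper: relabel the bytes as UTF-8 without changing them,
# and refuse input that is not actually valid UTF-8.
def relabel_as_utf8(str)
  fixed = str.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, "bytes are not valid UTF-8" unless fixed.valid_encoding?
  fixed
end

text = relabel_as_utf8(raw)

puts raw.encoding   # the binary tag the library left on the string
puts text.encoding  # the tag after relabelling
puts text
```

`force_encoding` only changes the label on the string; the underlying bytes are untouched, which is why the check with `valid_encoding?` is worth keeping.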
I tried a few quick hacks but I’ve not got a working solution yet.
Thanks — that seems to be working now. But I’m still very confused as to what was happening. If the data was making it to the database, why could it not come back out via the API?
I certainly appreciate the fix to the individual scraper, as it lets me get on in the meantime, but something seems slightly off somewhere deeper here. If anything, the original diagnosis seems back to front: if the data is getting into the database but not back out, and if (as it looks at a glance) the magic linked to above is only being applied on read, and not on write, then I’d suspect that magic of being the cause of the failing case, rather than of the successful one… [NB: I haven’t actually looked at the database code in any depth, and there could be all sorts of things going on elsewhere that I haven’t seen; this is just at a glance]
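To make that suspicion concrete, here is a minimal sketch of how a read-side cleanup step could drop valid data in plain Ruby. This is not morph.io’s code, which I haven’t checked; `lossy_to_utf8` and `relabel_and_scrub` are hypothetical helpers, and the stored value is invented for the example.

```ruby
# A value whose bytes are valid UTF-8 but which is tagged ASCII-8BIT,
# as it might be after being read back without encoding information.
stored = "name: नेपाल".dup.force_encoding(Encoding::ASCII_8BIT)

# Hypothetical cleanup 1: transcode from the current (binary) encoding to
# UTF-8, dropping anything that cannot be converted. Bytes above 0x7F have
# no mapping from ASCII-8BIT, so every non-ASCII character is removed.
def lossy_to_utf8(str)
  str.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "")
end

# Hypothetical cleanup 2: treat the bytes as UTF-8 first, then remove only
# the sequences that are genuinely invalid.
def relabel_and_scrub(str)
  str.dup.force_encoding(Encoding::UTF_8).scrub("")
end

lossy = lossy_to_utf8(stored)
kept  = relabel_and_scrub(stored)

puts lossy.inspect
puts kept.inspect
```

With the first helper only the ASCII prefix survives, while the second keeps the Nepali text intact. If something like the first is applied on read, it would explain data that is present in the database but missing from the API output.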