Nepali text not being saved

tmtmtmtm · August 1, 2015, 12:38pm

With https://morph.io/tmtmtmtm/nepal-ca-members, data in Nepali is being lost when it’s written to the database. As you can see from the console, the script is handling it correctly, but ScraperWiki.save_sqlite seems to be losing it, as it doesn’t get stored in the database (I checked by downloading the entire database file in case it was just a display error).

I’ve certainly had to jump through interesting hoops in the past to get Ruby to correctly process text in some encodings. As far as I can remember, though, once I got it processing correctly in Ruby-land, I didn’t have to do anything special with it in morph/scraperwiki/sqlite-space. Perhaps there’s an option somewhere that I should be setting but can’t find?

Any suggestions?

equivalentideas · August 3, 2015, 6:13am

Hi @tmtmtmtm thanks for this!

I’ve forked your scraper and, like yours, the Nepali text doesn’t display on the scraper page or appear in the API output. But when I download the SQLite database the characters are there. Weird.

Here’s that scraper https://morph.io/equivalentideas/nepal-ca-members

I’m not sure what’s going on, but I hope that helps a little. @henare any thoughts here?

equivalentideas · August 3, 2015, 7:16am

I can see the characters in your SQLite DB as well. So the characters are being collected; they’re just not displayed when the data gets rendered in Ruby as CSV or JSON, or on the live site. This is a bug in morph.io, but the data is there in the scraper.

I’ve made an issue for this: https://github.com/openaustralia/morph/issues/883. We’ll get onto this as soon as possible. @tmtmtmtm thanks again for raising this.

henare · August 4, 2015, 1:41pm

OK I’ve looked into this a bit.

The reason you see it in the database and not in the morph.io interface is because of this magic we do to handle invalid UTF-8 data in the SQLite databases.

So that raises the question: why isn’t this being stored as UTF-8? After quite a bit of investigation, it appears that open-uri and the other libraries I’ve tried, such as mechanize, do not open the Google CSV file as UTF-8. They open it as ASCII-8BIT, which then flows through to CSV and down to the SQLite database.
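
For anyone hitting the same thing, here is a minimal sketch of what the ASCII-8BIT label means and one possible way to correct it. It uses no network access: the `raw` string merely simulates a response body, and `relabel_as_utf8` is a hypothetical helper name, not something from open-uri, mechanize or the scraper. It assumes the bytes really are UTF-8 and only carry the wrong label.

```ruby
# Hypothetical helper: relabel bytes we believe are UTF-8 without changing them.
# Returns nil if the bytes turn out not to be valid UTF-8.
def relabel_as_utf8(bytes)
  text = bytes.dup.force_encoding(Encoding::UTF_8)
  text.valid_encoding? ? text : nil
end

# The word "Nepal" in Devanagari, written with escapes so the file stays ASCII.
nepali = "\u0928\u0947\u092A\u093E\u0932"

# Simulate a response body: the same bytes, but labelled as binary (ASCII-8BIT).
raw = nepali.dup.force_encoding(Encoding::ASCII_8BIT)

relabelled = relabel_as_utf8(raw)

puts raw.bytes == nepali.bytes    # the bytes are identical, only the label differs
puts relabelled.valid_encoding?   # the relabelled string is valid UTF-8
puts relabelled == nepali         # and it compares equal to the original text
```

`force_encoding` only changes the label on the string, so it is safe only when the source really does serve UTF-8.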

I tried a few quick hacks but I’ve not got a working solution yet.

henare · August 4, 2015, 11:34pm

@tmtmtmtm Worked it out. Pull request opened: https://github.com/tmtmtmtm/nepal-ca-members/pull/1

tmtmtmtm · August 5, 2015, 5:52am

Thanks — that seems to be working now. But I’m still very confused as to what was happening. If the data was making it to the database, why could it not come back out via the API?

I certainly appreciate the fix to the individual scraper, as it lets me get on in the meantime, but something seems slightly off somewhere deeper here. If anything, the original diagnosis seems back to front. If the data is getting into the database but not back out, and if (as it looks at a glance) the magic linked to above is only being applied on data read and not on write, then I’d be suspicious that the magic is the cause of the failing case, rather than of the successful one… [NB: I haven’t actually looked at the database code in any depth, and there could be all sorts of things going on elsewhere that I haven’t seen; this is just at a glance]
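
To illustrate how a read-side conversion could produce exactly this symptom, here is a minimal sketch. It is an assumption made for illustration, not morph.io's actual code: `strip_unconvertible` is a hypothetical name, and the `stored` string merely simulates a value that comes back from the database labelled as ASCII-8BIT.

```ruby
# Hypothetical read-side clean-up, NOT morph.io's real implementation:
# convert to UTF-8 and silently drop anything that cannot be converted.
def strip_unconvertible(value)
  value.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "")
end

# A stored value whose bytes are valid UTF-8 but which is labelled as binary.
stored = "name: \u0928\u0947\u092A\u093E\u0932".dup.force_encoding(Encoding::ASCII_8BIT)

cleaned = strip_unconvertible(stored)

# Every byte above 0x7F has no defined conversion from ASCII-8BIT to UTF-8,
# so the Nepali text is dropped and only the ASCII prefix survives.
puts cleaned.inspect
puts stored.bytesize - cleaned.bytesize  # number of bytes lost on read
```

If something along these lines runs on read, the bytes in the SQLite file would stay intact while the API output loses them, which would match what we’re seeing.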

henare · August 6, 2015, 1:12am

@tmtmtmtm yes, you could very well be right. The issue has been reopened to track this problem: https://github.com/openaustralia/morph/issues/883