Problem:
I need to loop through each letter to get the name and description of every character. I can’t figure out how to change my loop so that it goes to each “mw-headline” and then loops through both the dt and the dd.
The structure is essentially:
<span class=“mw-headline” list of characters starting with “a”>
Am I right in understanding that you want to capture the name and description of each character?
Looking at the page, each <dt> contains the name, and the <dd> that follows it contains the description. You can loop through all the <dt> elements and capture the name and description using the next_element method provided by Nokogiri (the markup parser that Mechanize uses by default).
Try something like:
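page.at('#mw-content-text').search('dt').each do |dt|
  # Each <dt> holds the name; the element right after it is the <dd> with the description
  character = {
    name: dt.text,
    description: dt.next_element.text
  }
  p character
  # Save the record, using the name as the unique key
  ScraperWiki.save_sqlite([:name], character)
end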
Bonus points
You might have noticed that the <dt> sometimes includes notes about who portrayed the character, and you might not want to capture these as part of the character name. There are a number of ways to process Strings in Ruby; have a look at the Ruby String methods, such as split. You can use split to break a String into an Array of pieces and then select the piece you want. Because the list uses different kinds of dashes and different kinds of whitespace characters to separate the name from the notes, you can’t just say .split(" – ")[0]. Instead, you can use a regular expression:
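As a rough sketch of that idea: split on any run of whitespace, then a hyphen or dash, then more whitespace, and keep the first piece. The SEPARATOR pattern, the character_name helper and the sample strings here are my own assumptions for illustration, not taken from the page, so check them against the real <dt> text. I've used [[:space:]] rather than \s because in Ruby \s only matches ASCII whitespace, while [[:space:]] also matches non-breaking spaces.

```ruby
# Hypothetical separator: whitespace, then a hyphen, en dash (U+2013)
# or em dash (U+2014), then whitespace again.
SEPARATOR = /[[:space:]]+[-\u2013\u2014][[:space:]]+/

# Hypothetical helper: returns the part of the <dt> text before the
# first separator, or the whole text if no separator is found.
def character_name(dt_text)
  dt_text.split(SEPARATOR, 2).first.to_s.strip
end

# Made-up sample strings standing in for dt.text
puts character_name("Alice Example \u2013 portrayed by Some Actor")
puts character_name("Jean-Paul Example\u00A0\u2014\u00A0some notes")
puts character_name("Plain Name")
```

A hyphen inside a name (such as "Jean-Paul") is left alone because the pattern requires whitespace on both sides of the dash. In the loop above you would then use name: character_name(dt.text).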