Dealing with <dl>, <dt> and <dd> tags - Ruby

Hello. I’m trying to learn scraping by doing the one-day course and am stuck on a test Wikipedia scrape.



I need to loop through each letter to get each character’s name and description. I can’t figure out how to change my loop so that it goes to each “mw-headline” and loops through both the dt and the dd.

The structure is essentially:
<span class=“mw-headline” list of characters starting with “a”>

<dt>name 1</dt>
<dd>desc 1</dd>
<dt>name 2</dt>
<dd>desc 2</dd>
<dt>name 1</dt>
<dd>desc 1</dd>
<dt>name 2</dt>
<dd>desc 2</dd>

Hey @edmundtadros thanks for raising this in here.

Am I understanding that you want to capture the name and description of each character?

Looking at the page, each <dt> contains the name, and then the next <dd> contains the description. You can loop through all the <dt> and then capture the name and description using the next_element method provided by Nokogiri (the markup parser that Mechanize imports by default).

Try something like:

# Assumes `page` is the Mechanize page you have already fetched.
page.at('#mw-content-text').search('dt').each do |dt|
  character = {
    name: dt.text,
    description: dt.next_element.text
  }

  p character
  ScraperWiki.save_sqlite([:name], character)
end

Bonus points

You might have noticed that the <dt> sometimes includes notes about who portrayed the character, which you might not want to capture as part of the character name. There are a number of ways to process Strings in Ruby; have a look at the String methods, such as split. You can use split to break a String into an Array and then select the bit you want. Because the list uses different kinds of dashes and different kinds of whitespace characters to separate the name from the notes, you can’t just say .split(" – ")[0]. Instead, you can split on a regular expression:

name: dt.text.split(/\W(-|–|—)\W/)[0]

You can test out regular expressions on Rubular, a handy website for this.

Now you can also select the notes and save them if you want :sparkles:.
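
To see what the split gives you, here is a small sketch you can run on its own. The sample strings and the `split_character` helper are invented for illustration. Because the dash sits inside parentheses (a capture group), split keeps the dash itself as an element of the Array, so the name is at index 0, the dash at index 1, and the notes at index 2.

```ruby
# Same pattern as above: a dash with a non-word character on each side.
SEPARATOR = /\W(-|–|—)\W/

# Hypothetical helper (not part of any library): turns the text of a <dt>
# into a name plus optional notes.
def split_character(text)
  parts = text.split(SEPARATOR)
  {
    name: parts[0].to_s.strip,
    notes: parts[2]&.strip # nil when there is no dash in the text
  }
end

# Made-up examples of the kinds of strings a <dt> might contain.
samples = [
  "Name One – played by Actor A",
  "Name Two - played by Actor B",
  "Name Three"
]

samples.each { |text| p split_character(text) }
```

If the notes themselves contain another spaced dash, the Array will have more than three elements, so this sketch would only keep the first part of the notes.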

There are links to some of this stuff in the workshop notes.

Thanks very much. That all seems to work now. I added in a note field to save the third part of the array the split creates.