How to select all child text but exclude a tag with Scrapy's XPath?

Question

I have this html:

<div id="content">
    <h1>Title 1</h1><br><br>

    <h2>Sub-Title 1</h2>
    <br><br>
    Description 1.<br><br>Description 2.
    <br><br>

    <h2>Sub-Title 2</h2>
    <br><br>
    Description 1<br>Description 2<br>
    <br><br>

    <div class="infobox">
        <font style="color:#000000"><b>Information Title</b></font>
        <br><br>Long Information Text
    </div>
</div>

I want to get all text in <div id="content"> with XPath in Scrapy but excluding <div class="infobox">'s content, so the expected result is like this:

Title 1


Sub-Title 1


Description 1.

Description 2.


Sub-Title 2


Description 1.
Description 2.

But I haven't reached the excluding part yet. I'm still struggling to grab the text from the <div id="content">.

I have tried this:

response.xpath('//*[@id="content"]/text()').extract()

But it only returns Description 1. and Description 2. from under both sub-titles.

Then I tried:

response.xpath('//*[@id="content"]//*/text()').extract()

It only returns Title 1, Sub-Title 1, Sub-Title 2, Information Title, and Long Information Text.


So there are two questions here:

  1. How can I get all the text of the content div, including text inside its child elements?
  2. How can I exclude the infobox div from the selection?

Answer

Use the descendant:: axis to find descendant text nodes, and state explicitly that the parent of those text nodes must not be a div[@class='infobox'] element.

Turning the above into an XPath expression:

//div[@id = 'content']/descendant::text()[not(parent::div/@class='infobox')]

The result (tested with an online XPath tool) looks similar to the following. As you can see, the text content of div[@class='infobox'] no longer appears in it.

-----------------------
Title 1
-----------------------
-----------------------
Sub-Title 1
-----------------------
-----------------------
Description 1.
-----------------------
Description 2.
-----------------------
-----------------------
Sub-Title 2
-----------------------
-----------------------
Description 1
-----------------------
Description 2
-----------------------
-----------------------
-----------------------
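The empty rows in the listing above are whitespace-only text nodes (the line breaks and indentation between tags). Below is a minimal sketch of running the expression from Python and dropping those nodes. It assumes lxml is installed; Scrapy's selectors are built on lxml, so the same expression string should work in response.xpath(...) as well. The shortened fragment, in which the infobox text sits directly inside its div, and the names FRAGMENT, EXPR and texts are illustrative only.

```python
from lxml import html

# Shortened, illustrative fragment: the infobox text is a direct child of its div.
FRAGMENT = """
<div id="content">
    <h1>Title 1</h1><br><br>
    <h2>Sub-Title 1</h2>
    <br><br>
    Description 1.<br><br>Description 2.
    <div class="infobox">Long Information Text</div>
</div>
"""

EXPR = "//div[@id = 'content']/descendant::text()[not(parent::div/@class='infobox')]"

tree = html.fromstring(FRAGMENT)
raw_nodes = tree.xpath(EXPR)

# Strip surrounding whitespace and drop nodes that contained only whitespace.
texts = [t.strip() for t in raw_nodes if t.strip()]
print(texts)
```

In a Scrapy callback the equivalent would be to call .extract() on response.xpath(EXPR) and apply the same strip-and-filter step to the returned strings.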

What is wrong with your approaches?

Your first attempt:

//*[@id="content"]/text()

in plain English, means:

Look for any element (not necessarily a div) anywhere in the document, that has an attribute @id, its value being "content". For this element, return all its _immediate child text nodes_.

Problem: You are losing the text nodes that are not an immediate child of the outer div, since they are inside a child element of that div.


Your second attempt:

//*[@id="content"]//*/text()

Translates to:

Look for any element (not necessarily a div) anywhere in the document, that has an attribute @id, its value being "content". For this element, find every descendant element node and return the _immediate child text nodes_ of each of those descendant elements.

Problem: You are losing the immediate child text nodes of the div, since you are only looking at text nodes that are children of elements that are descendants of the div.
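To make the difference concrete, here is a small sketch that runs both attempts and the descendant:: axis against a shortened version of the HTML from the question. It assumes lxml is installed; the helper name texts and the variable names are made up for this illustration, and whitespace-only nodes are filtered out to keep the lists readable.

```python
from lxml import html

# Shortened version of the HTML from the question.
FRAGMENT = """
<div id="content">
    <h2>Sub-Title 1</h2>
    <br><br>
    Description 1.<br><br>Description 2.
    <div class="infobox">
        <font style="color:#000000"><b>Information Title</b></font>
        <br><br>Long Information Text
    </div>
</div>
"""

tree = html.fromstring(FRAGMENT)


def texts(expr):
    """Evaluate expr and return the non-blank text nodes, stripped."""
    return [t.strip() for t in tree.xpath(expr) if t.strip()]


# First attempt: only text nodes that are direct children of the content div.
children_only = texts('//*[@id="content"]/text()')

# Second attempt: only text nodes whose parent is a descendant element.
nested_only = texts('//*[@id="content"]//*/text()')

# descendant:: axis: every text node below the content div, at any depth.
all_descendants = texts('//*[@id="content"]/descendant::text()')

print(children_only)
print(nested_only)
print(all_descendants)
```

Each attempt returns one half of the text, and the descendant:: axis returns both halves in document order.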


EDIT:

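Responding to your comment: once the infobox text is wrapped in further elements (such as font and b), the parent of those text nodes is no longer the infobox div itself, so the predicate has to test the ancestor:: axis instead of parent:: :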

//div[@id = 'content']/descendant::text()[not(ancestor::div/@class='infobox')]
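The following sketch compares the parent:: and ancestor:: predicates on an infobox with nested markup like the one in the question. It assumes lxml is installed; the shortened fragment and the names BASE, texts, by_parent and by_ancestor are illustrative only.

```python
from lxml import html

# Shortened fragment with the nested infobox markup from the question.
FRAGMENT = """
<div id="content">
    <h1>Title 1</h1>
    <div class="infobox">
        <font style="color:#000000"><b>Information Title</b></font>
        <br><br>Long Information Text
    </div>
</div>
"""

tree = html.fromstring(FRAGMENT)
BASE = "//div[@id = 'content']/descendant::text()"


def texts(expr):
    """Evaluate expr and return the non-blank text nodes, stripped."""
    return [t.strip() for t in tree.xpath(expr) if t.strip()]


# parent:: only checks the immediate parent, so text inside <b> slips through.
by_parent = texts(BASE + "[not(parent::div/@class='infobox')]")

# ancestor:: checks every enclosing div, so the whole infobox is excluded.
by_ancestor = texts(BASE + "[not(ancestor::div/@class='infobox')]")

print(by_parent)
print(by_ancestor)
```

With parent::, "Information Title" survives because its parent is the b element; with ancestor::, everything inside the infobox is dropped regardless of nesting depth.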

In future questions, please make sure the HTML you show is _representative_ of your actual problem.

cc by-sa 3.0