I have this HTML:
<div id="content">
<h1>Title 1</h1><br><br>
<h2>Sub-Title 1</h2>
<br><br>
Description 1.<br><br>Description 2.
<br><br>
<h2>Sub-Title 2</h2>
<br><br>
Description 1<br>Description 2<br>
<br><br>
<div class="infobox">
<font style="color:#000000"><b>Information Title</b></font>
<br><br>Long Information Text
</div>
</div>
I want to get all text in <div id="content"> with XPath in Scrapy, but excluding the content of <div class="infobox">, so the expected result is like this:
Title 1
Sub-Title 1
Description 1.
Description 2.
Sub-Title 2
Description 1
Description 2
But I haven't reached the excluding part yet; I'm still struggling to grab the text from the <div id="content">.
I have tried this:
response.xpath('//*[@id="content"]/text()').extract()
But it only returns the "Description 1" and "Description 2" text from under both sub-titles.
Then I tried:
response.xpath('//*[@id="content"]//*/text()').extract()
It only returns Title 1, Sub-Title 1, Sub-Title 2, Information Title, and Long Information Text.
So there are two questions here:
1. How do I get all text in the content div?
2. How do I exclude the infobox div from the selection?
Use the descendant:: axis to find descendant text nodes, and state explicitly that the parent of those text nodes must not be a div[@class='infobox'] element.
Turning the above into an XPath expression:
//div[@id = 'content']/descendant::text()[not(parent::div/@class='infobox')]
Then, the result is similar to the following (I tested with an online XPath tool). As you can see, the text content of div[@class='infobox'] no longer appears in the result.
-----------------------
Title 1
-----------------------
-----------------------
Sub-Title 1
-----------------------
-----------------------
Description 1.
-----------------------
Description 2.
-----------------------
-----------------------
Sub-Title 2
-----------------------
-----------------------
Description 1
-----------------------
Description 2
-----------------------
-----------------------
-----------------------
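The empty entries between the separators are whitespace-only text nodes. A minimal sketch of one way to drop them, assuming lxml is installed (it is used here as a stand-in for Scrapy's response.xpath) and a reduced version of the sample in which the infobox text is a direct child of the infobox div. SAMPLE, nodes and texts are names made up for this illustration:

```python
from lxml import html

# Reduced sample markup. Assumption for this sketch: the infobox text is a
# direct child of the infobox div, so the parent:: test is enough.
SAMPLE = """<div id="content">
<h1>Title 1</h1><br><br>
<h2>Sub-Title 1</h2>
<br><br>
Description 1.<br><br>Description 2.
<br><br>
<div class="infobox">Long Information Text</div>
</div>"""

doc = html.fromstring(SAMPLE)

# [normalize-space()] keeps only text nodes that contain something other
# than whitespace; the second predicate is the exclusion from the answer.
nodes = doc.xpath(
    "//div[@id='content']/descendant::text()"
    "[normalize-space()][not(parent::div/@class='infobox')]"
)

# The kept nodes can still carry leading or trailing newlines, so trim them.
texts = [node.strip() for node in nodes]
print(texts)
# ['Title 1', 'Sub-Title 1', 'Description 1.', 'Description 2.']
```

In a Scrapy callback the same expression string could be passed to response.xpath(...).extract(), with the strip step applied to the returned list.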
What is wrong with your approaches?
Your first attempt:
//*[@id="content"]/text()
in plain English, means:
Look for any element (not necessarily a div) anywhere in the document that has an @id attribute whose value is "content". For this element, return all its _immediate child text nodes_.
Problem: You are losing the text nodes that are not an immediate child of the outer div, since they are inside a child element of that div.
Your second attempt:
//*[@id="content"]//*/text()
translates to:
Look for any element (not necessarily a div) anywhere in the document that has an @id attribute whose value is "content". For this element, find any descendant element node and return all text nodes that are children of that descendant element.
Problem: You are losing the immediate child text nodes of the div, since you are only looking at text nodes that are children of elements that are descendants of the div.
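The difference between the two attempts can be reproduced outside Scrapy. This is a minimal sketch, assuming lxml is installed (used here as a stand-in for Scrapy's response.xpath) and a cut-down version of the HTML from the question. SAMPLE and clean are names made up for this illustration:

```python
from lxml import html

# Cut-down sample: one heading, two loose text runs, one nested element.
SAMPLE = """<div id="content">
<h2>Sub-Title 1</h2>
Description 1.<br>Description 2.
<div class="infobox"><b>Information Title</b></div>
</div>"""

doc = html.fromstring(SAMPLE)


def clean(nodes):
    """Strip whitespace and drop text nodes that are empty afterwards."""
    return [n.strip() for n in nodes if n.strip()]


# First attempt: only text nodes that are direct children of the div.
child_text = clean(doc.xpath('//*[@id="content"]/text()'))

# Second attempt: only text nodes whose parent is a descendant element.
nested_text = clean(doc.xpath('//*[@id="content"]//*/text()'))

# descendant::text() covers both groups, in document order.
all_text = clean(doc.xpath('//*[@id="content"]/descendant::text()'))

print(child_text)   # ['Description 1.', 'Description 2.']
print(nested_text)  # ['Sub-Title 1', 'Information Title']
print(all_text)     # ['Sub-Title 1', 'Description 1.', 'Description 2.', 'Information Title']
```

Each attempt sees exactly the text the other one misses, which is why neither returns everything on its own.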
EDIT:
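Responding to your comment: to also exclude text that is nested deeper inside the infobox (for example inside the font and b elements), use ancestor:: instead of parent:: in the predicate: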
//div[@id = 'content']/descendant::text()[not(ancestor::div/@class='infobox')]
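The two axes behave differently as soon as the infobox contains nested markup. Here is a minimal sketch, assuming lxml is installed (a stand-in for Scrapy's response.xpath) and a shortened version of the HTML from the question. SAMPLE and texts are names made up for this illustration:

```python
from lxml import html

# Shortened sample: "Information Title" sits inside font/b, so its parent
# is a b element rather than the infobox div.
SAMPLE = """<div id="content">
<h2>Sub-Title 1</h2>
<div class="infobox">
<font style="color:#000000"><b>Information Title</b></font>
<br><br>Long Information Text
</div>
</div>"""

doc = html.fromstring(SAMPLE)


def texts(xpath):
    """Evaluate the expression and return the non-empty, stripped results."""
    return [n.strip() for n in doc.xpath(xpath) if n.strip()]


# parent:: only checks the direct parent, so the nested title slips through.
with_parent = texts(
    "//div[@id='content']/descendant::text()"
    "[not(parent::div/@class='infobox')]"
)

# ancestor:: checks every enclosing div, so all infobox text is excluded.
with_ancestor = texts(
    "//div[@id='content']/descendant::text()"
    "[not(ancestor::div/@class='infobox')]"
)

print(with_parent)    # ['Sub-Title 1', 'Information Title']
print(with_ancestor)  # ['Sub-Title 1']
```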
For your future questions, please make sure the HTML you show is _representative_ of your actual problems.