Channel: User Sefran - Stack Overflow

Build text from HTML using XPath


I receive HTML like the sample below from a server. I rebuild its text by running the XPath expression @"//text()" and appending each result's "nodeContent" value to a string, followed by a newline. The code looks something like this:

for (int i=2; i<[resultXPathQuery count]; i++) {
    [mytext appendString:[[resultXPathQuery objectAtIndex:i] objectForKey:@"nodeContent"]];
    [mytext appendString:@"\n"];
}

I obtain:

Line 1
line 2
line 3
line 4

How can I build the text so that the empty nodes are also taken into account?
I would like to obtain the following, with one empty line for each empty paragraph:

Line 1
line 2

line 3



line 4

<html><head><title>A title</title><style type="text/css">ol{margin:0;padding:0}p{margin:0}.c0{font-size:12pt;background-color:#ffffff;font-family:Times New Roman}.c6{width:432.0pt;background-color:#ffffff;padding:72.0pt 90.0pt 72.0pt 90.0pt}.c7{color:#aaaaaa;font-family:Times New Roman}.c3{color:#0000ee;text-decoration:underline}.c5{color:inherit;text-decoration:inherit}.c2{font-size:12pt;font-family:Times New Roman}.c4{height:12pt}.c1{direction:ltr}body{color:#000000;font-size:12pt;font-family:Times New Roman}h1{padding-top:12.0pt;line-height:1.0;text-align:left;color:#000000;font-size:24pt;font-family:Times New Roman;font-weight:bold;padding-bottom:12.0pt}h2{padding-top:11.25pt;line-height:1.0;text-align:left;color:#000000;font-size:18pt;font-family:Times New Roman;font-weight:bold;padding-bottom:11.25pt}h3{padding-top:12.0pt;line-height:1.0;text-align:left;color:#000000;font-size:14pt;font-family:Times New Roman;font-weight:bold;padding-bottom:12.0pt}h4{padding-top:12.75pt;line-height:1.0;text-align:left;color:#000000;font-size:12pt;font-family:Times New Roman;font-weight:bold;padding-bottom:12.75pt}h5{padding-top:12.75pt;line-height:1.0;text-align:left;color:#000000;font-size:9pt;font-family:Times New Roman;font-weight:bold;padding-bottom:12.75pt}h6{padding-top:18.0pt;line-height:1.0;text-align:left;color:#000000;font-size:8pt;font-family:Times New Roman;font-weight:bold;padding-bottom:18.0pt}</style></head><body class="c6"><p class="c1"><span class="c2">A title</span></p><p class="c1 c4"><span class="c2"></span></p><p class="c4 c1"><span class="c2"></span></p><p class="c1"><span class="c7">Line 1</span></p><p class="c1"><span class="c7">line 2</span></p><p class="c4 c1"><span class="c7"></span></p><p class="c1"><span class="c7">line 3</span></p><p class="c4 c1"><span class="c7"></span></p><p class="c4 c1"><span class="c7"></span></p><p class="c3 c2"><span class="c1"></span></p><p class="c1"><span class="c7">line 4</span></p></body></html>

EDIT

Actually, I noticed that the HTML can be more "complicated", so selecting all the span elements or all the p elements is not enough. Moreover, several span elements can appear inside the same p element, and in that case I must not start a new line in my string between them.

This is the body of a more complicated HTML response:

<body class="c13"><p class="c5"><span>gfgfgfd</span></p><p class="c1"><span></span></p><p class="c5 c10"><span>ghhgfhgfh hghg hgkfhjgk ghjgkh ghjgjhg gjhjg gjhj gjhgjhgjhg gfhjkgjg jghjgfhjgf fghfj jghfj fghjggf jhgjgjgkjg</span></p><p class="c1 c10"><span></span></p><p class="c4"><span>gfgfgfd</span></p><p class="c4"><span>f</span></p><p class="c4"><span>gfdgfdg</span><span class="c7">hg</span></p><p class="c4"><span class="c7">ghgfhgfh</span></p><p class="c4"><span class="c7">gfhgfhgf</span></p><p class="c5"><span class="c7">hgfh </span><span class="c0">gfdgfg</span></p><p class="c5"><span class="c0">fgfdgfdgfd</span></p><p class="c5"><span class="c0">gdfgdfgfd</span></p><p class="c5"><span class="c0">gfgf</span></p><p class="c1"><span class="c0"></span></p><p class="c5"><span class="c0 c8"><a class="c12" href="http://www.google.com">www.google.com</a></span></p><p class="c1"><span class="c0"></span></p><p class="c5"><span class="c0">fgfdgfdg</span></p><p class="c5"><span class="c0">fgffgfdgfg</span><span class="c0 c11">gfgfdgfd fgd fd</span><span class="c0">fdgfdg</span></p><p class="c5"><span class="c0">fgfdgfdgf</span></p><p class="c5"><span class="c0">gfd</span></p><p class="c5"><span class="c0">gfgf</span></p><p class="c1"><span class="c0"></span></p><p class="c5"><span class="c0 c8"><a class="c12" href="mailto:….">...</a></span></p><p class="c1"><span class="c0"></span></p><ol class="c9" start="1"><li class="c3"><span class="c0">gfgfd</span></li><li class="c3"><span class="c0">gfdgfd</span></li><li class="c3"><span class="c0">gfdgfd</span></li><li class="c3"><span class="c0">gdfgfd</span></li></ol><p class="c1"><span class="c0"></span></p><p class="c5"><span class="c0">hgfhgf</span></p><p class="c5"><span class="c0">gfhgfh</span></p><p class="c5"><span class="c0">hgfhgf</span></p><p class="c1"><span class="c0"></span></p><ol class="c2" start="1"><li class="c3"><span class="c0">gfhg</span></li><li class="c3"><span class="c0">hgfh</span></li><li 
class="c3"><span class="c0">hgf</span></li></ol><p class="c1"><span class="c0"></span></p><h1 class="c5 c15"><a name="h.kafwflosthlg"></a><span class="c7 c14">hgfhgfh</span></h1><p class="c1"><span class="c6"></span></p><p class="c1"><span class="c6"></span></p><p class="c1"><span class="c6"></span></p></body>

I need an XPath expression that selects the p, h1, h2, ..., h6 and li elements and takes their inner text into account, so that both line breaks and empty lines are detected correctly.
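
To make the intended behaviour concrete, here is a rough sketch of the joining step I have in mind. It is only an illustration under assumptions. The union expression in kBlockXPath is one possible way to select the block elements and is not run by the demo. I am also assuming that each query result is a dictionary whose text sits under "nodeContent" and whose children sit under a "nodeChildArray" key. That second key, the helper names AppendNodeText and TextFromBlocks, and the hand-built sample data are all hypothetical. All text below one block element goes on the same line, and a block element with no text yields an empty line.

```objc
#import <Foundation/Foundation.h>
#include <stdio.h>

// One possible XPath 1.0 union for the block elements. ASSUMPTION: untested
// against the real parser; printed for reference only, the demo does not run it.
static NSString * const kBlockXPath =
    @"//p | //h1 | //h2 | //h3 | //h4 | //h5 | //h6 | //li";

// Recursively appends all text found under one result node.
// ASSUMPTION: a node is a dictionary with an optional "nodeContent" string and
// an optional "nodeChildArray" holding child dictionaries of the same shape.
static void AppendNodeText(NSDictionary *node, NSMutableString *out) {
    NSString *content = [node objectForKey:@"nodeContent"];
    if (content != nil) {
        [out appendString:content];
    }
    // Enumerating a nil child array simply performs no iterations.
    for (NSDictionary *child in [node objectForKey:@"nodeChildArray"]) {
        AppendNodeText(child, out);
    }
}

// Emits exactly one line per block element, so an empty block gives an empty line.
static NSString *TextFromBlocks(NSArray *blocks) {
    NSMutableString *text = [NSMutableString string];
    for (NSDictionary *block in blocks) {
        AppendNodeText(block, text);
        [text appendString:@"\n"];
    }
    return text;
}

int main(void) {
    @autoreleasepool {
        // Hand-built stand-ins for three p results: one with two spans,
        // one with an empty span, one with a single span.
        NSArray *blocks = @[
            @{ @"nodeName": @"p", @"nodeChildArray": @[
                @{ @"nodeName": @"span", @"nodeContent": @"gfdgfdg" },
                @{ @"nodeName": @"span", @"nodeContent": @"hg" } ] },
            @{ @"nodeName": @"p", @"nodeChildArray": @[
                @{ @"nodeName": @"span" } ] },
            @{ @"nodeName": @"p", @"nodeChildArray": @[
                @{ @"nodeName": @"span", @"nodeContent": @"ghgfhgfh" } ] }
        ];
        printf("XPath: %s\n", [kBlockXPath UTF8String]);
        printf("%s", [TextFromBlocks(blocks) UTF8String]);
    }
    return 0;
}
```

With the three sample blocks, the two spans of the first paragraph end up on the same line, followed by an empty line for the empty paragraph and then the text of the last paragraph. Whether the real query results have this shape is exactly the part I am unsure about.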

