findElement

Find elements in HTML tree

Syntax

subtrees = findElement(tree,selector)

Description

subtrees = findElement(tree,selector) returns the elements of the HTML tree tree that match the CSS selector specified by selector.

Examples

Find Elements in HTML Tree

Read HTML code from the URL https://www.mathworks.com/help/textanalytics using the webread function.

url = "https://www.mathworks.com/help/textanalytics";
code = webread(url);

Parse the HTML code using htmlTree.

tree = htmlTree(code);

Find all the hyperlinks in the HTML tree using findElement. The hyperlinks are nodes with element name "A".

selector = "A";
subtrees = findElement(tree,selector);

View the first 10 subtrees.

subtrees(1:10)

ans = 
  10×1 htmlTree:

    <A class="svg_link navbar-brand" href="https://www.mathworks.com?s_tid=gn_logo"><IMG alt="MathWorks" class="mw_logo" src="/images/responsive/global/pic-header-mathworks-logo.svg"/></A>
    <A href="https://www.mathworks.com/products.html?s_tid=gn_ps">Products</A>
    <A href="https://www.mathworks.com/solutions.html?s_tid=gn_sol">Solutions</A>
    <A href="https://www.mathworks.com/academia.html?s_tid=gn_acad">Academia</A>
    <A href="https://www.mathworks.com/support.html?s_tid=gn_supp">Support</A>
    <A href="https://www.mathworks.com/matlabcentral/?s_tid=gn_mlc">Community</A>
    <A href="https://www.mathworks.com/company/events.html?s_tid=gn_ev">Events</A>
    <A href="https://www.mathworks.com/company/aboutus/contact_us.html?s_tid=gn_cntus">Contact Us</A>
    <A href="https://www.mathworks.com/products/get-matlab.html?s_tid=gn_getml">Get MATLAB</A>
    <A class="svg_link pull-left" href="https://www.mathworks.com?s_tid=gn_logo"><IMG alt="MathWorks" class="mw_logo" src="/images/responsive/global/pic-header-mathworks-logo.svg"/></A>

Extract the text from the subtrees using extractHTMLText. The result contains the link text from each link on the page.

str = extractHTMLText(subtrees);
str(1:10)

ans = 10×1 string
    ""
    "Products"
    "Solutions"
    "Academia"
    "Support"
    "Community"
    "Events"
    "Contact Us"
    "Get MATLAB"
    ""
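The link targets can be read in the same way as the link text, using getAttribute (described under HTML Elements). The following is a minimal sketch, not part of the original example: it assumes a small, made-up inline HTML snippet in place of the downloaded page so that it runs without network access.

```matlab
% Hypothetical inline HTML used in place of the downloaded page.
code = "<html><body>" + ...
    "<a href=""https://www.mathworks.com/products.html"">Products</a>" + ...
    "<a href=""https://www.mathworks.com/support.html"">Support</a>" + ...
    "</body></html>";
tree = htmlTree(code);

% Find the hyperlink elements, then read the link text and link targets.
subtrees = findElement(tree,"a");
linkText = extractHTMLText(subtrees);
linkTarget = getAttribute(subtrees,"href");

% Show the text and the target of each link side by side.
disp([linkText(:) linkTarget(:)])
```

Keeping the text and the href values in parallel arrays makes it straightforward to filter links by either one afterwards.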

Input Arguments

`tree` — HTML tree
scalar `htmlTree` object

HTML tree, specified as a scalar htmlTree object.

`selector` — CSS selector
string scalar | character vector

CSS selector, specified as a string scalar or a character vector. For more information, see CSS Selectors.
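Because selectors often contain double quotes, the two accepted input types are written slightly differently in MATLAB. The sketch below illustrates this with a made-up inline HTML snippet (the file names are hypothetical): inside a string scalar the double quotes are doubled, while inside a character vector they appear as-is.

```matlab
% Hypothetical inline HTML with one PDF link and one regular link.
code = "<html><body>" + ...
    "<a href=""report.pdf"">Report</a>" + ...
    "<a href=""index.html"">Home</a>" + ...
    "</body></html>";
tree = htmlTree(code);

% String scalar: double quotes inside the selector are doubled.
selStr = "a[href$="".pdf""]";
% Character vector: double quotes can be written directly.
selChar = 'a[href$=".pdf"]';

fromStr = findElement(tree,selStr);
fromChar = findElement(tree,selChar);

% Both forms spell the same selector, so the extracted text should agree.
sameResult = isequal(extractHTMLText(fromStr),extractHTMLText(fromChar))
```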

Output Arguments

`subtrees` — Matching HTML subtrees
`htmlTree` array

Matching HTML subtrees, returned as an htmlTree array.

More About

HTML Elements

A typical HTML element contains the following components:

Element name – Name of the HTML tag. The element name corresponds to the Name property of the HTML tree.
Attributes – Additional information about the tag. HTML attributes have the form name="value", where name and value denote the attribute name and value respectively. The attributes appear inside the opening HTML tag. To get the attribute values from an HTML tree, use getAttribute.
Content – Element content. The content appears between opening and closing HTML tags. The content can be text data or nested HTML elements. To extract the text from an htmlTree object, use extractHTMLText. To get the nested HTML elements of an htmlTree object, use the Children property.

For example, the HTML element <a href="https://www.mathworks.com">Home</a> comprises the following components:

Component		Value	Description
Element name		`a`	Element is a hyperlink
Attribute	Attribute name	`href`	Hyperlink reference
Attribute	Attribute value	`"https://www.mathworks.com"`	Hyperlink reference value
Content		`Home`	Text to display
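The components in the table can be read back from a parsed tree. The following sketch assumes that htmlTree may wrap the fragment in a full document, so it first locates the hyperlink with findElement; the variable names are chosen for illustration only.

```matlab
% Parse the example element and locate the hyperlink node.
tree = htmlTree("<a href=""https://www.mathworks.com"">Home</a>");
link = findElement(tree,"a");
link = link(1);                      % work with a scalar htmlTree

name = link.Name                     % element name
href = getAttribute(link,"href")     % attribute value
txt = extractHTMLText(link)          % text content
kids = link.Children                 % nested nodes, if any
```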

CSS Selectors

CSS selectors specify patterns to match elements in a tree.

This table shows examples of CSS selectors for extracting different HTML elements from an HTML tree:

Task	CSS Selector	Example
Find all paragraph (`<p>`) elements.	`"p"`	`findElement(tree,"p")`
Find all paragraph (`<p>`) and list item (`<li>`) elements.	`"p,li"`	`findElement(tree,"p,li")`
Find all paragraph (`<p>`) elements that are inside table (`<table>`) elements.	`"table p"`	`findElement(tree,"table p")`
Find all hyperlink (`<a>`) elements with hyperlink reference attribute (`href`) values ending with `".pdf"`.	`"a[href$="".pdf""]"`	`findElement(tree,"a[href$="".pdf""]")`
Find all paragraph (`<p>`) elements that are the first child of their parent.	`"p:first-child"`	`findElement(tree,"p:first-child")`
Find all paragraph (`<p>`) elements that are the first paragraph element of their parent.	`"p:first-of-type"`	`findElement(tree,"p:first-of-type")`
Find all emphasis (`<em>`) elements where the parent is a paragraph (`<p>`) element.	`"p > em"`	`findElement(tree,"p > em")`
Find all paragraph (`<p>`) elements appearing immediately after a heading 1 (`<h1>`) element.	`"h1 + p"`	`findElement(tree,"h1 + p")`
Find all empty elements.	`":empty"`	`findElement(tree,":empty")`
Find all nonempty label (`<label>`) elements.	`"label:not(:empty)"`	`findElement(tree,"label:not(:empty)")`

The findElement function supports all CSS level 3 selectors except ":lang", ":checked", ":link", ":active", ":hover", ":focus", ":target", ":enabled", and ":disabled".

For more information about CSS selectors, see [1].
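To try several of the selectors from the table, a short loop over a small document can help. This is a sketch under stated assumptions: the HTML content is made up for illustration, and the loop only reports how many elements each selector matches.

```matlab
% Hypothetical HTML document for trying out selectors.
code = "<html><body>" + ...
    "<h1>Title</h1>" + ...
    "<p>First <em>paragraph</em></p>" + ...
    "<p>Second paragraph</p>" + ...
    "<ul><li>Item</li></ul>" + ...
    "<table><tr><td><p>Cell paragraph</p></td></tr></table>" + ...
    "</body></html>";
tree = htmlTree(code);

% Apply each selector in turn and report the number of matches.
selectors = ["p" "p,li" "table p" "h1 + p" "p > em"];
for s = selectors
    matches = findElement(tree,s);
    fprintf("%-10s matched %d element(s)\n",s,numel(matches));
end
```

Comparing the counts for "p" and "table p" shows how a descendant selector narrows the result to paragraphs nested inside a table.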

References

[1] CSS Selector Reference. https://www.w3schools.com/cssref/css_selectors.asp
