[Solved] Using lxml to filter out HTML elements whose class or id matches a regex

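I ran into this while reading the readability-lxml source: some elements that should have been removed by remove_unlikely_candidates(doc) still showed up in the extracted article text, and I could not see why.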

The source looks like this:

def remove_unlikely_candidates(doc):
    for elem in doc.iter():
        s = "%s %s" % (elem.get('class', ''), elem.get('id', ''))
        #log.debug(s)
        if (REGEXES['unlikelyCandidatesRe'].search(s) and
                (not REGEXES['okMaybeItsACandidateRe'].search(s)) and
                elem.tag != 'body' and
                elem.getparent() is not None
                ):
            # log.debug("Removing unlikely candidate - %s" % describe(elem))
            elem.drop_tree()

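I assumed this would be simple, but it gets complicated with special cases such as <div class="unwanted">This <p>content</p> should go</div>, where " should go" is tail text (the text that follows the closing </p>), or with several levels of nested elements. The traversal seemed to break off partway through. Here is a simplified version I wrote as a test: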

from lxml.html import fromstring, tostring
import re

html = """
<html>
<head>
</head>

<body>
    <div>
        <div class="unwanted">This <p>content</p> should go</div>
        <p class="fine">This content should stay</p>
        What
    </div>

    <div id = "second" class="unwanted">
        <p class = "alreadydead">This content should not be looked at</p>
        <p class = "alreadydead">Nor should this</>
        <div class="alreadydead">
            <p class="alreadydead">Still dead</p>
        </div>
    </div>

    <div>
        <p class="yeswanted">This content should also stay</p>
    </div>

    <div id="unwanted">This <p>content</p> should go</div>
</body>
"""
doc = fromstring(html)

My first attempt was:

for elem in doc.iter():
    s = "%s %s" % (elem.get('class', ''), elem.get('id', ''))
    if re.compile('unwanted').search(s) and elem.tag != 'body' and elem.getparent() is not None:
        elem.drop_tree()
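# encoding='unicode' returns str instead of bytes, so print() shows real line breaks
print(tostring(doc, pretty_print=True, encoding='unicode'))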

Output:

<html>
<head></head>
<body>
    <div><p class="fine">This content should stay</p> What</div>
    <div id="second" class="unwanted">
        <p class="alreadydead">This content should not be looked at</p>
        <p class="alreadydead">Nor should this&gt;</p>
        <div class="alreadydead"><p class="alreadydead">Still dead</p></div>
    </div>
    <div><p class="yeswanted">This content should also stay</p></div>
    <div id="unwanted">This <p>content</p>should go </div>
</body>
</html>

Clearly, most of what should have been removed is still there: only the first unwanted div is gone.

I suspected the cause was drop_tree() not removing the tail text, so I switched to elem.getparent().remove(elem), but the result was exactly the same...
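
For reference, the difference between the two removal calls can be sketched on a tiny fragment. This is only an illustration; SNIPPET and remove_span are made-up names, not part of readability-lxml. As far as I know, drop_tree() merges the tail text into the surrounding text, while getparent().remove() takes the tail text away together with the element.

```python
from lxml.html import fromstring, tostring

# Hypothetical fragment: " tail text" is the tail of the <span>
SNIPPET = '<div><span class="x">inner</span> tail text<p>next</p></div>'

def remove_span(use_drop_tree):
    """Remove the <span> from SNIPPET and return the serialized result."""
    root = fromstring(SNIPPET)          # a single-element fragment parses to that <div>
    span = root.find('span')
    if use_drop_tree:
        span.drop_tree()                # tail text is merged into the surrounding text
    else:
        span.getparent().remove(span)   # tail text leaves together with the element
    return tostring(root, encoding='unicode')

with_drop_tree = remove_span(True)
with_remove = remove_span(False)
print(with_drop_tree)
print(with_remove)
```

Neither call changes how the iterator behaves, which fits the observation that swapping one for the other made no difference here.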

for i, elem in enumerate(doc.iter()):
    print(i, elem)
    s = "%s %s" % (elem.get('class', ''), elem.get('id', ''))
    if re.compile('unwanted').search(s) and elem.tag != 'body' and elem.getparent() is not None:
        elem.getparent().remove(elem)

Printing each elem shows that only six elements (indices 0 to 5) were visited:

0 <Element html at 0x2d5baf0>
1 <Element head at 0x314f200>
2 <Element body at 0x3376780>
3 <Element div at 0x33767d8>
4 <Element div at 0x314f200>
5 <Element p at 0x3376780>

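My guess is that iter() follows the iterator pattern, so it should have a next() method, and that removing a node also cuts the links the iterator relies on to find the node after it. That would explain why only the first part of the tree was traversed: item 5 above is presumably the <p> inside the div that had just been removed, so the iterator walked into the detached subtree and stopped when it ran out. I still do not know the exact implementation details.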
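
The early stop can be reproduced on a much smaller tree. The sketch below uses a made-up five-element XML document (the tag names are arbitrary) and removes an element that has a child while iter() is still running. On the lxml versions I am aware of, the loop never reaches the later siblings, which matches what happened above.

```python
from lxml import etree

root = etree.fromstring('<r><a><a1/></a><b/><c/></r>')

# Reference order when nothing is modified: r, a, a1, b, c
all_tags = [el.tag for el in root.iter()]

visited = []
for el in root.iter():
    visited.append(el.tag)
    if el.tag == 'a':
        root.remove(el)     # detach an element that has children, mid-iteration

print(all_tags)
print(visited)              # shorter than all_tags: the siblings after <a> are missed
```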

Running

allelem = doc.iter()
print(dir(allelem))

prints

['__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__lt__', '__ne__', '__new__', '__next__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']

__next__ is indeed there, so I ended up with the following code:

allelem = doc.iter()
for elem in allelem:
    s = "%s %s" % (elem.get('class', ''), elem.get('id', ''))
    if re.compile('unwanted').search(s) and elem.tag != 'body' and elem.getparent() is not None:
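        # Skip the descendants of elem first, so the iterator is already past
        # the subtree when it gets detached; next(it) calls it.__next__()
        for _ in range(len(elem.findall('.//*'))):
            next(allelem)
        elem.getparent().remove(elem)
print(tostring(doc, pretty_print=True, encoding='unicode'))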

This finally produced the HTML I wanted:

<html>
<head></head>
<body>
    <div>
        <p class="fine">This content should stay</p>What</div>
    <div>
        <p class="yeswanted">This content should also stay</p>
    </div>
</body>
</html>
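
The skip-then-remove idea can be checked in isolation. Below is a minimal sketch on a made-up XML tree; remove_while_iterating and should_remove are hypothetical names. It counts descendants with iterdescendants() rather than findall('.//*'), on the assumption that iterdescendants() yields the same kinds of nodes as iter() (comments included), so the count and the number of skipped items stay in step.

```python
from lxml import etree

def remove_while_iterating(root, should_remove):
    """Remove matching elements during one iter() pass; return the visited tags."""
    it = root.iter()
    visited = []
    for el in it:
        visited.append(el.tag)
        if el is not root and should_remove(el):
            # Consume every descendant so the iterator's position moves
            # past the subtree before the subtree is detached.
            for _ in range(len(list(el.iterdescendants()))):
                next(it)
            el.getparent().remove(el)
    return visited

root = etree.fromstring('<r><a><a1/><a2/></a><b/><c><a/></c></r>')
visited = remove_while_iterating(root, lambda el: el.tag == 'a')
result = etree.tostring(root, encoding='unicode')
print(visited)
print(result)
```

The descendants of a removed element never reach the loop body, so they are not tested against the predicate, which is the behaviour wanted here.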

There is another way, though it is not as fast as the first:

for elem in doc.iter():
    s = "%s %s" % (elem.get('class', ''), elem.get('id', ''))
    if re.compile('unwanted').search(s) and elem.tag != 'body' and elem.getparent() is not None:
        elem.set('class', 'removethis')
for elem in doc.find_class('removethis'):
    elem.getparent().remove(elem)
print(tostring(doc, pretty_print=True, encoding='unicode'))

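This second approach is actually the one I thought of first; the first one is what I pieced together after reading Stack Overflow. The asker's code there called the iterator's next() method directly, and since I found no such method here I did not use it at first. Only after checking with dir() did I see that there is a __next__() method: .next() is the Python 2 spelling, Python 3 renamed it to __next__(), and the built-in next(iterator) works in both. The official documentation is odd in quite a few places, which was frustrating.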
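
A third option, not from the original code or the Stack Overflow answer, is to copy the iterator into a list before touching the tree, so that removals cannot disturb the traversal. The sketch below assumes a made-up helper name (remove_matching) and a small sample document of my own; it is an illustration, not a benchmark against the two approaches above.

```python
import re
from lxml.html import fromstring, tostring

UNWANTED = re.compile('unwanted')   # compiled once instead of on every loop pass

def remove_matching(doc, pattern=UNWANTED):
    """Remove every element whose class or id matches pattern."""
    # list(...) takes a snapshot first, so later removals cannot cut the loop short
    for elem in list(doc.iter()):
        if not isinstance(elem.tag, str):
            continue                # comments and processing instructions
        s = "%s %s" % (elem.get('class', ''), elem.get('id', ''))
        if pattern.search(s) and elem.tag != 'body' and elem.getparent() is not None:
            elem.getparent().remove(elem)
    return doc

sample = """<html><body>
<div><div class="unwanted">This <p>content</p> should go</div><p class="fine">stay</p></div>
<div id="second" class="unwanted"><p class="unwanted">nested</p></div>
<div><p class="yeswanted">also stay</p></div>
<div id="unwanted">gone</div>
</body></html>"""

cleaned = tostring(remove_matching(fromstring(sample)), encoding='unicode')
print(cleaned)
```

Elements inside an already removed subtree are still visited, but removing them again from their detached parent is harmless.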

In the end I changed the source to:

def remove_unlikely_candidates(doc):
    allelem = doc.iter()
    for elem in allelem:
        s = "%s %s" % (elem.get('class', ''), elem.get('id', ''))
        if (REGEXES['unlikelyCandidatesRe'].search(s) and
                (not REGEXES['okMaybeItsACandidateRe'].search(s)) and
                elem.tag != 'body' and
                elem.getparent() is not None):
            # Advance past elem's descendants before detaching it
            for _ in range(len(elem.findall('.//*'))):
                next(allelem)
            elem.getparent().remove(elem)
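
To try the modified function outside readability-lxml, something like the following should do. The two patterns in REGEXES are shortened stand-ins I made up for this test, not the real expressions from the library, and the sample HTML is likewise invented.

```python
import re
from lxml.html import fromstring, tostring

# Stand-in patterns for illustration only; the library's own lists are longer
REGEXES = {
    'unlikelyCandidatesRe': re.compile('sidebar|foot|comment', re.I),
    'okMaybeItsACandidateRe': re.compile('article|main', re.I),
}

def remove_unlikely_candidates(doc):
    allelem = doc.iter()
    for elem in allelem:
        s = "%s %s" % (elem.get('class', ''), elem.get('id', ''))
        if (REGEXES['unlikelyCandidatesRe'].search(s) and
                (not REGEXES['okMaybeItsACandidateRe'].search(s)) and
                elem.tag != 'body' and
                elem.getparent() is not None):
            # Advance past elem's descendants before detaching it
            for _ in range(len(elem.findall('.//*'))):
                next(allelem)
            elem.getparent().remove(elem)

doc = fromstring(
    '<html><body>'
    '<div class="sidebar"><p>ads</p><p>more</p></div>'
    '<div class="article-sidebar">keep me</div>'
    '<div id="footer"><span>bye</span></div>'
    '<p>main text</p>'
    '</body></html>'
)
remove_unlikely_candidates(doc)
cleaned = tostring(doc, encoding='unicode')
print(cleaned)
```

The div with class "article-sidebar" survives because the second pattern rescues it, while the plain sidebar and the footer are removed along with their children.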

That finally settled a problem I had been stuck on for a long time. I had assumed the source itself could not be wrong...

References:

Stack Overflow: Editing tree in place while iterating in lxml